linux

Author	SHA1	Message	Date
Zhi Yong Wu	0a94da24b9	xfs: fix the comment of xlog_find_head() Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-20 15:34:34 -05:00
Zhi Yong Wu	34be5ff378	xfs: fix the comment of xlog_recover_buffer_pass2() Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-20 15:30:53 -05:00
Zhi Yong Wu	5c75390924	xfs: remove two unused macro definitions in xfs_linux.h Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-20 15:30:23 -05:00
Zhi Yong Wu	1cb9386354	xfs: fix the comment of xfs_btree_get_iroot() Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-20 15:17:35 -05:00
Zhi Yong Wu	f6c2734924	xfs: fix the comment of xfs_iroot_realloc() Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-20 15:17:11 -05:00
Zhi Yong Wu	7c3e664051	xfs: remove one blank line in xfs_btree_make_block_unfull() Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-20 15:16:07 -05:00
Zhi Yong Wu	ac0e300fa5	xfs: fix the comment of xlog_write_setup_copy() Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-20 14:59:31 -05:00
Zhi Yong Wu	99e738b783	xfs: fix the comment of xfs_mod_incore_sb_unlocked() Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-20 14:59:05 -05:00
Zhi Yong Wu	49d3da149c	xfs: fix the comment of xfs_btree_lookup() Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-20 14:45:23 -05:00
Zhi Yong Wu	b46fe8259b	xfs: fix the comment of xfs_buf_free() Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-20 14:44:39 -05:00
Zhi Yong Wu	0471f62e38	xfs: fix the comment of xfs_check_sizes() Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-20 14:42:16 -05:00
Dave Chinner	2ad01f53dc	xfs: use reference counts to free clean buffer items When a transaction is cancelled and the buffer log item is clean in the transaction, the buffer log item is unconditionally freed. If the log item is in the AIL, however, this leads to a use after free condition as the item still has other users. In this case, xfs_buf_item_relse() should only be called on clean buffer items if the reference count has dropped to zero. This ensures only the last user frees the item. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-15 16:42:29 -05:00
Dwight Engen	8c567a7fab	xfs: add capability check to free eofblocks ioctl Check for CAP_SYS_ADMIN since the caller can truncate preallocated blocks from files they do not own nor have write access to. A more fine grained access check was considered: require the caller to specify their own uid/gid and to use inode_permission to check for write, but this would not catch the case of an inode not reachable via path traversal from the callers mount namespace. Add check for read-only filesystem to free eofblocks ioctl. Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-15 14:25:01 -05:00
Dwight Engen	b9fe505258	xfs: create internal eofblocks structure with kuid_t types Have eofblocks ioctl convert uid_t to kuid_t into internal structure. Update internal filter matching to compare ids with kuid_t types. Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-15 14:24:10 -05:00
Dwight Engen	7aab1b2887	xfs: convert kuid_t to/from uid_t for internal structures Use uint32 from init_user_ns for xfs internal uid/gid representation in xfs_icdinode, xfs_dqid_t. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-15 14:22:40 -05:00
Dwight Engen	fd5e2aa865	xfs: ioctl check for capabilities in the current user namespace Use inode_capable() to check if SUID\|SGID bits should be cleared to match similar check in inode_change_ok(). The check for CAP_LINUX_IMMUTABLE was not modified since all other file systems also check against init_user_ns rather than current_user_ns. Only allow changing of projid from init_user_ns. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-15 14:19:25 -05:00
Dwight Engen	288bbe0eeb	xfs: convert kuid_t to/from uid_t in ACLs Change permission check for setting ACL to use inode_owner_or_capable() which will additionally allow a CAP_FOWNER user in a user namespace to be able to set an ACL on an inode covered by the user namespace mapping. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-15 14:18:31 -05:00
Dwight Engen	c5eeb7ec3e	xfs: create wrappers for converting kuid_t to/from uid_t Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-15 14:17:34 -05:00
Dave Chinner	4bb928cdb9	xfs: split the CIL lock The xc_cil_lock is used for two purposes - to protect the CIL itself, and to protect the push/commit state and lists. These are two logically separate structures and operations, so can have their own locks. This means that pushing on the CIL and the commit wait ordering won't contend for a lock with other transactions that are completing concurrently. As the CIL insertion is the hottest path throught eh CIL, this is a big win. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-13 16:21:21 -05:00
Dave Chinner	991aaf65ff	xfs: Combine CIL insert and prepare passes Now that all the log item preparation and formatting is done under the CIL lock, we can get rid of the intermediate log vector chain used to track items to be inserted into the CIL. We can already find all the items to be committed from the transaction handle, so as long as we attach the log vectors to the item before we insert the items into the CIL, we don't need to create a log vector chain to pass around. This means we can move all the item insertion code into and optimise it into a pair of simple passes across all the items in the transaction. The first pass does the formatting and accounting, the second inserts them all into the CIL. We keep this two pass split so that we can separate the CIL insertion - which must be done under the CIL spinlock - from the formatting. We could insert each item into the CIL with a single pass, but that massively increases the number of times we have to grab the CIL spinlock. It is much more efficient (and hence scalable) to do a batch operation and insert all objects in a single lock grab. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-13 16:20:09 -05:00
Dave Chinner	f5baac354d	xfs: avoid CIL allocation during insert Now that we have the size of the log vector that has been allocated, we can determine if we need to allocate a new log vector for formatting and insertion. We only need to allocate a new vector if it won't fit into the existing buffer. However, we need to hold the CIL context lock while we do this so that we can't race with a push draining the currently queued log vectors. It is safe to do this as long as we do GFP_NOFS allocation to avoid avoid memory allocation recursing into the filesystem. Hence we can safely overwrite the existing log vector on the CIL if it is large enough to hold all the dirty regions of the current item. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-13 16:19:03 -05:00
Dave Chinner	7492c5b42d	xfs: Reduce allocations during CIL insertion Now that we have the size of the object before the formatting pass is called, we can allocation the log vector and it's buffer in a single allocation rather than two separate allocations. Store the size of the allocated buffer in the log vector so that we potentially avoid allocation for future modifications of the object. While touching this code, remove the IOP_FORMAT definition. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-13 16:12:30 -05:00
Dave Chinner	166d13688a	xfs: return log item size in IOP_SIZE To begin optimising the CIL commit process, we need to have IOP_SIZE return both the number of vectors and the size of the data pointed to by the vectors. This enables us to calculate the size ofthe memory allocation needed before the formatting step and reduces the number of memory allocations per item by one. While there, kill the IOP_SIZE macro. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-13 16:10:21 -05:00
Eric Sandeen	050a1952c3	xfs:free bp in xlog_find_tail() error path xlog_find_tail() currently leaks a bp on one error path. There is no error target, so manually free the bp before returning the error. Found by Coverity. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-13 15:49:51 -05:00
Eric Sandeen	5d0a654974	xfs: free bp in xlog_find_zeroed() error path xlog_find_zeroed() currently leaks a bp on one error path. Using the bp_err: target resolves this. Found by Coverity. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-13 15:48:51 -05:00
Eric Sandeen	6dd93e9e5e	xfs: avoid double-free in xfs_attr_node_addname xfs_attr_node_addname()'s error handling tests whether it should free "state" in the out: error handling label: out: if (state) xfs_da_state_free(state); but an earlier free doesn't set state to NULL afterwards; this could lead to a double free. Fix it by setting state to NULL after it's freed. This was found by Coverity. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-13 15:48:01 -05:00
Jie Liu	2c2bcc0735	xfs: call roundup_64() to calculate the min_logblks Replace roundup() with roundup_64() as we calculate min_logblks with 64-bit divisions. Hence, call roundup() will cause the following error while compiling a 32-bit kernel: fs/built-in.o: In function `xfs_log_calc_minimum_size': fs/xfs/xfs_log_rlimit.c:140: undefined reference to `__udivdi3' Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-13 14:19:11 -05:00
Jie Liu	3e7b91cf8c	xfs: Validate log space at mount time Validate log space during log mount stage, the underlying function will drop a warning message via syslog in critical level if the log space is too small or too large. [ dchinner: For CRC enable filesystems, abort the mounting of the filesystem as mkfs should never make a log too small for the given filesystem configuration. ] [ dchinner: make a note of the fact that the log size limits in block counts are in units of filesystem blocks, not basic blocks. ] Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 17:50:35 -05:00
Jie Liu	5a96a94547	xfs: Add xfs_log_rlimit.c Add source files for xfs_log_rlimit.c The new file is used for log size calculations and validation shared with userspace. [dchinner: xfs_log_calc_max_attrsetm_res() does not modify the tr_attrsetm reservation, just calculates the maximum. ] [dchinner: rework loop in xfs_log_get_max_trans_res() ] Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 17:49:38 -05:00
Jie Liu	e773fc934f	xfs: Refactor xfs_ticket_alloc() to extract a new helper Refactor xlog_ticket_alloc() to extract a new helper, i.e. xfs_log_calc_unit_res(). This helper would be used to calculate the total log reservation size by adding extra log operation/transation headers for a new log ticket. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 17:49:02 -05:00
Jie Liu	f749278c5a	xfs: Get rid of all XFS_XXX_LOG_RES() macro Get rid of all XFS_XXX_LOG_RES() macros since they are obsoleted now. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 17:48:08 -05:00
Jie Liu	3d3c8b5222	xfs: refactor xfs_trans_reserve() interface With the new xfs_trans_res structure has been introduced, the log reservation size, log count as well as log flags are pre-initialized at mount time. So it's time to refine xfs_trans_reserve() interface to be more neat. Also, introduce a new helper M_RES() to return a pointer to the mp->m_resv structure to simplify the input. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 17:47:34 -05:00
Jie Liu	783cb6d172	xfs: Make writeid transaction use tr_writeid tr_writeid is defined at mp->m_resv structure, however, it does not really being used when it should be.. This patch changes it to tr_writeid to fetch the correct log reservation size. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 17:46:59 -05:00
Jie Liu	20996c9326	xfs: Introduce tr_fsyncts to m_reservation A preparation step. For now fsync_ts transaction use the pre-calculated log reservation size of tr_swrite. This patch introduce a new item tr_fsyncts to mp->m_reservations structure so that we can fetch the log reservation value for it in a same manner to others. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 17:46:29 -05:00
Jie Liu	0eadd10288	xfs: Introduce a new structure to hold transaction reservation items Introduce a new structure xfs_trans_res to hold transaction reservation item info per log ticket. We also need to improve xfs_trans_resv_calc() by initializing the log count as well as log flags for permanent log reservation. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 17:45:49 -05:00
Dave Chinner	9356fe22af	xfs: make struct xfs_perag kernel only The struct xfs_perag has many kernel-only definitions in it, requiring a __KERNEL__ guard so userspace can use it to. Move it to xfs_mount.h so that it it kernel-only, and let userspace redefine it's own version of the structure containing only what it needs. This gets rid of another __KERNEL__ check in the XFS header files. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 17:44:36 -05:00
Dave Chinner	4f3d71f68b	xfs: move kernel specific type definitions to xfs.h xfs_types.h is shared with userspace, so having kernel specific types defined in it is problematic. Move all the kernel specific defines to xfs_linux.h so we can remove the __KERNEL__ guards from xfs_types.h. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 17:04:08 -05:00
Dave Chinner	9b90b0d9da	xfs: xfs_filestreams.h doesn't need __KERNEL__ Because it is only used within the kernel. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 17:00:11 -05:00
Dave Chinner	cb9eabff58	xfs: remove __KERNEL__ check from xfs_dir2_leaf.c It's actually an ifndef section, which means it is only included in userspace. however, it's deep within the libxfs code, so it's unlikely that the condition checked in userspace can actually occur (search an empty leaf) through the libxfs interfaces. i.e. if it can happen in usrspace, it can happen in the kernel, so remove it from userspace too.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:59:14 -05:00
Dave Chinner	b49a0c1883	xfs: remove __KERNEL__ from debug code There is no reason the remaining kernel-only debug code needs to remain kernel-only. Kill the __KERNEL__ part of the defines, and let userspace handle the debug code appropriately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:58:37 -05:00
Dave Chinner	63d20d6e36	xfs: kill __KERNEL__ check for debug code in allocation code Userspace running debug builds is relatively rare, so there's need to special case the allocation algorithm code coverage debug switch. As it is, userspace defines random numbers to 0, so invert the logic of the switch so it is effectively a no-op in userspace. This kills another couple of __KERNEL__ users. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:57:51 -05:00
Dave Chinner	94b406091b	xfs: don't special case shared superblock mounts Neither kernel or userspace support shared read-only mounts, so don't bother special casing the support check to be different between kernel and userspace. The same check can be used as neither like it... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:57:16 -05:00
Dave Chinner	a133d952b4	xfs: consolidate extent swap code So we don't need xfs_dfrag.h in userspace anymore, move the extent swap ioctl structure definition to xfs_fs.h where most of the other ioctl structure definitions are. Now that we don't need separate files for extent swapping, separate the basic file descriptor checking code to xfs_ioctl.c, and the code that does the extent swap operation to xfs_bmap_util.c. This cleanly separates the user interface code from the physical mechanism used to do the extent swap. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:56:06 -05:00
Dave Chinner	e546cb79ef	xfs: consolidate xfs_utils.c There are a few small helper functions in xfs_util, all related to xfs_inode modifications. Move them all to xfs_inode.c so all xfs_inode operations are consiolidated in the one place. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:55:17 -05:00
Dave Chinner	f6bba2017a	xfs: consolidate xfs_rename.c Move the rename code to xfs_inode.c to continue consolidating all the kernel xfs_inode operations in the one place. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:54:09 -05:00
Dave Chinner	c24b5dfadc	xfs: kill xfs_vnodeops.[ch] Now we have xfs_inode.c for holding kernel-only XFS inode operations, move all the inode operations from xfs_vnodeops.c to this new file as it holds another set of kernel-only inode operations. The name of this file traces back to the days of Irix and it's vnodes which we don't have anymore. Essentially this move consolidates the inode locking functions and a bunch of XFS inode operations into the one file. Eventually the high level functions will be merged into the VFS interface functions in xfs_iops.c. This leaves only internal preallocation, EOF block manipulation and hole punching functions in vnodeops.c. Move these to xfs_bmap_util.c where we are already consolidating various in-kernel physical extent manipulation and querying functions. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:53:39 -05:00
Dave Chinner	836a94ad59	xfs: fix issues that cause userspace warnings Some of the code shared with userspace causes compilation warnings from things turned off in the kernel code, such as differences in variable signedness. Fix those issues. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:52:54 -05:00
Dave Chinner	c5c249b424	xfs: minor cleanups These come from syncing the shared userspace and kernel code. Small whitespace and trivial cleanups. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:46:08 -05:00
Dave Chinner	6898811459	xfs: create xfs_bmap_util.[ch] There is a bunch of code in xfs_bmap.c that is kernel specific and not shared with userspace. To minimise the difference between the kernel and userspace code, shift this unshared code to xfs_bmap_util.c, and the declarations to xfs_bmap_util.h. The biggest issue here is xfs_bmap_finish() - userspace has it's own definition of this function, and so we need to move it out of xfs_bmap.[ch]. This means several other files need to include xfs_bmap_util.h as well. It also introduces and interesting dance for the stack switching code in xfs_bmapi_allocate(). The stack switching/workqueue code is actually moved to xfs_bmap_util.c, so that userspace can simply use a #define in a header file to connect the dots without needing to know about the stack switch code at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:45:17 -05:00
Dave Chinner	ff55068c20	xfs: introduce xfs_sb.c for sharing with libxfs xfs_mount.c is shared with userspace, but the only functions that are shared are to do with physical superblock manipulations. This means that less than 25% of the xfs_mount.c code is actually shared with userspace. Move all the superblock functions to xfs_sb.c and share that instead with libxfs. Note that this will leave all the in-core transaction related superblock counter modifications in xfs_mount.c as none of that is shared with userspace. With a few more small changes, xfs_mount.h won't need to be shared with userspace anymore, either. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:44:11 -05:00
Dave Chinner	1fb7e48db6	xfs: split out the remote symlink handling The remote symlink format definition and manipulation needs to be shared with userspace, but the in-kernel interfaces do not. Split the remote symlink format handling out into xfs_symlink_remote.[ch] fo it can easily be shared with userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:43:38 -05:00
Dave Chinner	fde2227ce1	xfs: split out attribute fork truncation code into separate file The attribute inactivation code is not used by userspace, so like the attribute listing, split it out into a separate file to minimise the differences between the filesystem shared with libxfs in userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:42:30 -05:00
Dave Chinner	abec5f2bf9	xfs: split out attribute listing code into separate file The attribute listing code is not used by userspace, so like the directory readdir code, split it out into a separate file to minimise the differences between the filesystem shared with libxfs in userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:41:29 -05:00
Dave Chinner	2b9ab5ab9c	xfs: reshuffle dir2 definitions around for userspace Many of the definitions within xfs_dir2_priv.h are needed in userspace outside libxfs. Definitions within xfs_dir2_priv.h are wholly contained within libxfs, so we need to shuffle some of the definitions around to keep consistency across files shared between user and kernel space. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:40:57 -05:00
Dave Chinner	4a8af273de	xfs: move getdents code into it's own file The directory readdir code is not used by userspace, but it is intermingled with files that are shared with userspace. This makes it difficult to compare the differences between the userspac eand kernel files are the userspace files don't have the getdents code in them. Move all the kernel getdents code to a separate file to bring the shared content between userspace and kernel files closer together. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:39:56 -05:00
Dave Chinner	1fd7115eda	xfs: introduce xfs_inode_buf.c for inode buffer operations The only thing remaining in xfs_inode.[ch] are the operations that read, write or verify physical inodes in their underlying buffers. Move all this code to xfs_inode_buf.[ch] and so we can stop sharing xfs_inode.[ch] with userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:39:05 -05:00
Dave Chinner	7bb85ef360	xfs: move unrelated definitions out of xfs_inode.h Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:37:57 -05:00
Dave Chinner	5c4d97d01a	xfs: move inode fork definitions to a new header file The inode fork definitions are a combination of on-disk format definition and in-memory tracking and manipulation. They are both shared with userspace, so move them all into their own file so sharing is easy to do and track. This removes all inode fork related information from xfs_inode.h. Do the same for the all the C code that currently resides in xfs_inode.c for the same reason. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:37:32 -05:00
Dave Chinner	7fd36c4418	xfs: split out transaction reservation code The transaction reservation size calculations is used by both kernel and userspace, but most of the transaction code in xfs_trans.c is kernel specific. Split all the transaction reservation code out into it's own files to make sharing with userspace simpler. This just leaves kernel-only definitions in xfs_trans.h, so it doesn't need to be shared with userspace anymore, either. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:36:16 -05:00
Dave Chinner	d386b32b55	xfs: sync minor header differences needed by userspace. Little things like exported functions, __KERNEL__ protections, and so on that ensure user and kernel shared headers are identical. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:35:41 -05:00
Dave Chinner	76456fc2a6	xfs: introduce xfs_quota_defs.h There are a lot of quota flag definitions that are shared by user and kernel space. Move them all to xfs_quota_defs.h so we can unshare xfs_quota.h and remove the __KERNEL__ regions from it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:20:18 -05:00
Dave Chinner	c7298202e5	xfs: introduce xfs_rtalloc_defs.h There are quite a few realtime device definitions shared with userspace. Move them from xfs_rtalloc.h to xfs_rt_alloc_defs.h so we don't need to share xfs_rtalloc.h with userspace anymore. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:13:10 -05:00
Dave Chinner	2a3c0acc35	xfs: split out on-disk transaction definitions There's a bunch of definitions in xfs_trans.h that define on-disk formats - transaction headers that get written into the log, log item type definitions, etc. Split out everything into a separate file so that all which remains in xfs_trans.h are kernel only definitions. Also, remove the duplicate magic number definitions for XFS_TRANS_MAGIC... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:11:57 -05:00
Dave Chinner	9cd047f3a3	xfs: separate icreate log format definitions from xfs_icreate_item.h The on disk log format definitions for the icreate log item are intertwined with the kernel-only in-memory log item definitions. Separate the log format definitions out into their own header file so they can easily be shared with userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:10:35 -05:00
Dave Chinner	6ca1c9063d	xfs: separate dquot on disk format definitions out of xfs_quota.h The on disk format definitions of the on-disk dquot, log formats and quota off log formats are all intertwined with other definitions for quotas. Separate them out into their own header file so they can easily be shared with userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:09:52 -05:00
Dave Chinner	9fbe24d95e	xfs: split out EFI/EFD log item format definition The EFI/EFD item format definitions are shared with userspace. Split the out of header files that contain kernel only defintions to make it simple to shared them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:07:13 -05:00
Dave Chinner	a8da0da25c	xfs: split out buf log item format definitions Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:06:37 -05:00
Dave Chinner	69432832fd	xfs: split out inode log item format definition The log item format definitions are shared with userspace. Split them out of header files that contain kernel only defintions to make it simple to shared them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:05:19 -05:00
Dave Chinner	fc06c6d064	xfs: separate out log format definitions The on-disk format definitions for the log are spread randoms through a couple of header files. Consolidate it all in a single file that can be shared easily with userspace. This means that xfs_log.h and xfs_log_priv.h no longer need to be shared with userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-08-12 16:03:51 -05:00
Tejun Heo	7a378c9aea	xfs: WQ_NON_REENTRANT is meaningless and going away `dbf2576e37` ("workqueue: make all workqueues non-reentrant") made WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: xfs@oss.sgi.com Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-30 13:11:17 -05:00
Dave Chinner	e1b4271ac2	xfs: di_flushiter considered harmful When we made all inode updates transactional, we no longer needed the log recovery detection for inodes being newer on disk than the transaction being replayed - it was redundant as replay of the log would always result in the latest version of the inode would be on disk. It was redundant, but left in place because it wasn't considered to be a problem. However, with the new "don't read inodes on create" optimisation, flushiter has come back to bite us. Essentially, the optimisation made always initialises flushiter to zero in the create transaction, and so if we then crash and run recovery and the inode already on disk has a non-zero flushiter it will skip recovery of that inode. As a result, log recovery does the wrong thing and we end up with a corrupt filesystem. Because we have to support old kernel to new kernel upgrades, we can't just get rid of the flushiter support in log recovery as we might be upgrading from a kernel that doesn't have fully transactional inode updates. Unfortunately, for v4 superblocks there is no way to guarantee that log recovery knows about this fact. We cannot add a new inode format flag to say it's a "special inode create" because it won't be understood by older kernels and so recovery could do the wrong thing on downgrade. We cannot specially detect the combination of zero mode/non-zero flushiter on disk to non-zero mode, zero flushiter in the log item during recovery because wrapping of the flushiter can result in false detection. Hence that makes this "don't use flushiter" optimisation limited to a disk format that guarantees that we don't need it. And that means the only fix here is to limit the "no read IO on create" optimisation to version 5 superblocks.... Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `e60896d8f2`)	2013-07-25 10:41:42 -05:00
Dave Chinner	e60896d8f2	xfs: di_flushiter considered harmful When we made all inode updates transactional, we no longer needed the log recovery detection for inodes being newer on disk than the transaction being replayed - it was redundant as replay of the log would always result in the latest version of the inode would be on disk. It was redundant, but left in place because it wasn't considered to be a problem. However, with the new "don't read inodes on create" optimisation, flushiter has come back to bite us. Essentially, the optimisation made always initialises flushiter to zero in the create transaction, and so if we then crash and run recovery and the inode already on disk has a non-zero flushiter it will skip recovery of that inode. As a result, log recovery does the wrong thing and we end up with a corrupt filesystem. Because we have to support old kernel to new kernel upgrades, we can't just get rid of the flushiter support in log recovery as we might be upgrading from a kernel that doesn't have fully transactional inode updates. Unfortunately, for v4 superblocks there is no way to guarantee that log recovery knows about this fact. We cannot add a new inode format flag to say it's a "special inode create" because it won't be understood by older kernels and so recovery could do the wrong thing on downgrade. We cannot specially detect the combination of zero mode/non-zero flushiter on disk to non-zero mode, zero flushiter in the log item during recovery because wrapping of the flushiter can result in false detection. Hence that makes this "don't use flushiter" optimisation limited to a disk format that guarantees that we don't need it. And that means the only fix here is to limit the "no read IO on create" optimisation to version 5 superblocks.... Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-24 12:15:23 -05:00
Chandra Seetharaman	d892d5864f	xfs: Start using pquotaino from the superblock. Start using pquotino and define a macro to check if the superblock has pquotino. Keep backward compatibilty by alowing mount of older superblock with no separate pquota inode. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-22 14:46:26 -05:00
Chandra Seetharaman	0102629776	xfs: Initialize all quota inodes to be NULLFSINO mkfs doesn't initialize the quota inodes to NULLFSINO as it does for the other internal inodes. This leads to two in-core values (0 and NULLFSINO) to be checked against, to make sure if a quota inode is valid. Solve that problem by initializing the in-core values of all quotaino values to NULLFSINO if they are 0 in the disk. Note that these values are not written back to on-disk superblock unless some quota is enabled on the filesystem. Even in that case sb_pquotino is written to disk only if the on-disk superblock supports pquotino Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-22 14:10:53 -05:00
Chandra Seetharaman	297aa63769	xfs: Fix a deadlock in xfs_log_commit_cil() code path While testing and rearranging pquota/gquota code, I stumbled on a xfs_shutdown() during a mount. But the mount just hung. Debugged and found that there is a deadlock involving &log->l_cilp->xc_ctx_lock. It is in a code path where &log->l_cilp->xc_ctx_lock is first acquired in read mode and some levels down the same semaphore is being acquired in write mode causing a deadlock. This is the stack: xfs_log_commit_cil -> acquires &log->l_cilp->xc_ctx_lock in read mode xlog_print_tic_res xfs_force_shutdown xfs_log_force_umount xlog_cil_force xlog_cil_force_lsn xlog_cil_push_foreground xlog_cil_push - tries to acquire same semaphore in write mode This patch fixes the deadlock by changing the reason code for xfs_force_shutdown in xlog_print_tic_res() to SHUTDOWN_LOG_IO_ERROR. SHUTDOWN_LOG_IO_ERROR is the right reason code to be set since we are in the log path. Thanks to Dave for suggesting this solution. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-22 13:58:10 -05:00
Jie Liu	58e59854a3	xfs: fix assertion failure in xfs_vm_write_failed() In xfs_vm_write_failed(), we evaluate the block_offset of pos with PAGE_MASK which is an unsigned long. That is fine on 64-bit platforms regardless of whether the request pos is 32-bit or 64-bit. However, on 32-bit platforms the value is 0xfffff000 and so the high 32 bits in it will be masked off with (pos & PAGE_MASK) for a 64-bit pos. As a result, the evaluated block_offset is incorrect which will cause this failure ASSERT(block_offset + from == pos); and potentially pass the wrong block to xfs_vm_kill_delalloc_range(). In this case, we can get a kernel panic if CONFIG_XFS_DEBUG is enabled: XFS: Assertion failed: block_offset + from == pos, file: fs/xfs/xfs_aops.c, line: 1504 ------------[ cut here ]------------ kernel BUG at fs/xfs/xfs_message.c:100! invalid opcode: 0000 [#1] SMP ........ Pid: 4057, comm: mkfs.xfs Tainted: G O 3.9.0-rc2 #1 EIP: 0060:[<f94a7e8b>] EFLAGS: 00010282 CPU: 0 EIP is at assfail+0x2b/0x30 [xfs] EAX: 00000056 EBX: f6ef28a0 ECX: 00000007 EDX: f57d22a4 ESI: 1c2fb000 EDI: 00000000 EBP: ea6b5d30 ESP: ea6b5d1c DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 094f3ff4 CR3: 2bcb4000 CR4: 000006f0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 Process mkfs.xfs (pid: 4057, ti=ea6b4000 task=ea5799e0 task.ti=ea6b4000) Stack: 00000000 f9525c48 f951fa80 f951f96b 000005e4 ea6b5d7c f9494b34 c19b0ea2 00000066 `f3d6c620` c19b0ea2 00000000 e9a91458 00001000 00000000 00000000 00000000 c15c7e89 00000000 1c2fb000 00000000 00000000 1c2fb000 00000080 Call Trace: [<f9494b34>] xfs_vm_write_failed+0x74/0x1b0 [xfs] [<c15c7e89>] ? printk+0x4d/0x4f [<f9494d7d>] xfs_vm_write_begin+0x10d/0x170 [xfs] [<c110a34c>] generic_file_buffered_write+0xdc/0x210 [<f949b669>] xfs_file_buffered_aio_write+0xf9/0x190 [xfs] [<f949b7f3>] xfs_file_aio_write+0xf3/0x160 [xfs] [<c115e504>] do_sync_write+0x94/0xd0 [<c115ed1f>] vfs_write+0x8f/0x160 [<c115e470>] ? wait_on_retry_sync_kiocb+0x50/0x50 [<c115f017>] sys_write+0x47/0x80 [<c15d860d>] sysenter_do_call+0x12/0x28 ............. EIP: [<f94a7e8b>] assfail+0x2b/0x30 [xfs] SS:ESP 0068:ea6b5d1c ---[ end trace cdd9af4f4ecab42f ]--- Kernel panic - not syncing: Fatal exception In order to avoid this, we can evaluate the block_offset of the start of the page by using shifts rather than masks the mismatch problem. Thanks Dave Chinner for help finding and fixing this bug. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Reviewed-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-22 13:12:19 -05:00
Linus Torvalds	239dab4636	xfs: update (#2 ) for 3.11-rc1 - fix for xfs_fsr returning -EINVAL - cleanup in xfs_bulkstat - cleanup in xfs_open_by_handle - update mount options documentation - clean up local format handling in xfs_bmapi_write - fix dquot log reservations which were too small - fix sgid inheritance for subdirectories when default acls are in use - add project quota fields to various structures - fix teardown of quotainfo structures when quotas are turned off -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJR4FDjAAoJENaLyazVq6ZOhZIQAMLWC4Wz+3PpRLkIlXZ83wdG LYDNxdYntSVDXNLCrfIhuavgW1yhseLcQZD34g0hgOLQRsmjvUw2ikO6u7qlyoV1 GZZVwdVhsZNicycvuEE0Sva9Jmjgxe0XUOCxJBNAWN0fm5Jzg7w0OGWyzxn2obRX L4LNPqnQS/phCrtNYfUnXpCuUcy0KbpX4GYrdt5tThIWHm7AcyRnOArEkFvVuwdf N3OSN/jSMaEF/GsAqmYFYSXhuL9P1vyBSlyFW82YbyFFd4FKZJbRiFbOsrgbFSkA Ssum7N+rfMc9DCbJsrztgxFaYpj42JR5eCm+jvTejx8nJWKiGjMVtzjq4QtwQQ6e vby7MzdjZ+l2oJclA0y8hOjeg0R7sPVP7xZziZRuK4PHsjtBH3N2FtCOeBtlGhyW 14LK+z+5YXU/gEwmxV5LaknODb2mxvWycf70jaQ6bvrQRUiFnPNIxYKvgAx8YJxl jgYSassHHKtLg0S54P/T91tRsyDVOhy5lqgeogzK5uYa+v+xlMloG+fLzb9GmIgS hXgUIAo+lNlHZkw1FdD4aRgh3OMiUvLQN6woBMbbXfS5XaNpF1UG30YAXeeIqV5e cLChzY+jQiCsmcktb3YQs9C5yfxEciFFSYmZOaKCTgQRWWnyI4/lAu1gd2KtTYUt ZfV0niME4wp0kBWZHOEH =QiZy -----END PGP SIGNATURE----- Merge tag 'for-linus-v3.11-rc1-2' of git://oss.sgi.com/xfs/xfs Pull more xfs updates from Ben Myers: "Here are a fix for xfs_fsr, a cleanup in bulkstat, a cleanup in xfs_open_by_handle, updated mount options documentation, a cleanup in xfs_bmapi_write, a fix for the size of dquot log reservations, a fix for sgid inheritance when acls are in use, a fix for cleaning up quotainfo structures, and some more of the work which allows group and project quotas to be used together. We had a few more in this last quota category that we might have liked to get in, but it looks there are still a few items that need to be addressed. - fix for xfs_fsr returning -EINVAL - cleanup in xfs_bulkstat - cleanup in xfs_open_by_handle - update mount options documentation - clean up local format handling in xfs_bmapi_write - fix dquot log reservations which were too small - fix sgid inheritance for subdirectories when default acls are in use - add project quota fields to various structures - fix teardown of quotainfo structures when quotas are turned off" * tag 'for-linus-v3.11-rc1-2' of git://oss.sgi.com/xfs/xfs: xfs: Fix the logic check for all quotas being turned off xfs: Add pquota fields where gquota is used. xfs: fix sgid inheritance for subdirectories inheriting default acls [V3] xfs: dquot log reservations are too small xfs: remove local fork format handling from xfs_bmapi_write() xfs: update mount options documentation xfs: use get_unused_fd_flags(0) instead of get_unused_fd() xfs: clean up unused codes at xfs_bulkstat() xfs: use XFS_BMAP_BMDR_SPACE vs. XFS_BROOT_SIZE_ADJ	2013-07-13 11:40:24 -07:00
Chandra Seetharaman	c31ad439e8	xfs: Fix the logic check for all quotas being turned off During the review of seperate pquota inode patches, David noticed that the test to detect all quotas being turned off was incorrect, and hence the block was not freeing all the quota information. The check made sense in Irix, but in Linux, quota is turned off one at a time, which makes the test invalid for Linux. This problem existed since XFS was ported to Linux. David suggested to fix the problem by detecting when all quotas are turned off by checking m_qflags. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-11 16:49:10 -05:00
Chandra Seetharaman	92f8ff73f1	xfs: Add pquota fields where gquota is used. Add project quota changes to all the places where group quota field is used: * add separate project quota members into various structures * split project quota and group quotas so that instead of overriding the group quota members incore, the new project quota members are used instead * get rid of usage of the OQUOTA flag incore, in favor of separate group and project quota flags. * add a project dquot argument to various functions. Not using the pquotino field from superblock yet. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-11 10:35:32 -05:00
Carlos Maiolino	42c49d7f24	xfs: fix sgid inheritance for subdirectories inheriting default acls [V3] XFS removes sgid bits of subdirectories under a directory containing a default acl. When a default acl is set, it implies xfs to call xfs_setattr_nonsize() in its code path. Such function is shared among mkdir and chmod system calls, and does some checks unneeded by mkdir (calling inode_change_ok()). Such checks remove sgid bit from the inode after it has been granted. With this patch, we extend the meaning of XFS_ATTR_NOACL flag to avoid these checks when acls are being inherited (thanks hch). Also, xfs_setattr_mode, doesn't need to re-check for group id and capabilities permissions, this only implies in another try to remove sgid bit from the directories. Such check is already done either on inode_change_ok() or xfs_setattr_nonsize(). Changelog: V2: Extends the meaning of XFS_ATTR_NOACL instead of wrap the tests into another function V3: Remove S_ISDIR check in xfs_setattr_nonsize() from the patch Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-10 10:21:51 -05:00
Dave Chinner	b0a9dab78a	xfs: dquot log reservations are too small During review of the separate project quota inode patches, it became obvious that the dquot log reservation calculation underestimated the number dquots that can be modified in a transaction. This has it's roots way back in the Irix quota implementation. That is, when quotas were first implemented in XFS, it only supported user and project quotas as Irix did not have group quotas. Hence the worst case operation involving dquot modification was calculated to involve 2 user dquots and 1 project dquot or 1 user dequot and 2 project dquots. i.e. 3 dquots. This was determined back in 1996, and has remained unchanged ever since. However, back in 2001, the Linux XFS port dropped all support for project quota and implmented group quotas over the top. This was effectively done with a search-and-replace of project with group, and as such the log reservation was not changed. However, with the advent of group quotas, chmod and rename now could modify more than 3 dquots in a single transaction - both could modify 4 dquots. Hence this log reservation has been wrong for a long time. In 2005, project quota support was reintroduced into Linux, but it was implemented to be mutually exclusive to group quotas and so this didn't add any new changes to the dquot log reservation. Hence when project quotas were in use (rather than group quotas) the log reservation was again valid, just like in the Irix days. Now, with the addition of the separate project quota inode, group and project quotas are no longer mutually exclusive, and hence operations can now modify three dquots per inode where previously it was only two. The worst case here is the rename transaction, which can allocate/free space on two different directory inodes, and if they have different uid/gid/prid configurations and are world writeable, then rename can actually modify 6 different dquots now. Further, the dquot log reservation doesn't take into account the space used by the dquot log format structure that precedes the dquot that is logged, and hence further underestimates the worst case log space required by dquots during a transaction. This has been missing since the first commit in 1996. Hence the worst case log reservation needs to be increased from 3 to 6, and it needs to take into account a log format header for each of those dquots. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-09 16:43:16 -05:00
Dave Chinner	f3508bcddf	xfs: remove local fork format handling from xfs_bmapi_write() The conversion from local format to extent format requires interpretation of the data in the fork being converted, so it cannot be done in a generic way. It is up to the caller to convert the fork format to extent format before calling into xfs_bmapi_write() so format conversion can be done correctly. The code in xfs_bmapi_write() to convert the format is used implicitly by the attribute and directory code, but they specifically zero the fork size so that the conversion does not do any allocation or manipulation. Move this conversion into the shortform to leaf functions for the dir/attr code so the conversions are explicitly controlled by all callers. Now we can remove the conversion code in xfs_bmapi_write. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-09 16:40:22 -05:00
Yann Droneaud	862a62937e	xfs: use get_unused_fd_flags(0) instead of get_unused_fd() Macro get_unused_fd() is used to allocate a file descriptor with default flags. Those default flags (0) can be "unsafe": O_CLOEXEC must be used by default to not leak file descriptor across exec(). Instead of macro get_unused_fd(), functions anon_inode_getfd() or get_unused_fd_flags() should be used with flags given by userspace. If not possible, flags should be set to O_CLOEXEC to provide userspace with a default safe behavor. In a further patch, get_unused_fd() will be removed so that new code start using anon_inode_getfd() or get_unused_fd_flags() with correct flags. This patch replaces calls to get_unused_fd() with equivalent call to get_unused_fd_flags(0) to preserve current behavor for existing code. The hard coded flag value (0) should be reviewed on a per-subsystem basis, and, if possible, set to O_CLOEXEC. Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-09 15:53:57 -05:00
Jie Liu	9cee4c5b7b	xfs: clean up unused codes at xfs_bulkstat() There are some unused codes at xfs_bulkstat(): - Variable bp is defined to point to the on-disk inode cluster buffer, but it proved to be of no practical help. - We process the chunks of good inodes which were fetched by iterating btree records from an AG. When processing inodes from each chunk, the code recomputing agbno if run into the first inode of a cluster, however, the agbno is not being used thereafter. This patch tries to clean up those things. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-09 15:36:21 -05:00
Linus Torvalds	da89bd213f	xfs: update for 3.11-rc1 - part of the work to allow project quotas and group quotas to be used together - inode change count - inode create transaction - block queue plugging in buffer readahead and bulkstat - ordered log vector support - removal of dead code in and around xfs_sync_inode_grab, xfs_ialloc_get_rec, XFS_MOUNT_RETERR, XFS_ALLOCFREE_LOG_RES, XFS_DIROP_LOG_RES, xfs_chash, ctl_table, and xfs_growfs_data_private - don't keep silent if sunit/swidth can not be changed via mount - fix a leak of remote symlink blocks into the filesystem when xattrs are used on symlinks - fix for fiemap to return FIEMAP_EXTENT_UNKOWN flag on delay extents - part of a fix for xfs_fsr - disable speculative preallocation with small files - performance improvements for inode creates and deletes -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJR3F9pAAoJENaLyazVq6ZOiKAP/jyfPbVj5AOiLVtTLhJUQ3Qf urCiMjl87BToixFxa/yxeOBrUbBiOgUQ/2om4b5cryYhN2RtQWiEi/iVdeBZw3rR 1J6VxH09R25GVRobIj2AwJ87eXyqKvi2DaVBrQbgva0BH8fmWKhQISfwLTPUIXcK GTrVpS1amTWxh69/5p5bxaxo4u2y7DbZwYC4xQOcPeSrfxizMQmN2JqtUjtLDyKp J08Md04vNcxJvfbFopcSFAncStIr6xnKMiaqvFttybLJf9HL9MqoaO9Xp+6aX1QI Pibae3oFctH1tGOmjgyg30AbPjzAHaGDSw9vuRaommQZbMwiZXAV2VGuJ5QWtAbi wXh5GIzaap1Z0EYuD9qY0hsFizOpmL+1YH+F20qIqa3i5CiDMeYMWlmzhfN63zH9 Zk2j1YUkbxLmKMEgl8s++eLfT1yenuzVAp5zUodApOPp161FOMJlVhhD55WDvKtP 2/3Ig25fz5CLwcIZT1MoZ9B7UP9dfrAAK30AcUmO3Iumj6b8bsxh4pzSpJ02U1cG cMIe+OhxC6KqDZlXeLmDXodnOo9Wc4glwdivcynxCNFv6WS5Ez/ufm+4oN+OVUo7 PwyNtoJfsptSYZyIRorXNez54ww3cOvvqjifUasTZTGkTSqCbf1LclzuMxcXDlyO 3YzE/DCK8ZWwJc/ysbn/ =v6KW -----END PGP SIGNATURE----- Merge tag 'for-linus-v3.11-rc1' of git://oss.sgi.com/xfs/xfs Pull xfs update from Ben Myers: "This includes several bugfixes, part of the work for project quotas and group quotas to be used together, performance improvements for inode creation/deletion, buffer readahead, and bulkstat, implementation of the inode change count, an inode create transaction, and the removal of a bunch of dead code. There are also some duplicate commits that you already have from the 3.10-rc series. - part of the work to allow project quotas and group quotas to be used together - inode change count - inode create transaction - block queue plugging in buffer readahead and bulkstat - ordered log vector support - removal of dead code in and around xfs_sync_inode_grab, xfs_ialloc_get_rec, XFS_MOUNT_RETERR, XFS_ALLOCFREE_LOG_RES, XFS_DIROP_LOG_RES, xfs_chash, ctl_table, and xfs_growfs_data_private - don't keep silent if sunit/swidth can not be changed via mount - fix a leak of remote symlink blocks into the filesystem when xattrs are used on symlinks - fix for fiemap to return FIEMAP_EXTENT_UNKOWN flag on delay extents - part of a fix for xfs_fsr - disable speculative preallocation with small files - performance improvements for inode creates and deletes" * tag 'for-linus-v3.11-rc1' of git://oss.sgi.com/xfs/xfs: (61 commits) xfs: Remove incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD xfs: Change xfs_dquot_acct to be a 2-dimensional array xfs: Code cleanup and removal of some typedef usage xfs: Replace macro XFS_DQ_TO_QIP with a function xfs: Replace macro XFS_DQUOT_TREE with a function xfs: Define a new function xfs_is_quota_inode() xfs: implement inode change count xfs: Use inode create transaction xfs: Inode create item recovery xfs: Inode create transaction reservations xfs: Inode create log items xfs: Introduce an ordered buffer item xfs: Introduce ordered log vector support xfs: xfs_ifree doesn't need to modify the inode buffer xfs: don't do IO when creating an new inode xfs: don't use speculative prealloc for small files xfs: plug directory buffer readahead xfs: add pluging for bulkstat readahead xfs: Remove dead function prototype xfs_sync_inode_grab() xfs: Remove the left function variable from xfs_ialloc_get_rec() ...	2013-07-09 12:29:12 -07:00
Eric Sandeen	a69c7c0772	xfs: use XFS_BMAP_BMDR_SPACE vs. XFS_BROOT_SIZE_ADJ XFS_BROOT_SIZE_ADJ is an undocumented macro which accounts for the difference in size between the on-disk and in-core btree root. It's much clearer to just use the newly-added XFS_BMAP_BMDR_SPACE macro which gives us the on-disk size directly. In one case, we must test that the if_broot exists before applying the macro, however. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-07-09 13:21:15 -05:00
Linus Torvalds	790eac5640	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second set of VFS changes from Al Viro: "Assorted f_pos race fixes, making do_splice_direct() safe to call with i_mutex on parent, O_TMPFILE support, Jeff's locks.c series, ->d_hash/->d_compare calling conventions changes from Linus, misc stuff all over the place." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) Document ->tmpfile() ext4: ->tmpfile() support vfs: export lseek_execute() to modules lseek_execute() doesn't need an inode passed to it block_dev: switch to fixed_size_llseek() cpqphp_sysfs: switch to fixed_size_llseek() tile-srom: switch to fixed_size_llseek() proc_powerpc: switch to fixed_size_llseek() ubi/cdev: switch to fixed_size_llseek() pci/proc: switch to fixed_size_llseek() isapnp: switch to fixed_size_llseek() lpfc: switch to fixed_size_llseek() locks: give the blocked_hash its own spinlock locks: add a new "lm_owner_key" lock operation locks: turn the blocked_list into a hashtable locks: convert fl_link to a hlist_node locks: avoid taking global lock if possible when waking up blocked waiters locks: protect most of the file_lock handling with i_lock locks: encapsulate the fl_link list handling locks: make "added" in __posix_lock_file a bool ...	2013-07-03 09:10:19 -07:00
Jie Liu	46a1c2c7ae	vfs: export lseek_execute() to modules For those file systems(btrfs/ext4/ocfs2/tmpfs) that support SEEK_DATA/SEEK_HOLE functions, we end up handling the similar matter in lseek_execute() to update the current file offset to the desired offset if it is valid, ceph also does the simliar things at ceph_llseek(). To reduce the duplications, this patch make lseek_execute() public accessible so that we can call it directly from the underlying file systems. Thanks Dave Chinner for this suggestion. [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back] v2->v1: - Add kernel-doc comments for lseek_execute() - Call lseek_execute() in ceph->llseek() Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Josef Bacik <jbacik@fusionio.com> Cc: Ben Myers <bpm@sgi.com> Cc: Ted Tso <tytso@mit.edu> Cc: Hugh Dickins <hughd@google.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Sage Weil <sage@inktank.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-07-03 16:23:27 +04:00
Linus Torvalds	9e239bb939	Lots of bug fixes, cleanups and optimizations. In the bug fixes category, of note is a fix for on-line resizing file systems where the block size is smaller than the page size (i.e., file systems 1k blocks on x86, or more interestingly file systems with 4k blocks on Power or ia64 systems.) In the cleanup category, the ext4's punch hole implementation was significantly improved by Lukas Czerner, and now supports bigalloc file systems. In addition, Jan Kara significantly cleaned up the write submission code path. We also improved error checking and added a few sanity checks. In the optimizations category, two major optimizations deserve mention. The first is that ext4_writepages() is now used for nodelalloc and ext3 compatibility mode. This allows writes to be submitted much more efficiently as a single bio request, instead of being sent as individual 4k writes into the block layer (which then relied on the elevator code to coalesce the requests in the block queue). Secondly, the extent cache shrink mechanism, which was introduce in 3.9, no longer has a scalability bottleneck caused by the i_es_lru spinlock. Other optimizations include some changes to reduce CPU usage and to avoid issuing empty commits unnecessarily. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABCAAGBQJR0XhgAAoJENNvdpvBGATwMXkQAJwTPk5XYLqtAwLziFLvM6wG 0tWa1QAzTNo80tLyM9iGqI6x74X5nddLw5NMICUmPooOa9agMuA4tlYVSss5jWzV yyB7vLzsc/2eZJusuVqfTKrdGybE+M766OI6VO9WodOoIF1l51JXKjktKeaWegfv NkcLKlakD4V+ZASEDB/cOcR/lTwAs9dQ89AZzgPiW+G8Do922QbqkENJB8mhalbg rFGX+lu9W0f3fqdmT3Xi8KGn3EglETdVd6jU7kOZN4vb5LcF5BKHQnnUmMlpeWMT ksOVasb3RZgcsyf5ZOV5feXV601EsNtPBrHAmH22pWQy3rdTIvMv/il63XlVUXZ2 AXT3cHEvNQP0/yVaOTCZ9xQVxT8sL4mI6kENP9PtNuntx7E90JBshiP5m24kzTZ/ zkIeDa+FPhsDx1D5EKErinFLqPV8cPWONbIt/qAgo6663zeeIyMVhzxO4resTS9k U2QEztQH+hDDbjgABtz9M/GjSrohkTYNSkKXzhTjqr/m5huBrVMngjy/F4/7G7RD vSEx5aXqyagnrUcjsupx+biJ1QvbvZWOVxAE/6hNQNRGDt9gQtHAmKw1eG2mugHX +TFDxodNE4iWEURenkUxXW3mDx7hFbGZR0poHG3M/LVhKMAAAw0zoKrrUG5c70G7 XrddRLGlk4Hf+2o7/D7B =SwaI -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 update from Ted Ts'o: "Lots of bug fixes, cleanups and optimizations. In the bug fixes category, of note is a fix for on-line resizing file systems where the block size is smaller than the page size (i.e., file systems 1k blocks on x86, or more interestingly file systems with 4k blocks on Power or ia64 systems.) In the cleanup category, the ext4's punch hole implementation was significantly improved by Lukas Czerner, and now supports bigalloc file systems. In addition, Jan Kara significantly cleaned up the write submission code path. We also improved error checking and added a few sanity checks. In the optimizations category, two major optimizations deserve mention. The first is that ext4_writepages() is now used for nodelalloc and ext3 compatibility mode. This allows writes to be submitted much more efficiently as a single bio request, instead of being sent as individual 4k writes into the block layer (which then relied on the elevator code to coalesce the requests in the block queue). Secondly, the extent cache shrink mechanism, which was introduce in 3.9, no longer has a scalability bottleneck caused by the i_es_lru spinlock. Other optimizations include some changes to reduce CPU usage and to avoid issuing empty commits unnecessarily." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits) ext4: optimize starting extent in ext4_ext_rm_leaf() jbd2: invalidate handle if jbd2_journal_restart() fails ext4: translate flag bits to strings in tracepoints ext4: fix up error handling for mpage_map_and_submit_extent() jbd2: fix theoretical race in jbd2__journal_restart ext4: only zero partial blocks in ext4_zero_partial_blocks() ext4: check error return from ext4_write_inline_data_end() ext4: delete unnecessary C statements ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree() jbd2: move superblock checksum calculation to jbd2_write_superblock() ext4: pass inode pointer instead of file pointer to punch hole ext4: improve free space calculation for inline_data ext4: reduce object size when !CONFIG_PRINTK ext4: improve extent cache shrink mechanism to avoid to burn CPU time ext4: implement error handling of ext4_mb_new_preallocation() ext4: fix corruption when online resizing a fs with 1K block size ext4: delete unused variables ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents jbd2: remove debug dependency on debug_fs and update Kconfig help text jbd2: use a single printk for jbd_debug() ...	2013-07-02 09:39:34 -07:00
Al Viro	b8227554c9	[readdir] convert xfs Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-06-29 12:57:00 +04:00
Chandra Seetharaman	83e782e1a1	xfs: Remove incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD Remove all incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD. Instead, start using XFS_GQUOTA_.* XFS_PQUOTA_.* counterparts for GQUOTA and PQUOTA respectively. On-disk copy still uses XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD. Read and write of the superblock does the conversion from OQUOTA to [PG]QUOTA. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-28 17:39:22 -05:00
Chandra Seetharaman	0e6436d99e	xfs: Change xfs_dquot_acct to be a 2-dimensional array In preparation for combined pquota/gquota support, for the sake of readability, change xfs_dquot_acct to be a 2-dimensional array. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-28 14:12:22 -05:00
Chandra Seetharaman	113a56835d	xfs: Code cleanup and removal of some typedef usage In preparation for combined pquota/gquota support, for the sake of readability, do some code cleanup surrounding the affected code. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-28 13:13:59 -05:00
Chandra Seetharaman	995961c451	xfs: Replace macro XFS_DQ_TO_QIP with a function In preparation for combined pquota/gquota support, for the sake of readability, change the macro to an inline function. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-28 13:12:42 -05:00
Chandra Seetharaman	329e087528	xfs: Replace macro XFS_DQUOT_TREE with a function In preparation for combined pquota/gquota support, for the sake of readability, change the macro to an inline function. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-28 13:05:07 -05:00
Chandra Seetharaman	9cad19d2cb	xfs: Define a new function xfs_is_quota_inode() In preparation for combined pquota/gquota support, define a new function to check if the given inode is a quota inode. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-28 13:03:49 -05:00
Dave Chinner	dc037ad7d2	xfs: implement inode change count For CRC enabled filesystems, add support for the monotonic inode version change counter that is needed by protocols like NFSv4 for determining if the inode has changed in any way at all between two unrelated operations on the inode. This bumps the change count the first time an inode is dirtied in a transaction. Since all modifications to the inode are logged, this will catch all changes that are made to the inode, including timestamp updates that occur during data writes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-28 13:00:05 -05:00
Dave Chinner	ddf6ad0143	xfs: Use inode create transaction Replace the use of buffer based logging of inode initialisation, uses the new logical form to describe the range to be initialised in recovery. We continue to "log" the inode buffers to push them into the AIL and ensure that the inode create transaction is not removed from the log before the inode buffers are written to disk. Update the transaction identifier and reservations to match the changed implementation. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-27 14:27:18 -05:00
Dave Chinner	28c8e41af6	xfs: Inode create item recovery When we find a icreate transaction, we need to get and initialise the buffers in the range that has been passed. Extract and verify the information in the item record, then loop over the range initialising and issuing the buffer writes delayed. Support an arbitrary size range to initialise so that in future when we allocate inodes in much larger chunks all kernels that understand this transaction can still recover them. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-27 14:26:21 -05:00
Dave Chinner	b8402b4729	xfs: Inode create transaction reservations Define the log and space transaction sizes. Factor the current create log reservation macro into the two logical halves and reuse one half for the new icreate transactions. The icreate transaction is transparent to all the high level create code - the pre-calculated reservations will correctly set the reservations dependent on whether the filesystem supports the icreate transaction. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-27 13:36:37 -05:00
Dave Chinner	3ebe7d2d73	xfs: Inode create log items Introduce the inode create log item type for logical inode create logging. Instead of logging the changes in buffers, pass the range to be initialised through the log by a new transaction type. This reduces the amount of log space required to record initialisation during allocation from about 128 bytes per inode to a small fixed amount per inode extent to be initialised. This requires a new log item type to track it through the log and the AIL. This is a relatively simple item - most callbacks are noops as this item has the same life cycle as the transaction. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-27 13:34:12 -05:00
Dave Chinner	5f6bed76c0	xfs: Introduce an ordered buffer item If we have a buffer that we have modified but we do not wish to physically log in a transaction (e.g. we've logged a logical change), we still need to ensure that transactional integrity is maintained. Hence we must not move the tail of the log past the transaction that the buffer is associated with before the buffer is written to disk. This means these special buffers still need to be included in the transaction and added to the AIL just like a normal buffer, but we do not want the modifications to the buffer written into the transaction. IOWs, what we want is an "ordered buffer" that maintains the same transactional life cycle as a physically logged buffer, just without the transcribing of the modifications to the log. Hence we need to flag the buffer as an "ordered buffer" to avoid including it in vector size calculations or formatting during the transaction. Once the transaction is committed, the buffer appears for all intents to be the same as a physically logged buffer as it transitions through the log and AIL. Relogging will also work just fine for such an ordered buffer - the logical transaction will be replayed before the subsequent modifications that relog the buffer, so everything will be reconstructed correctly by recovery. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-27 13:33:11 -05:00
Dave Chinner	fd63875cc4	xfs: Introduce ordered log vector support And "ordered log vector" is a log vector that is used for tracking a log item through the CIL and into the AIL as part of the log checkpointing. These ordered log vectors are special in that they are not written to to journal in any way, and are not accounted to the checkpoint being written. The reason for this behaviour is to allow operations to attach items to transactions and have them follow the normal transactional lifecycle without actually having to write them to the journal. This allows logging of items that track high level logical changes and writing them to the log, while the physical items being modified pass through into the AIL and pin the tail of the log (and therefore the logical item in the log) until all the modified items are physically written to disk. IOWs, it allows us to write metadata without physically logging every individual change but still maintain the full transactional integrity guarantees we currently have w.r.t. crash recovery. This change modifies some of the CIL item insertion loops, as ordered log vectors introduce some new constraints as they don't track any data. One advantage of this change is that it combines two log vector chain walks into a single pass, so there is less overhead in the transaction commit pass as well. It also kills some unused code in the log vector walk loop when committing the CIL. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-27 13:32:08 -05:00
Dave Chinner	1baaed8fa9	xfs: xfs_ifree doesn't need to modify the inode buffer Long ago, bulkstat used to read inodes directly from the backing buffer for speed. This had the unfortunate problem of being cache incoherent with unlinks, and so xfs_ifree() had to mark the inode as free directly in the backing buffer. bulkstat was changed some time ago to use inode cache coherent lookups, and so will never see unlinked inodes in it's lookups. Hence xfs_ifree() does not need to touch the inode backing buffer anymore. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-27 13:31:04 -05:00
Dave Chinner	cca9f93a52	xfs: don't do IO when creating an new inode When we are allocating a new inode, we read the inode cluster off disk to increment the generation number. We are already using a random generation number for newly allocated inodes, so if we are not using the ikeep mode, we can just generate a new generation number when we initialise the newly allocated inode. This avoids the need for reading the inode buffer during inode creation. This will speed up allocation of inodes in cold, partially allocated clusters as they will no longer need to be read from disk during allocation. It will also reduce the CPU overhead of inode allocation by not having the process the buffer read, even on cache hits. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-27 13:28:20 -05:00
Dave Chinner	133eeb1747	xfs: don't use speculative prealloc for small files Dedicated small file workloads have been seeing significant free space fragmentation causing premature inode allocation failure when large inode sizes are in use. A particular test case showed that a workload that runs to a real ENOSPC on 256 byte inodes would fail inode allocation with ENOSPC about about 80% full with 512 byte inodes, and at about 50% full with 1024 byte inodes. The same workload, when run with -o allocsize=4096 on 1024 byte inodes would run to being 100% full before giving ENOSPC. That is, no freespace fragmentation at all. The issue was caused by the specific IO pattern the application had - the framework it was using did not support direct IO, and so it was emulating it by using fadvise(DONT_NEED). The result was that the data was getting written back before the speculative prealloc had been trimmed from memory by the close(), and so small single block files were being allocated with 2 blocks, and then having one truncated away. The result was lots of small 4k free space extents, and hence each new 8k allocation would take another 8k from contiguous free space and turn it into 4k of allocated space and 4k of free space. Hence inode allocation, which requires contiguous, aligned allocation of 16k (256 byte inodes), 32k (512 byte inodes) or 64k (1024 byte inodes) can fail to find sufficiently large freespace and hence fail while there is still lots of free space available. There's a simple fix for this, and one that has precendence in the allocator code already - don't do speculative allocation unless the size of the file is larger than a certain size. In this case, that size is the minimum default preallocation size: mp->m_writeio_blocks. And to keep with the concept of being nice to people when the files are still relatively small, cap the prealloc to mp->m_writeio_blocks until the file goes over a stripe unit is size, at which point we'll fall back to the current behaviour based on the last extent size. This will effectively turn off speculative prealloc for very small files, keep preallocation low for small files, and behave as it currently does for any file larger than a stripe unit. This completely avoids the freespace fragmentation problem this particular IO pattern was causing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-27 13:27:37 -05:00
Dave Chinner	34eefc06a0	xfs: plug directory buffer readahead Similar to bulkstat inode chunk readahead, we need to plug directory data buffer readahead during getdents to ensure that we can merge adjacent readahead requests and sort out of order requests optimally before they are dispatched. This improves the readahead efficiency and reduces the IO load it generates as the IO patterns are significantly better for both contiguous and fragmented directories. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-27 13:27:24 -05:00
Dave Chinner	cbb2864aa4	xfs: add pluging for bulkstat readahead I was running some tests on bulkstat on CRC enabled filesystems when I noticed that all the IO being issued was 8k in size, regardless of the fact taht we are issuing sequential 8k buffers for inodes clusters. The IO size should be 16k for 256 byte inodes, and 32k for 512 byte inodes, but this wasn't happening. blktrace showed that there was an explict plug and unplug happening around each readahead IO from _xfs_buf_ioapply, and the unplug was causing the IO to be issued immediately. Hence no opportunity was being given to the elevator to merge adjacent readahead requests and dispatch them as a single IO. Add plugging around the inode chunk readahead dispatch loop in bulkstat to ensure that we don't unplug the queue between adjacent inode buffer readahead IOs and so we get fewer, larger IO requests hitting the storage subsystem for bulkstat. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-27 13:26:23 -05:00
Jie Liu	80a4049813	xfs: Remove dead function prototype xfs_sync_inode_grab() Remove dead function prototype xfs_sync_inode_grab() from xfs_icache.h. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-26 12:29:27 -05:00
Jie Liu	43df2ee659	xfs: Remove the left function variable from xfs_ialloc_get_rec() This patch clean out the left function variable as it is useless to xfs_ialloc_get_rec(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-26 12:22:41 -05:00
Eric Sandeen	427d9fe233	xfs: check on-disk (not incore) btree root size in dfrag.c xfs_swap_extents_check_format() contains checks to make sure that original and the temporary files during defrag are compatible; Gabriel VLASIU ran into a case where xfs_fsr returned EINVAL because the tests found the btree root to be of size 120, while the fork offset was only 104; IOW, they overlapped. However, this is just due to an error in the xfs_swap_extents_check_format() tests, because it is checking the in-memory btree root size against the on-disk fork offset. We should be checking the on-disk sizes in both cases. This patch adds a new macro to calculate this size, and uses it in the tests. With this change, the filesystem image provided by Gabriel allows for proper file degragmentation. Reported-by: Gabriel VLASIU <gabriel@vlasiu.net> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-20 13:26:09 -05:00
Jie Liu	39a45d8463	xfs: Remove XFS_MOUNT_RETERR XFS_MOUNT_RETERR is going to be set at xfs_parseargs() if mp->m_dalign is enabled, so any time we enter "if (mp->m_dalign)" branch in xfs_update_alignment(), XFS_MOUNT_RETERR is set and so we always be emitting a warning and returning an error. Hence, we can remove it and get rid of a couple of redundant check up against it at xfs_upate_alignment(). Thanks Dave Chinner for the suggestions of simplify the code in xfs_parseargs(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-19 14:54:17 -05:00
Jie Liu	2fb8b5027d	xfs: Remove two dead transaction log reservaion macros Upstream commit `5b292ae3a9` xfs: make use of xfs_calc_buf_res() in xfs_trans.c Beginning from above commit, neither XFS_ALLOCFREE_LOG_RES() nor XFS_DIROP_LOG_RES() is used by those routines for calculating transaction space reservations, so it's safe to remove them now. Also, with a slightly update for the relevant comments to reflect the ideas of why those log count numbers should be. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-19 14:26:16 -05:00
Jie Liu	635c4d0bd9	xfs: return FIEMAP_EXTENT_UNKNOWN for delayed allocation extent For FIEMAP ioctl(2), if an extent is in delayed allocation state, we need to return the FIEMAP_EXTENT_UNKNOWN flag except the FIEMAP_EXTENT_DELALLOC because its data location is unknown. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-19 14:18:32 -05:00
Mark Tinguely	725eb1eb2a	xfs: fix the symbolic link assert in xfs_ifree Adding an extended attribute to a symbolic link can force that link to an remote extent. xfs_inactive() incorrectly assumes that any symbolic link small enough to be in the inode core is incore, resulting in the remote extent to not be removed. xfs_ifree() will assert on presence of this leaked remote extent. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-19 14:14:43 -05:00
Jeff Liu	1ebdf3611c	xfs: Remove struct xfs_chash from xfs_mount Remove struct xfs_chash from struct xfs_mount as there is no user of it nowadays. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-17 17:54:21 -05:00
Jie Liu	34d7f603b9	xfs: Don't keep silent if sunit/swidth can not be changed via mount As per the mount man page, sunit and swidth can be changed via mount options. For XFS, on the face of it, those options seems works if the specified alignments is properly, e.g. # mount -o sunit=4096,swidth=8192 /dev/sdb1 /mnt # mount \| grep sdb1 /dev/sdb1 on /mnt type xfs (rw,sunit=4096,swidth=8192) However, neither sunit nor swidth is shown from the xfs_info output. # xfs_info /mnt meta-data=/dev/sdb1 isize=256 agcount=4, agsize=262144 blks = sectsz=512 attr=2 data = bsize=4096 blocks=1048576, imaxpct=25 = sunit=0 swidth=0 blks ^^^^^^^^^^^^^^^^^^^^^^^^^^ naming =version 2 bsize=4096 ascii-ci=0 log =internal bsize=4096 blocks=2560, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 The reason is that the alignment can only be changed if the relevant super block is already configured with alignments, otherwise, the given value is silently ignored. With this fix, the attempt to mount a storage without strip alignment setup on a super block will get an error with a warning in syslog to indicate the true cause, e.g. # mount -o sunit=4096,swidth=8192 /dev/sdb1 /mnt mount: wrong fs type, bad option, bad superblock on /dev/sdb1, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg \| tail or so ....... XFS (sdb1): cannot change alignment: superblock does not support data alignment Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Mark Tinguely <tinguely@sgi.com> Cc: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-17 17:49:02 -05:00
Jie Liu	897366f0e4	xfs: Remove redundant error variable from xfs_growfs_data_private() Commit `eab4e633` "xfs: uncached buffer reads need to return an error". Remove redundant error variable, using the function level error variable to store bp->b_error instead. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-17 17:43:04 -05:00
Joe Perches	b2410e92b7	xfs: Convert use of typedef ctl_table to struct ctl_table This typedef is unnecessary and should just be removed. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-17 17:42:25 -05:00
Dave Chinner	d302cf1d31	xfs: don't shutdown log recovery on validation errors Unfortunately, we cannot guarantee that items logged multiple times and replayed by log recovery do not take objects back in time. When they are taken back in time, the go into an intermediate state which is corrupt, and hence verification that occurs on this intermediate state causes log recovery to abort with a corruption shutdown. Instead of causing a shutdown and unmountable filesystem, don't verify post-recovery items before they are written to disk. This is less than optimal, but there is no way to detect this issue for non-CRC filesystems If log recovery successfully completes, this will be undone and the object will be consistent by subsequent transactions that are replayed, so in most cases we don't need to take drastic action. For CRC enabled filesystems, leave the verifiers in place - we need to call them to recalculate the CRCs on the objects anyway. This recovery problem can be solved for such filesystems - we have a LSN stamped in all metadata at writeback time that we can to determine whether the item should be replayed or not. This is a separate piece of work, so is not addressed by this patch. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `9222a9cf86`)	2013-06-14 15:59:45 -05:00
Dave Chinner	088c9f67c3	xfs: ensure btree root split sets blkno correctly For CRC enabled filesystems, the BMBT is rooted in an inode, so it passes through a different code path on root splits than the freespace and inode btrees. This is much less traversed by xfstests than the other trees. When testing on a 1k block size filesystem, I've been seeing ASSERT failures in generic/234 like: XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP \|\| cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317 which are generally preceded by a lblock check failure. I noticed this in the bmbt stats: $ pminfo -f xfs.btree.block_map xfs.btree.block_map.lookup value 39135 xfs.btree.block_map.compare value 268432 xfs.btree.block_map.insrec value 15786 xfs.btree.block_map.delrec value 13884 xfs.btree.block_map.newroot value 2 xfs.btree.block_map.killroot value 0 ..... Very little coverage of root splits and merges. Indeed, on a 4k filesystem, block_map.newroot and block_map.killroot are both zero. i.e. the code is not exercised at all, and it's the only generic btree infrastructure operation that is not exercised by a default run of xfstests. Turns out that on a 1k filesystem, generic/234 accounts for one of those two root splits, and that is somewhat of a smoking gun. In fact, it's the same problem we saw in the directory/attr code where headers are memcpy()d from one block to another without updating the self describing metadata. Simple fix - when copying the header out of the root block, make sure the block number is updated correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `ade1335afe`)	2013-06-14 15:59:31 -05:00
Dave Chinner	5170711df7	xfs: fix implicit padding in directory and attr CRC formats Michael L. Semon has been testing CRC patches on a 32 bit system and been seeing assert failures in the directory code from xfs/080. Thanks to Michael's heroic efforts with printk debugging, we found that the problem was that the last free space being left in the directory structure was too small to fit a unused tag structure and it was being corrupted and attempting to log a region out of bounds. Hence the assert failure looked something like: ..... #5 calling xfs_dir2_data_log_unused() 36 32 #1 4092 4095 4096 #2 8182 8183 4096 XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568 Where #1 showed the first region of the dup being logged (i.e. the last 4 bytes of a directory buffer) and #2 shows the corrupt values being calculated from the length of the dup entry which overflowed the size of the buffer. It turns out that the problem was not in the logging code, nor in the freespace handling code. It is an initial condition bug that only shows up on 32 bit systems. When a new buffer is initialised, where's the freespace that is set up: [ 172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname() [ 172.316346] #9 calling xfs_dir2_data_log_unused() [ 172.316351] #1 calling xfs_trans_log_buf() 60 63 4096 [ 172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096 Note the offset of the first region being logged? It's 60 bytes into the buffer. Once I saw that, I pretty much knew that the bug was going to be caused by this. Essentially, all direct entries are rounded to 8 bytes in length, and all entries start with an 8 byte alignment. This means that we can decode inplace as variables are naturally aligned. With the directory data supposedly starting on a 8 byte boundary, and all entries padded to 8 bytes, the minimum freespace in a directory block is supposed to be 8 bytes, which is large enough to fit a unused data entry structure (6 bytes in size). The fact we only have 4 bytes of free space indicates a directory data block alignment problem. And what do you know - there's an implicit hole in the directory data block header for the CRC format, which means the header is 60 byte on 32 bit intel systems and 64 bytes on 64 bit systems. Needs padding. And while looking at the structures, I found the same problem in the attr leaf header. Fix them both. Note that this only affects 32 bit systems with CRCs enabled. Everything else is just fine. Note that CRC enabled filesystems created before this fix on such systems will not be readable with this fix applied. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Debugged-by: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `8a1fd2950e`)	2013-06-14 15:59:16 -05:00
Dave Chinner	47ad2fcba9	xfs: don't emit v5 superblock warnings on write We write the superblock every 30s or so which results in the verifier being called. Right now that results in this output every 30s: XFS (vda): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled! Use of these features in this kernel is at your own risk! And spamming the logs. We don't need to check for whether we support v5 superblocks or whether there are feature bits we don't support set as these are only relevant when we first mount the filesytem. i.e. on superblock read. Hence for the write verification we can just skip all the checks (and hence verbose output) altogether. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `34510185ab`)	2013-06-14 15:58:47 -05:00
Dave Chinner	9222a9cf86	xfs: don't shutdown log recovery on validation errors Unfortunately, we cannot guarantee that items logged multiple times and replayed by log recovery do not take objects back in time. When they are taken back in time, the go into an intermediate state which is corrupt, and hence verification that occurs on this intermediate state causes log recovery to abort with a corruption shutdown. Instead of causing a shutdown and unmountable filesystem, don't verify post-recovery items before they are written to disk. This is less than optimal, but there is no way to detect this issue for non-CRC filesystems If log recovery successfully completes, this will be undone and the object will be consistent by subsequent transactions that are replayed, so in most cases we don't need to take drastic action. For CRC enabled filesystems, leave the verifiers in place - we need to call them to recalculate the CRCs on the objects anyway. This recovery problem can be solved for such filesystems - we have a LSN stamped in all metadata at writeback time that we can to determine whether the item should be replayed or not. This is a separate piece of work, so is not addressed by this patch. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-14 15:29:31 -05:00
Dave Chinner	ade1335afe	xfs: ensure btree root split sets blkno correctly For CRC enabled filesystems, the BMBT is rooted in an inode, so it passes through a different code path on root splits than the freespace and inode btrees. This is much less traversed by xfstests than the other trees. When testing on a 1k block size filesystem, I've been seeing ASSERT failures in generic/234 like: XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP \|\| cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317 which are generally preceded by a lblock check failure. I noticed this in the bmbt stats: $ pminfo -f xfs.btree.block_map xfs.btree.block_map.lookup value 39135 xfs.btree.block_map.compare value 268432 xfs.btree.block_map.insrec value 15786 xfs.btree.block_map.delrec value 13884 xfs.btree.block_map.newroot value 2 xfs.btree.block_map.killroot value 0 ..... Very little coverage of root splits and merges. Indeed, on a 4k filesystem, block_map.newroot and block_map.killroot are both zero. i.e. the code is not exercised at all, and it's the only generic btree infrastructure operation that is not exercised by a default run of xfstests. Turns out that on a 1k filesystem, generic/234 accounts for one of those two root splits, and that is somewhat of a smoking gun. In fact, it's the same problem we saw in the directory/attr code where headers are memcpy()d from one block to another without updating the self describing metadata. Simple fix - when copying the header out of the root block, make sure the block number is updated correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-13 14:18:02 -05:00
Dave Chinner	8a1fd2950e	xfs: fix implicit padding in directory and attr CRC formats Michael L. Semon has been testing CRC patches on a 32 bit system and been seeing assert failures in the directory code from xfs/080. Thanks to Michael's heroic efforts with printk debugging, we found that the problem was that the last free space being left in the directory structure was too small to fit a unused tag structure and it was being corrupted and attempting to log a region out of bounds. Hence the assert failure looked something like: ..... #5 calling xfs_dir2_data_log_unused() 36 32 #1 4092 4095 4096 #2 8182 8183 4096 XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568 Where #1 showed the first region of the dup being logged (i.e. the last 4 bytes of a directory buffer) and #2 shows the corrupt values being calculated from the length of the dup entry which overflowed the size of the buffer. It turns out that the problem was not in the logging code, nor in the freespace handling code. It is an initial condition bug that only shows up on 32 bit systems. When a new buffer is initialised, where's the freespace that is set up: [ 172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname() [ 172.316346] #9 calling xfs_dir2_data_log_unused() [ 172.316351] #1 calling xfs_trans_log_buf() 60 63 4096 [ 172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096 Note the offset of the first region being logged? It's 60 bytes into the buffer. Once I saw that, I pretty much knew that the bug was going to be caused by this. Essentially, all direct entries are rounded to 8 bytes in length, and all entries start with an 8 byte alignment. This means that we can decode inplace as variables are naturally aligned. With the directory data supposedly starting on a 8 byte boundary, and all entries padded to 8 bytes, the minimum freespace in a directory block is supposed to be 8 bytes, which is large enough to fit a unused data entry structure (6 bytes in size). The fact we only have 4 bytes of free space indicates a directory data block alignment problem. And what do you know - there's an implicit hole in the directory data block header for the CRC format, which means the header is 60 byte on 32 bit intel systems and 64 bytes on 64 bit systems. Needs padding. And while looking at the structures, I found the same problem in the attr leaf header. Fix them both. Note that this only affects 32 bit systems with CRCs enabled. Everything else is just fine. Note that CRC enabled filesystems created before this fix on such systems will not be readable with this fix applied. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Debugged-by: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-13 10:30:03 -05:00
Dave Chinner	0a8aa19397	xfs: increase number of ACL entries for V5 superblocks The limit of 25 ACL entries is arbitrary, but baked into the on-disk format. For version 5 superblocks, increase it to the maximum nuber of ACLs that can fit into a single xattr. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinuguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `5c87d4bc1a`)	2013-06-06 10:52:15 -05:00
Dave Chinner	f763fd440e	xfs: disable noattr2/attr2 mount options for CRC enabled filesystems attr2 format is always enabled for v5 superblock filesystems, so the mount options to enable or disable it need to be cause mount errors. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `d3eaace84e`)	2013-06-06 10:51:34 -05:00
Dave Chinner	ad868afddb	xfs: inode unlinked list needs to recalculate the inode CRC The inode unlinked list manipulations operate directly on the inode buffer, and so bypass the inode CRC calculation mechanisms. Hence an inode on the unlinked list has an invalid CRC. Fix this by recalculating the CRC whenever we modify an unlinked list pointer in an inode, ncluding during log recovery. This is trivial to do and results in unlinked list operations always leaving a consistent inode in the buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `0a32c26e72`)	2013-06-06 10:51:19 -05:00
Dave Chinner	7540617075	xfs: fix log recovery transaction item reordering There are several constraints that inode allocation and unlink logging impose on log recovery. These all stem from the fact that inode alloc/unlink are logged in buffers, but all other inode changes are logged in inode items. Hence there are ordering constraints that recovery must follow to ensure the correct result occurs. As it turns out, this ordering has been working mostly by chance than good management. The existing code moves all buffers except cancelled buffers to the head of the list, and everything else to the tail of the list. The problem with this is that is interleaves inode items with the buffer cancellation items, and hence whether the inode item in an cancelled buffer gets replayed is essentially left to chance. Further, this ordering causes problems for log recovery when inode CRCs are enabled. It typically replays the inode unlink buffer long before it replays the inode core changes, and so the CRC recorded in an unlink buffer is going to be invalid and hence any attempt to validate the inode in the buffer is going to fail. Hence we really need to enforce the ordering that the inode alloc/unlink code has expected log recovery to have since inode chunk de-allocation was introduced back in 2003... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `a775ad7780`)	2013-06-06 10:51:07 -05:00
Dave Chinner	ea929536a4	xfs: fix remote attribute invalidation for a leaf When invalidating an attribute leaf block block, there might be remote attributes that it points to. With the recent rework of the remote attribute format, we have to make sure we calculate the length of the attribute correctly. We aren't doing that in xfs_attr3_leaf_inactive(), so fix it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinuguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `59913f14df`)	2013-06-06 10:50:52 -05:00
Dave Chinner	bb9b8e86ad	xfs: rework dquot CRCs Calculating dquot CRCs when the backing buffer is written back just doesn't work reliably. There are several places which manipulate dquots directly in the buffers, and they don't calculate CRCs appropriately, nor do they always set the buffer up to calculate CRCs appropriately. Firstly, if we log a dquot buffer (e.g. during allocation) it gets logged without valid CRC, and so on recovery we end up with a dquot that is not valid. Secondly, if we recover/repair a dquot, we don't have a verifier attached to the buffer and hence CRCs are not calculated on the way down to disk. Thirdly, calculating the CRC after we've changed the contents means that if we re-read the dquot from the buffer, we cannot verify the contents of the dquot are valid, as the CRC is invalid. So, to avoid all the dquot CRC errors that are being detected by the read verifier, change to using the same model as for inodes. That is, dquot CRCs are calculated and written to the backing buffer at the time the dquot is flushed to the backing buffer. If we modify the dquot directly in the backing buffer, calculate the CRC immediately after the modification is complete. Hence the dquot in the on-disk buffer should always have a valid CRC. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `6fcdc59de2`)	2013-06-06 10:50:35 -05:00
Dave Chinner	5c87d4bc1a	xfs: increase number of ACL entries for V5 superblocks The limit of 25 ACL entries is arbitrary, but baked into the on-disk format. For version 5 superblocks, increase it to the maximum nuber of ACLs that can fit into a single xattr. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinuguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-05 11:26:53 -05:00
Dave Chinner	d3eaace84e	xfs: disable noattr2/attr2 mount options for CRC enabled filesystems attr2 format is always enabled for v5 superblock filesystems, so the mount options to enable or disable it need to be cause mount errors. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-05 11:21:06 -05:00
Dave Chinner	0a32c26e72	xfs: inode unlinked list needs to recalculate the inode CRC The inode unlinked list manipulations operate directly on the inode buffer, and so bypass the inode CRC calculation mechanisms. Hence an inode on the unlinked list has an invalid CRC. Fix this by recalculating the CRC whenever we modify an unlinked list pointer in an inode, ncluding during log recovery. This is trivial to do and results in unlinked list operations always leaving a consistent inode in the buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-05 11:19:10 -05:00
Dave Chinner	a775ad7780	xfs: fix log recovery transaction item reordering There are several constraints that inode allocation and unlink logging impose on log recovery. These all stem from the fact that inode alloc/unlink are logged in buffers, but all other inode changes are logged in inode items. Hence there are ordering constraints that recovery must follow to ensure the correct result occurs. As it turns out, this ordering has been working mostly by chance than good management. The existing code moves all buffers except cancelled buffers to the head of the list, and everything else to the tail of the list. The problem with this is that is interleaves inode items with the buffer cancellation items, and hence whether the inode item in an cancelled buffer gets replayed is essentially left to chance. Further, this ordering causes problems for log recovery when inode CRCs are enabled. It typically replays the inode unlink buffer long before it replays the inode core changes, and so the CRC recorded in an unlink buffer is going to be invalid and hence any attempt to validate the inode in the buffer is going to fail. Hence we really need to enforce the ordering that the inode alloc/unlink code has expected log recovery to have since inode chunk de-allocation was introduced back in 2003... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-05 11:13:19 -05:00
Dave Chinner	59913f14df	xfs: fix remote attribute invalidation for a leaf When invalidating an attribute leaf block block, there might be remote attributes that it points to. With the recent rework of the remote attribute format, we have to make sure we calculate the length of the attribute correctly. We aren't doing that in xfs_attr3_leaf_inactive(), so fix it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinuguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-04 17:36:30 -05:00
Dave Chinner	6fcdc59de2	xfs: rework dquot CRCs Calculating dquot CRCs when the backing buffer is written back just doesn't work reliably. There are several places which manipulate dquots directly in the buffers, and they don't calculate CRCs appropriately, nor do they always set the buffer up to calculate CRCs appropriately. Firstly, if we log a dquot buffer (e.g. during allocation) it gets logged without valid CRC, and so on recovery we end up with a dquot that is not valid. Secondly, if we recover/repair a dquot, we don't have a verifier attached to the buffer and hence CRCs are not calculated on the way down to disk. Thirdly, calculating the CRC after we've changed the contents means that if we re-read the dquot from the buffer, we cannot verify the contents of the dquot are valid, as the CRC is invalid. So, to avoid all the dquot CRC errors that are being detected by the read verifier, change to using the same model as for inodes. That is, dquot CRCs are calculated and written to the backing buffer at the time the dquot is flushed to the backing buffer. If we modify the dquot directly in the backing buffer, calculate the CRC immediately after the modification is complete. Hence the dquot in the on-disk buffer should always have a valid CRC. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-06-04 17:35:51 -05:00
Dave Chinner	7bc0dc271e	xfs: rework remote attr CRCs Note: this changes the on-disk remote attribute format. I assert that this is OK to do as CRCs are marked experimental and the first kernel it is included in has not yet reached release yet. Further, the userspace utilities are still evolving and so anyone using this stuff right now is a developer or tester using volatile filesystems for testing this feature. Hence changing the format right now to save longer term pain is the right thing to do. The fundamental change is to move from a header per extent in the attribute to a header per filesytem block in the attribute. This means there are more header blocks and the parsing of the attribute data is slightly more complex, but it has the advantage that we always know the size of the attribute on disk based on the length of the data it contains. This is where the header-per-extent method has problems. We don't know the size of the attribute on disk without first knowing how many extents are used to hold it. And we can't tell from a mapping lookup, either, because remote attributes can be allocated contiguously with other attribute blocks and so there is no obvious way of determining the actual size of the atribute on disk short of walking and mapping buffers. The problem with this approach is that if we map a buffer incorrectly (e.g. we make the last buffer for the attribute data too long), we then get buffer cache lookup failure when we map it correctly. i.e. we get a size mismatch on lookup. This is not necessarily fatal, but it's a cache coherency problem that can lead to returning the wrong data to userspace or writing the wrong data to disk. And debug kernels will assert fail if this occurs. I found lots of niggly little problems trying to fix this issue on a 4k block size filesystem, finally getting it to pass with lots of fixes. The thing is, 1024 byte filesystems still failed, and it was getting really complex handling all the corner cases that were showing up. And there were clearly more that I hadn't found yet. It is complex, fragile code, and if we don't fix it now, it will be complex, fragile code forever more. Hence the simple fix is to add a header to each filesystem block. This gives us the same relationship between the attribute data length and the number of blocks on disk as we have without CRCs - it's a linear mapping and doesn't require us to guess anything. It is simple to implement, too - the remote block count calculated at lookup time can be used by the remote attribute set/get/remove code without modification for both CRC and non-CRC filesystems. The world becomes sane again. Because the copy-in and copy-out now need to iterate over each filesystem block, I moved them into helper functions so we separate the block mapping and buffer manupulations from the attribute data and CRC header manipulations. The code becomes much clearer as a result, and it is a lot easier to understand and debug. It also appears to be much more robust - once it worked on 4k block size filesystems, it has worked without failure on 1k block size filesystems, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `ad1858d777`)	2013-05-30 17:26:31 -05:00
Dave Chinner	634fd5322a	xfs: fully initialise temp leaf in xfs_attr3_leaf_compact xfs_attr3_leaf_compact() uses a temporary buffer for compacting the the entries in a leaf. It copies the the original buffer into the temporary buffer, then zeros the original buffer completely. It then copies the entries back into the original buffer. However, the original buffer has not been correctly initialised, and so the movement of the entries goes horribly wrong. Make sure the zeroed destination buffer is fully initialised, and once we've set up the destination incore header appropriately, write is back to the buffer before starting to move entries around. While debugging this, the _d/_s prefixes weren't sufficient to remind me what buffer was what, so rename then all _src/_dst. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `d4c712bcf2`)	2013-05-30 17:26:24 -05:00
Dave Chinner	9e80c76205	xfs: fully initialise temp leaf in xfs_attr3_leaf_unbalance xfs_attr3_leaf_unbalance() uses a temporary buffer for recombining the entries in two leaves when the destination leaf requires compaction. The temporary buffer ends up being copied back over the original destination buffer, so the header in the temporary buffer needs to contain all the information that is in the destination buffer. To make sure the temporary buffer is fully initialised, once we've set up the temporary incore header appropriately, write is back to the temporary buffer before starting to move entries around. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `8517de2a81`)	2013-05-30 17:26:16 -05:00
Dave Chinner	58a7228155	xfs: correctly map remote attr buffers during removal If we don't map the buffers correctly (same as for get/set operations) then the incore buffer lookup will fail. If a block number matches but a length is wrong, then debug kernels will ASSERT fail in _xfs_buf_find() due to the length mismatch. Ensure that we map the buffers correctly by basing the length of the buffer on the attribute data length rather than the remote block count. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `6863ef8449`)	2013-05-30 17:26:08 -05:00
Dave Chinner	26f714450c	xfs: remote attribute tail zeroing does too much When an attribute data does not fill then entire remote block, we zero the remaining part of the buffer. This, however, needs to take into account that the buffer has a header, and so the offset where zeroing starts and the length of zeroing need to take this into account. Otherwise we end up with zeros over the end of the attribute value when CRCs are enabled. While there, make sure we only ask to map an extent that covers the remaining range of the attribute, rather than asking every time for the full length of remote data. If the remote attribute blocks are contiguous with other parts of the attribute tree, it will map those blocks as well and we can potentially zero them incorrectly. We can also get buffer size mistmatches when trying to read or remove the remote attribute, and this can lead to not finding the correct buffer when looking it up in cache. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `4af3644c9a`)	2013-05-30 17:25:58 -05:00
Dave Chinner	551b382f53	xfs: remote attribute read too short Reading a maximally size remote attribute fails when CRCs are enabled with this verification error: XFS (vdb): remote attribute header does not match required off/len/owner) There are two reasons for this, the first being that the length of the buffer being read is determined from the args->rmtblkcnt which doesn't take into account CRC headers. Hence the mapped length ends up being too short and so we need to calculate it directly from the value length. The second is that the byte count of valid data within a buffer is capped by the length of the data and so doesn't take into account that the buffer might be longer due to headers. Hence we need to calculate the data space in the buffer first before calculating the actual byte count of data. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `913e96bc29`)	2013-05-30 17:25:50 -05:00
Dave Chinner	9531e2de6b	xfs: remote attribute allocation may be contiguous When CRCs are enabled, there may be multiple allocations made if the headers cause a length overflow. This, however, does not mean that the number of headers required increases, as the second and subsequent extents may be contiguous with the previous extent. Hence when we map the extents to write the attribute data, we may end up with less extents than allocations made. Hence the assertion that we consume the number of headers we calculated in the allocation loop is incorrect and needs to be removed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `90253cf142`)	2013-05-30 17:25:39 -05:00
Dave Chinner	e400d27d16	xfs: fix dir3 freespace block corruption When the directory freespace index grows to a second block (2017 4k data blocks in the directory), the initialisation of the second new block header goes wrong. The write verifier fires a corruption error indicating that the block number in the header is zero. This was being tripped by xfs/110. The problem is that the initialisation of the new block is done just fine in xfs_dir3_free_get_buf(), but the caller then users a dirv2 structure to zero on-disk header fields that xfs_dir3_free_get_buf() has already zeroed. These lined up with the block number in the dir v3 header format. While looking at this, I noticed that the struct xfs_dir3_free_hdr() had 4 bytes of padding in it that wasn't defined as padding or being zeroed by the initialisation. Add a pad field declaration and fully zero the on disk and in-core headers in xfs_dir3_free_get_buf() so that this is never an issue in the future. Note that this doesn't change the on-disk layout, just makes the 32 bits of padding in the layout explicit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `5ae6e6a401`)	2013-05-30 17:22:54 -05:00
Dave Chinner	7c9950fd2a	xfs: disable swap extents ioctl on CRC enabled filesystems Currently, swapping extents from one inode to another is a simple act of switching data and attribute forks from one inode to another. This, unfortunately in no longer so simple with CRC enabled filesystems as there is owner information embedded into the BMBT blocks that are swapped between inodes. Hence swapping the forks between inodes results in the inodes having mapping blocks that point to the wrong owner and hence are considered corrupt. To fix this we need an extent tree block or record based swap algorithm so that the BMBT block owner information can be updated atomically in the swap transaction. This is a significant piece of new work, so for the moment simply don't allow swap extent operations to succeed on CRC enabled filesystems. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `02f75405a7`)	2013-05-30 17:20:08 -05:00
Dave Chinner	e7927e879d	xfs: add fsgeom flag for v5 superblock support. Currently userspace has no way of determining that a filesystem is CRC enabled. Add a flag to the XFS_IOC_FSGEOMETRY ioctl output to indicate that the filesystem has v5 superblock support enabled. This will allow xfs_info to correctly report the state of the filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `74137fff06`)	2013-05-30 17:19:45 -05:00
Dave Chinner	1de09d1ae4	xfs: fix incorrect remote symlink block count When CRCs are enabled, the number of blocks needed to hold a remote symlink on a 1k block size filesystem may be 2 instead of 1. The transaction reservation for the allocated blocks was not taking this into account and only allocating one block. Hence when trying to read or invalidate such symlinks, we are mapping a hole where there should be a block and things go bad at that point. Fix the reservation to use the correct block count, clean up the block count calculation similar to the remote attribute calculation, and add a debug guard to detect when we don't write the entire symlink to disk. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `321a95839e`)	2013-05-30 17:19:07 -05:00
Dave Chinner	7d2ffe80aa	xfs: fix split buffer vector log recovery support A long time ago in a galaxy far away.... .. the was a commit made to fix some ilinux specific "fragmented buffer" log recovery problem: http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603 That problem occurred when a contiguous dirty region of a buffer was split across across two pages of an unmapped buffer. It's been a long time since that has been done in XFS, and the changes to log the entire inode buffers for CRC enabled filesystems has re-introduced that corner case. And, of course, it turns out that the above commit didn't actually fix anything - it just ensured that log recovery is guaranteed to fail when this situation occurs. And now for the gory details. xfstest xfs/085 is failing with this assert: XFS (vdb): bad number of regions (0) in inode log format XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583 Largely undocumented factoid #1: Log recovery depends on all log buffer format items starting with this format: struct foo_log_format { __uint16_t type; __uint16_t size; .... As recoery uses the size field and assumptions about 32 bit alignment in decoding format items. So don't pay much attention to the fact log recovery thinks that it decoding an inode log format item - it just uses them to determine what the size of the item is. But why would it see a log format item with a zero size? Well, luckily enough xfs_logprint uses the same code and gives the same error, so with a bit of gdb magic, it turns out that it isn't a log format that is being decoded. What logprint tells us is this: Oper (130): tid: a0375e1a len: 28 clientid: TRANS flags: none BUF: #regs: 2 start blkno: 144 (0x90) len: 16 bmap size: 2 flags: 0x4000 Oper (131): tid: a0375e1a len: 4096 clientid: TRANS flags: none BUF DATA ---------------------------------------------------------------------------- Oper (132): tid: a0375e1a len: 4096 clientid: TRANS flags: none xfs_logprint: unknown log operation type (4e49) ********************************************************************** * ERROR: data block=2 * ********************************************************************** That we've got a buffer format item (oper 130) that has two regions; the format item itself and one dirty region. The subsequent region after the buffer format item and it's data is them what we are tripping over, and the first bytes of it at an inode magic number. Not a log opheader like there is supposed to be. That means there's a problem with the buffer format item. It's dirty data region is 4096 bytes, and it contains - you guessed it - initialised inodes. But inode buffers are 8k, not 4k, and we log them in their entirety. So something is wrong here. The buffer format item contains: (gdb) p /x (struct xfs_buf_log_format )in_f $22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000, blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2, blf_data_map = {0xffffffff, 0xffffffff, .... }} Two regions, and a signle dirty contiguous region of 64 bits. 64 * 128 = 8k, so this should be followed by a single 8k region of data. And the blf_flags tell us that the type of buffer is a XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation buffer. So, it should be followed by 8k of inode data. But we know that the next region has a header of: (gdb) p /x ohead $25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69, oh_flags = 0x0, oh_res2 = 0x0} and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not long enough to hold all the logged data. There must be another region. There is - there's a following opheader for another 4k of data that contains the other half of the inode cluster data - the one we assert fail on because it's not a log format header. So why is the second part of the data not being accounted to the correct buffer log format structure? It took a little more work with gdb to work out that the buffer log format structure was both expecting it to be there but hadn't accounted for it. It was at that point I went to the kernel code, as clearly this wasn't a bug in xfs_logprint and the kernel was writing bad stuff to the log. First port of call was the buffer item formatting code, and the discontiguous memory/contiguous dirty region handling code immediately stood out. I've wondered for a long time why the code had this comment in it: vecp->i_addr = xfs_buf_offset(bp, buffer_offset); vecp->i_len = nbits XFS_BLF_CHUNK; vecp->i_type = XLOG_REG_TYPE_BCHUNK; /* * You would think we need to bump the nvecs here too, but we do not * this number is used by recovery, and it gets confused by the boundary * split here * nvecs++; */ vecp++; And it didn't account for the extra vector pointer. The case being handled here is that a contiguous dirty region lies across a boundary that cannot be memcpy()d across, and so has to be split into two separate operations for xlog_write() to perform. What this code assumes is that what is written to the log is two consecutive blocks of data that are accounted in the buf log format item as the same contiguous dirty region and so will get decoded as such by the log recovery code. The thing is, xlog_write() knows nothing about this, and so just does it's normal thing of adding an opheader for each vector. That means the 8k region gets written to the log as two separate regions of 4k each, but because nvecs has not been incremented, the buf log format item accounts for only one of them. Hence when we come to log recovery, we process the first 4k region and then expect to come across a new item that starts with a log format structure of some kind that tells us whenteh next data is going to be. Instead, we hit raw buffer data and things go bad real quick. So, the commit from 2002 that commented out nvecs++ is just plain wrong. It breaks log recovery completely, and it would seem the only reason this hasn't been since then is that we don't log large contigous regions of multi-page unmapped buffers very often. Never would be a closer estimate, at least until the CRC code came along.... So, lets fix that by restoring the nvecs accounting for the extra region when we hit this case..... .... and there's the problemin log recovery it is apparently working around: XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135 Yup, xlog_recover_do_reg_buffer() doesn't handle contigous dirty regions being broken up into multiple regions by the log formatting code. That's an easy fix, though - if the number of contiguous dirty bits exceeds the length of the region being copied out of the log, only account for the number of dirty bits that region covers, and then loop again and copy more from the next region. It's a 2 line fix. Now xfstests xfs/085 passes, we have one less piece of mystery code, and one more important piece of knowledge about how to structure new log format items.. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `709da6a61a`)	2013-05-30 17:18:01 -05:00
Dave Chinner	2962f5a5dc	xfs: kill suid/sgid through the truncate path. XFS has failed to kill suid/sgid bits correctly when truncating files of non-zero size since commit `c4ed4243` ("xfs: split xfs_setattr") introduced in the 3.1 kernel. Fix it. Fix it. cc: stable kernel <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `56c19e89b3`)	2013-05-30 17:17:35 -05:00
Dave Chinner	08fb39051f	xfs: avoid nesting transactions in xfs_qm_scall_setqlim() Lockdep reports: ============================================= [ INFO: possible recursive locking detected ] 3.9.0+ #3 Not tainted --------------------------------------------- setquota/28368 is trying to acquire lock: (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50 but task is already holding lock: (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50 from xfs_qm_scall_setqlim()->xfs_dqread() when a dquot needs to be allocated. xfs_qm_scall_setqlim() is starting a transaction and then not passing it into xfs_qm_dqet() and so it starts it's own transaction when allocating the dquot. Splat! Fix this by not allocating the dquot in xfs_qm_scall_setqlim() inside the setqlim transaction. This requires getting the dquot first (and allocating it if necessary) then dropping and relocking the dquot before joining it to the setqlim transaction. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `f648167f3a`)	2013-05-30 17:10:56 -05:00
Dave Chinner	5ae6e6a401	xfs: fix dir3 freespace block corruption When the directory freespace index grows to a second block (2017 4k data blocks in the directory), the initialisation of the second new block header goes wrong. The write verifier fires a corruption error indicating that the block number in the header is zero. This was being tripped by xfs/110. The problem is that the initialisation of the new block is done just fine in xfs_dir3_free_get_buf(), but the caller then users a dirv2 structure to zero on-disk header fields that xfs_dir3_free_get_buf() has already zeroed. These lined up with the block number in the dir v3 header format. While looking at this, I noticed that the struct xfs_dir3_free_hdr() had 4 bytes of padding in it that wasn't defined as padding or being zeroed by the initialisation. Add a pad field declaration and fully zero the on disk and in-core headers in xfs_dir3_free_get_buf() so that this is never an issue in the future. Note that this doesn't change the on-disk layout, just makes the 32 bits of padding in the layout explicit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-30 14:32:47 -05:00
Dave Chinner	56c19e89b3	xfs: kill suid/sgid through the truncate path. XFS has failed to kill suid/sgid bits correctly when truncating files of non-zero size since commit `c4ed4243` ("xfs: split xfs_setattr") introduced in the 3.1 kernel. Fix it. Fix it. cc: stable kernel <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-30 13:43:52 -05:00
Dave Chinner	74137fff06	xfs: add fsgeom flag for v5 superblock support. Currently userspace has no way of determining that a filesystem is CRC enabled. Add a flag to the XFS_IOC_FSGEOMETRY ioctl output to indicate that the filesystem has v5 superblock support enabled. This will allow xfs_info to correctly report the state of the filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-30 12:57:25 -05:00
Dave Chinner	02f75405a7	xfs: disable swap extents ioctl on CRC enabled filesystems Currently, swapping extents from one inode to another is a simple act of switching data and attribute forks from one inode to another. This, unfortunately in no longer so simple with CRC enabled filesystems as there is owner information embedded into the BMBT blocks that are swapped between inodes. Hence swapping the forks between inodes results in the inodes having mapping blocks that point to the wrong owner and hence are considered corrupt. To fix this we need an extent tree block or record based swap algorithm so that the BMBT block owner information can be updated atomically in the swap transaction. This is a significant piece of new work, so for the moment simply don't allow swap extent operations to succeed on CRC enabled filesystems. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-30 12:55:31 -05:00
Dave Chinner	709da6a61a	xfs: fix split buffer vector log recovery support A long time ago in a galaxy far away.... .. the was a commit made to fix some ilinux specific "fragmented buffer" log recovery problem: http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603 That problem occurred when a contiguous dirty region of a buffer was split across across two pages of an unmapped buffer. It's been a long time since that has been done in XFS, and the changes to log the entire inode buffers for CRC enabled filesystems has re-introduced that corner case. And, of course, it turns out that the above commit didn't actually fix anything - it just ensured that log recovery is guaranteed to fail when this situation occurs. And now for the gory details. xfstest xfs/085 is failing with this assert: XFS (vdb): bad number of regions (0) in inode log format XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583 Largely undocumented factoid #1: Log recovery depends on all log buffer format items starting with this format: struct foo_log_format { __uint16_t type; __uint16_t size; .... As recoery uses the size field and assumptions about 32 bit alignment in decoding format items. So don't pay much attention to the fact log recovery thinks that it decoding an inode log format item - it just uses them to determine what the size of the item is. But why would it see a log format item with a zero size? Well, luckily enough xfs_logprint uses the same code and gives the same error, so with a bit of gdb magic, it turns out that it isn't a log format that is being decoded. What logprint tells us is this: Oper (130): tid: a0375e1a len: 28 clientid: TRANS flags: none BUF: #regs: 2 start blkno: 144 (0x90) len: 16 bmap size: 2 flags: 0x4000 Oper (131): tid: a0375e1a len: 4096 clientid: TRANS flags: none BUF DATA ---------------------------------------------------------------------------- Oper (132): tid: a0375e1a len: 4096 clientid: TRANS flags: none xfs_logprint: unknown log operation type (4e49) ********************************************************************** * ERROR: data block=2 * ********************************************************************** That we've got a buffer format item (oper 130) that has two regions; the format item itself and one dirty region. The subsequent region after the buffer format item and it's data is them what we are tripping over, and the first bytes of it at an inode magic number. Not a log opheader like there is supposed to be. That means there's a problem with the buffer format item. It's dirty data region is 4096 bytes, and it contains - you guessed it - initialised inodes. But inode buffers are 8k, not 4k, and we log them in their entirety. So something is wrong here. The buffer format item contains: (gdb) p /x (struct xfs_buf_log_format )in_f $22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000, blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2, blf_data_map = {0xffffffff, 0xffffffff, .... }} Two regions, and a signle dirty contiguous region of 64 bits. 64 * 128 = 8k, so this should be followed by a single 8k region of data. And the blf_flags tell us that the type of buffer is a XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation buffer. So, it should be followed by 8k of inode data. But we know that the next region has a header of: (gdb) p /x ohead $25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69, oh_flags = 0x0, oh_res2 = 0x0} and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not long enough to hold all the logged data. There must be another region. There is - there's a following opheader for another 4k of data that contains the other half of the inode cluster data - the one we assert fail on because it's not a log format header. So why is the second part of the data not being accounted to the correct buffer log format structure? It took a little more work with gdb to work out that the buffer log format structure was both expecting it to be there but hadn't accounted for it. It was at that point I went to the kernel code, as clearly this wasn't a bug in xfs_logprint and the kernel was writing bad stuff to the log. First port of call was the buffer item formatting code, and the discontiguous memory/contiguous dirty region handling code immediately stood out. I've wondered for a long time why the code had this comment in it: vecp->i_addr = xfs_buf_offset(bp, buffer_offset); vecp->i_len = nbits XFS_BLF_CHUNK; vecp->i_type = XLOG_REG_TYPE_BCHUNK; /* * You would think we need to bump the nvecs here too, but we do not * this number is used by recovery, and it gets confused by the boundary * split here * nvecs++; */ vecp++; And it didn't account for the extra vector pointer. The case being handled here is that a contiguous dirty region lies across a boundary that cannot be memcpy()d across, and so has to be split into two separate operations for xlog_write() to perform. What this code assumes is that what is written to the log is two consecutive blocks of data that are accounted in the buf log format item as the same contiguous dirty region and so will get decoded as such by the log recovery code. The thing is, xlog_write() knows nothing about this, and so just does it's normal thing of adding an opheader for each vector. That means the 8k region gets written to the log as two separate regions of 4k each, but because nvecs has not been incremented, the buf log format item accounts for only one of them. Hence when we come to log recovery, we process the first 4k region and then expect to come across a new item that starts with a log format structure of some kind that tells us whenteh next data is going to be. Instead, we hit raw buffer data and things go bad real quick. So, the commit from 2002 that commented out nvecs++ is just plain wrong. It breaks log recovery completely, and it would seem the only reason this hasn't been since then is that we don't log large contigous regions of multi-page unmapped buffers very often. Never would be a closer estimate, at least until the CRC code came along.... So, lets fix that by restoring the nvecs accounting for the extra region when we hit this case..... .... and there's the problemin log recovery it is apparently working around: XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135 Yup, xlog_recover_do_reg_buffer() doesn't handle contigous dirty regions being broken up into multiple regions by the log formatting code. That's an easy fix, though - if the number of contiguous dirty bits exceeds the length of the region being copied out of the log, only account for the number of dirty bits that region covers, and then loop again and copy more from the next region. It's a 2 line fix. Now xfstests xfs/085 passes, we have one less piece of mystery code, and one more important piece of knowledge about how to structure new log format items.. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-30 12:48:33 -05:00
Dave Chinner	321a95839e	xfs: fix incorrect remote symlink block count When CRCs are enabled, the number of blocks needed to hold a remote symlink on a 1k block size filesystem may be 2 instead of 1. The transaction reservation for the allocated blocks was not taking this into account and only allocating one block. Hence when trying to read or invalidate such symlinks, we are mapping a hole where there should be a block and things go bad at that point. Fix the reservation to use the correct block count, clean up the block count calculation similar to the remote attribute calculation, and add a debug guard to detect when we don't write the entire symlink to disk. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-30 12:37:04 -05:00
Dave Chinner	34510185ab	xfs: don't emit v5 superblock warnings on write We write the superblock every 30s or so which results in the verifier being called. Right now that results in this output every 30s: XFS (vda): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled! Use of these features in this kernel is at your own risk! And spamming the logs. We don't need to check for whether we support v5 superblocks or whether there are feature bits we don't support set as these are only relevant when we first mount the filesytem. i.e. on superblock read. Hence for the write verification we can just skip all the checks (and hence verbose output) altogether. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-30 12:24:19 -05:00
Dave Chinner	7ae077802c	xfs: remote attribute lookups require the value length When reading a remote attribute, to correctly calculate the length of the data buffer for CRC enable filesystems, we need to know the length of the attribute data. We get this information when we look up the attribute, but we don't store it in the args structure along with the other remote attr information we get from the lookup. Add this information to the args structure so we can use it appropriately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `e461fcb194`)	2013-05-24 16:31:20 -05:00
Dave Chinner	cf257abf02	xfs: xfs_attr_shortform_allfit() does not handle attr3 format. xfstests generic/117 fails with: XFS: Assertion failed: leaf->hdr.info.magic == cpu_to_be16(XFS_ATTR_LEAF_MAGIC) indicating a function that does not handle the attr3 format correctly. Fix it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `b38958d715`)	2013-05-24 16:29:56 -05:00
Dave Chinner	7ced60cae4	xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `72916fb8cb`)	2013-05-24 16:29:37 -05:00
Dave Chinner	b17cb364db	xfs: fix missing KM_NOFS tags to keep lockdep happy There are several places where we use KM_SLEEP allocation contexts and use the fact that they are called from transaction context to add KM_NOFS where appropriate. Unfortunately, there are several places where the code makes this assumption but can be called from outside transaction context but with filesystem locks held. These places need explicit KM_NOFS annotations to avoid lockdep complaining about reclaim contexts. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `ac14876cf9`)	2013-05-24 16:29:15 -05:00
Dave Chinner	509e708a89	xfs: Don't reference the EFI after it is freed Checking the EFI for whether it is being released from recovery after we've already released the known active reference is a mistake worthy of a brown paper bag. Fix the (now) obvious use after free that it can cause. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `52c24ad39f`)	2013-05-24 16:27:57 -05:00
Dave Chinner	7031d0e1c4	xfs: fix rounding in xfs_free_file_space The offset passed into xfs_free_file_space() needs to be rounded down to a certain size, but the rounding mask is built by a 32 bit variable. Hence the mask will always mask off the upper 32 bits of the offset and lead to incorrect writeback and invalidation ranges. This is not actually exposed as a bug because we writeback and invalidate from the rounded offset to the end of the file, and hence the offset we are actually punching a hole out of will always be covered by the code. This needs fixing, however, if we ever want to use exact ranges for writeback/invalidation here... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `28ca489c63`)	2013-05-24 16:27:41 -05:00
Dave Chinner	480d7467e4	xfs: fix sub-page blocksize data integrity writes FSX on 512 byte block size filesystems has been failing for some time with corrupted data. The fault dates back to the change in the writeback data integrity algorithm that uses a mark-and-sweep approach to avoid data writeback livelocks. Unfortunately, a side effect of this mark-and-sweep approach is that each page will only be written once for a data integrity sync, and there is a condition in writeback in XFS where a page may require two writeback attempts to be fully written. As a result of the high level change, we now only get a partial page writeback during the integrity sync because the first pass through writeback clears the mark left on the page index to tell writeback that the page needs writeback.... The cause is writing a partial page in the clustering code. This can happen when a mapping boundary falls in the middle of a page - we end up writing back the first part of the page that the mapping covers, but then never revisit the page to have the remainder mapped and written. The fix is simple - if the mapping boundary falls inside a page, then simple abort clustering without touching the page. This means that the next ->writepage entry that write_cache_pages() will make is the page we aborted on, and xfs_vm_writepage() will map all sections of the page correctly. This behaviour is also optimal for non-data integrity writes, as it results in contiguous sequential writeback of the file rather than missing small holes and having to write them a "random" writes in a future pass. With this fix, all the fsx tests in xfstests now pass on a 512 byte block size filesystem on a 4k page machine. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `49b137cbbc`)	2013-05-24 16:26:51 -05:00
Dave Chinner	ad1858d777	xfs: rework remote attr CRCs Note: this changes the on-disk remote attribute format. I assert that this is OK to do as CRCs are marked experimental and the first kernel it is included in has not yet reached release yet. Further, the userspace utilities are still evolving and so anyone using this stuff right now is a developer or tester using volatile filesystems for testing this feature. Hence changing the format right now to save longer term pain is the right thing to do. The fundamental change is to move from a header per extent in the attribute to a header per filesytem block in the attribute. This means there are more header blocks and the parsing of the attribute data is slightly more complex, but it has the advantage that we always know the size of the attribute on disk based on the length of the data it contains. This is where the header-per-extent method has problems. We don't know the size of the attribute on disk without first knowing how many extents are used to hold it. And we can't tell from a mapping lookup, either, because remote attributes can be allocated contiguously with other attribute blocks and so there is no obvious way of determining the actual size of the atribute on disk short of walking and mapping buffers. The problem with this approach is that if we map a buffer incorrectly (e.g. we make the last buffer for the attribute data too long), we then get buffer cache lookup failure when we map it correctly. i.e. we get a size mismatch on lookup. This is not necessarily fatal, but it's a cache coherency problem that can lead to returning the wrong data to userspace or writing the wrong data to disk. And debug kernels will assert fail if this occurs. I found lots of niggly little problems trying to fix this issue on a 4k block size filesystem, finally getting it to pass with lots of fixes. The thing is, 1024 byte filesystems still failed, and it was getting really complex handling all the corner cases that were showing up. And there were clearly more that I hadn't found yet. It is complex, fragile code, and if we don't fix it now, it will be complex, fragile code forever more. Hence the simple fix is to add a header to each filesystem block. This gives us the same relationship between the attribute data length and the number of blocks on disk as we have without CRCs - it's a linear mapping and doesn't require us to guess anything. It is simple to implement, too - the remote block count calculated at lookup time can be used by the remote attribute set/get/remove code without modification for both CRC and non-CRC filesystems. The world becomes sane again. Because the copy-in and copy-out now need to iterate over each filesystem block, I moved them into helper functions so we separate the block mapping and buffer manupulations from the attribute data and CRC header manipulations. The code becomes much clearer as a result, and it is a lot easier to understand and debug. It also appears to be much more robust - once it worked on 4k block size filesystems, it has worked without failure on 1k block size filesystems, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-23 18:04:06 -05:00
Dave Chinner	d4c712bcf2	xfs: fully initialise temp leaf in xfs_attr3_leaf_compact xfs_attr3_leaf_compact() uses a temporary buffer for compacting the the entries in a leaf. It copies the the original buffer into the temporary buffer, then zeros the original buffer completely. It then copies the entries back into the original buffer. However, the original buffer has not been correctly initialised, and so the movement of the entries goes horribly wrong. Make sure the zeroed destination buffer is fully initialised, and once we've set up the destination incore header appropriately, write is back to the buffer before starting to move entries around. While debugging this, the _d/_s prefixes weren't sufficient to remind me what buffer was what, so rename then all _src/_dst. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-23 17:53:08 -05:00
Dave Chinner	8517de2a81	xfs: fully initialise temp leaf in xfs_attr3_leaf_unbalance xfs_attr3_leaf_unbalance() uses a temporary buffer for recombining the entries in two leaves when the destination leaf requires compaction. The temporary buffer ends up being copied back over the original destination buffer, so the header in the temporary buffer needs to contain all the information that is in the destination buffer. To make sure the temporary buffer is fully initialised, once we've set up the temporary incore header appropriately, write is back to the temporary buffer before starting to move entries around. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-23 17:52:07 -05:00
Dave Chinner	6863ef8449	xfs: correctly map remote attr buffers during removal If we don't map the buffers correctly (same as for get/set operations) then the incore buffer lookup will fail. If a block number matches but a length is wrong, then debug kernels will ASSERT fail in _xfs_buf_find() due to the length mismatch. Ensure that we map the buffers correctly by basing the length of the buffer on the attribute data length rather than the remote block count. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-23 17:49:28 -05:00
Dave Chinner	4af3644c9a	xfs: remote attribute tail zeroing does too much When an attribute data does not fill then entire remote block, we zero the remaining part of the buffer. This, however, needs to take into account that the buffer has a header, and so the offset where zeroing starts and the length of zeroing need to take this into account. Otherwise we end up with zeros over the end of the attribute value when CRCs are enabled. While there, make sure we only ask to map an extent that covers the remaining range of the attribute, rather than asking every time for the full length of remote data. If the remote attribute blocks are contiguous with other parts of the attribute tree, it will map those blocks as well and we can potentially zero them incorrectly. We can also get buffer size mistmatches when trying to read or remove the remote attribute, and this can lead to not finding the correct buffer when looking it up in cache. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-23 17:35:18 -05:00
Dave Chinner	913e96bc29	xfs: remote attribute read too short Reading a maximally size remote attribute fails when CRCs are enabled with this verification error: XFS (vdb): remote attribute header does not match required off/len/owner) There are two reasons for this, the first being that the length of the buffer being read is determined from the args->rmtblkcnt which doesn't take into account CRC headers. Hence the mapped length ends up being too short and so we need to calculate it directly from the value length. The second is that the byte count of valid data within a buffer is capped by the length of the data and so doesn't take into account that the buffer might be longer due to headers. Hence we need to calculate the data space in the buffer first before calculating the actual byte count of data. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-23 17:31:20 -05:00
Lukas Czerner	34097dfe88	xfs: use ->invalidatepage() length argument ->invalidatepage() aop now accepts range to invalidate so we can make use of it in xfs_vm_invalidatepage() Signed-off-by: Lukas Czerner <lczerner@redhat.com> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com	2013-05-21 23:58:01 -04:00
Lukas Czerner	d47992f86b	mm: change invalidatepage prototype to accept length Currently there is no way to truncate partial page where the end truncate point is not at the end of the page. This is because it was not needed and the functionality was enough for file system truncate operation to work properly. However more file systems now support punch hole feature and it can benefit from mm supporting truncating page just up to the certain point. Specifically, with this functionality truncate_inode_pages_range() can be changed so it supports truncating partial page at the end of the range (currently it will BUG_ON() if 'end' is not at the end of the page). This commit changes the invalidatepage() address space operation prototype to accept range to be invalidated and update all the instances for it. We also change the block_invalidatepage() in the same way and actually make a use of the new length argument implementing range invalidation. Actual file system implementations will follow except the file systems where the changes are really simple and should not change the behaviour in any way .Implementation for truncate_page_range() which will be able to accept page unaligned ranges will follow as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com>	2013-05-21 23:17:23 -04:00
Dave Chinner	90253cf142	xfs: remote attribute allocation may be contiguous When CRCs are enabled, there may be multiple allocations made if the headers cause a length overflow. This, however, does not mean that the number of headers required increases, as the second and subsequent extents may be contiguous with the previous extent. Hence when we map the extents to write the attribute data, we may end up with less extents than allocations made. Hence the assertion that we consume the number of headers we calculated in the allocation loop is incorrect and needs to be removed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-21 14:22:51 -05:00
Dave Chinner	f648167f3a	xfs: avoid nesting transactions in xfs_qm_scall_setqlim() Lockdep reports: ============================================= [ INFO: possible recursive locking detected ] 3.9.0+ #3 Not tainted --------------------------------------------- setquota/28368 is trying to acquire lock: (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50 but task is already holding lock: (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50 from xfs_qm_scall_setqlim()->xfs_dqread() when a dquot needs to be allocated. xfs_qm_scall_setqlim() is starting a transaction and then not passing it into xfs_qm_dqet() and so it starts it's own transaction when allocating the dquot. Splat! Fix this by not allocating the dquot in xfs_qm_scall_setqlim() inside the setqlim transaction. This requires getting the dquot first (and allocating it if necessary) then dropping and relocking the dquot before joining it to the setqlim transaction. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-21 13:57:05 -05:00
Dave Chinner	e461fcb194	xfs: remote attribute lookups require the value length When reading a remote attribute, to correctly calculate the length of the data buffer for CRC enable filesystems, we need to know the length of the attribute data. We get this information when we look up the attribute, but we don't store it in the args structure along with the other remote attr information we get from the lookup. Add this information to the args structure so we can use it appropriately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-20 17:16:12 -05:00
Dave Chinner	b38958d715	xfs: xfs_attr_shortform_allfit() does not handle attr3 format. xfstests generic/117 fails with: XFS: Assertion failed: leaf->hdr.info.magic == cpu_to_be16(XFS_ATTR_LEAF_MAGIC) indicating a function that does not handle the attr3 format correctly. Fix it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-20 16:53:22 -05:00
Dave Chinner	72916fb8cb	xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-20 16:32:30 -05:00
Dave Chinner	ac14876cf9	xfs: fix missing KM_NOFS tags to keep lockdep happy There are several places where we use KM_SLEEP allocation contexts and use the fact that they are called from transaction context to add KM_NOFS where appropriate. Unfortunately, there are several places where the code makes this assumption but can be called from outside transaction context but with filesystem locks held. These places need explicit KM_NOFS annotations to avoid lockdep complaining about reclaim contexts. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-20 16:18:05 -05:00
Dave Chinner	52c24ad39f	xfs: Don't reference the EFI after it is freed Checking the EFI for whether it is being released from recovery after we've already released the known active reference is a mistake worthy of a brown paper bag. Fix the (now) obvious use after free that it can cause. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-20 14:29:34 -05:00
Dave Chinner	28ca489c63	xfs: fix rounding in xfs_free_file_space The offset passed into xfs_free_file_space() needs to be rounded down to a certain size, but the rounding mask is built by a 32 bit variable. Hence the mask will always mask off the upper 32 bits of the offset and lead to incorrect writeback and invalidation ranges. This is not actually exposed as a bug because we writeback and invalidate from the rounded offset to the end of the file, and hence the offset we are actually punching a hole out of will always be covered by the code. This needs fixing, however, if we ever want to use exact ranges for writeback/invalidation here... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-20 14:25:50 -05:00
Dave Chinner	49b137cbbc	xfs: fix sub-page blocksize data integrity writes FSX on 512 byte block size filesystems has been failing for some time with corrupted data. The fault dates back to the change in the writeback data integrity algorithm that uses a mark-and-sweep approach to avoid data writeback livelocks. Unfortunately, a side effect of this mark-and-sweep approach is that each page will only be written once for a data integrity sync, and there is a condition in writeback in XFS where a page may require two writeback attempts to be fully written. As a result of the high level change, we now only get a partial page writeback during the integrity sync because the first pass through writeback clears the mark left on the page index to tell writeback that the page needs writeback.... The cause is writing a partial page in the clustering code. This can happen when a mapping boundary falls in the middle of a page - we end up writing back the first part of the page that the mapping covers, but then never revisit the page to have the remainder mapped and written. The fix is simple - if the mapping boundary falls inside a page, then simple abort clustering without touching the page. This means that the next ->writepage entry that write_cache_pages() will make is the page we aborted on, and xfs_vm_writepage() will map all sections of the page correctly. This behaviour is also optimal for non-data integrity writes, as it results in contiguous sequential writeback of the file rather than missing small holes and having to write them a "random" writes in a future pass. With this fix, all the fsx tests in xfstests now pass on a 512 byte block size filesystem on a 4k page machine. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-20 14:14:25 -05:00
Jan Kara	211d022c43	xfs: Avoid pathological backwards allocation Writing a large file using direct IO in 16 MB chunks sometimes results in a pathological allocation pattern where 16 MB chunks of large free extent are allocated to a file in a reversed order. So extents of a file look for example as: ext logical physical expected length flags 0 0 13 4550656 1 4550656 188136807 4550668 12562432 2 17113088 200699240 200699238 622592 3 17735680 182046055 201321831 4096 4 17739776 182041959 182050150 4096 5 17743872 182037863 182046054 4096 6 17747968 182033767 182041958 4096 7 17752064 182029671 182037862 4096 ... 6757 45400064 154381644 154389835 4096 6758 45404160 154377548 154385739 4096 6759 45408256 252951571 154381643 73728 eof This happens because XFS_ALLOCTYPE_THIS_BNO allocation fails (the last extent in the file cannot be further extended) so we fall back to XFS_ALLOCTYPE_NEAR_BNO allocation which picks end of a large free extent as the best place to continue the file. Since the chunk at the end of the free extent again cannot be further extended, this behavior repeats until the whole free extent is consumed in a reversed order. For data allocations this backward allocation isn't beneficial so make xfs_alloc_compute_diff() pick start of a free extent instead of its end for them. That avoids the backward allocation pattern. See thread at http://oss.sgi.com/archives/xfs/2013-03/msg00144.html for more details about the reproduction case and why this solution was chosen. Based on idea by Dave Chinner <dchinner@redhat.com>. CC: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-20 13:09:11 -05:00
Linus Torvalds	8769e078a9	xfs: update (#2 ) for v3.10-rc1 * add CONFIG_XFS_WARN, a step between zero debugging and CONFIG_XFS_DEBUG. * fix attrmulti and attrlist to fall back to vmalloc when kmalloc fails. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJRi/V+AAoJENaLyazVq6ZOgncP/i0p141SO7qAwf20ql5NMi5Q KdhzWymyILIZ8GQoxsvoDASXb3D5I/f+8qX1WgrAG7/k7fahdGOYCtFTJ5YqYLvy ocxNoKqkhmUrMBbx0BcoNB4rtqgwLuHMR9ihDGDJrJDsn1b6nZlr38VBngIMhcC3 rWS8hU6ZwEnGm9hcmNzMBLJpjCJRfwqRr9CmHbENP/LspV0ZTSgltIuFggBfjK6q LK3FYdrlYfiX1md3c9zPpdVn8P3hxeM/3Jq+O3mLZ6hY4uq1+NBSKQbS5uwg1d6e 3ib5hxBDlRk//+S0mzJJ3DZ+Qqa2zHpZ/jNKsjdOBUoyqEdRC5tI9hx3F3la5VxP 4oktNP2rvhpziRqRny/EwZm5xHrGTrdP8Uwq2nO2nxevtZyVd+P73K/17sNgOeCU 8Xm6d3Usxw4FUTbiHjw0LFoJlM8yfEUSiTr1K8TmDP5G4phE6RsNnlbDLSUZ6kgh 6UodqGZtKqdXegYljtwysX75PELQcWWcrA5+7U2/yk+qKyEJ3HvMNIXHvW6MqGTL 69gxdrdD/Ff+83N+ktA3Ks31aj9n0LYpy6shxXg+YpHl6Mny3UiEpiUEsVc8jn4E iJ4qVqQI73qLDCNLM9XShBpPz5N/aPzaFw7QPXbj3A9UH3wd5OmUKNieoL4t9ucH sAreKLnndOuriYoTAWA9 =IPgF -----END PGP SIGNATURE----- Merge tag 'for-linus-v3.10-rc1-2' of git://oss.sgi.com/xfs/xfs Pull xfs update (#2) from Ben Myers: - add CONFIG_XFS_WARN, a step between zero debugging and CONFIG_XFS_DEBUG. - fix attrmulti and attrlist to fall back to vmalloc when kmalloc fails. * tag 'for-linus-v3.10-rc1-2' of git://oss.sgi.com/xfs/xfs: xfs: fallback to vmalloc for large buffers in xfs_compat_attrlist_by_handle xfs: fallback to vmalloc for large buffers in xfs_attrlist_by_handle xfs: introduce CONFIG_XFS_WARN	2013-05-09 13:06:20 -07:00
Kent Overstreet	a27bb332c0	aio: don't include aio.h in sched.h Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-05-07 20:16:25 -07:00
Eric Sandeen	7dfbcbefad	xfs: fallback to vmalloc for large buffers in xfs_compat_attrlist_by_handle Shamelessly copied from dchinner's: `ad650f5b` xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get xfsdump uses a large buffer for extended attributes, which has a kmalloc'd shadow buffer in the kernel. This can fail after the system has been running for some time as it is a high order allocation. Add a fallback to vmalloc so that it doesn't require contiguous memory and so won't randomly fail while xfsdump is running. This was done for xfs_attrlist_by_handle but xfs_compat_attrlist_by_handle (the 32-bit version) needs the same attention. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-07 19:00:10 -05:00
Eric Sandeen	dd700d9452	xfs: fallback to vmalloc for large buffers in xfs_attrlist_by_handle Shamelessly copied from dchinner's: `ad650f5b` xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get xfsdump uses for a large buffer for extended attributes, which has a kmalloc'd shadow buffer in the kernel. This can fail after the system has been running for some time as it is a high order allocation. Add a fallback to vmalloc so that it doesn't require contiguous memory and so won't randomly fail while xfsdump is running. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-07 18:56:38 -05:00
Dave Chinner	742ae1e35b	xfs: introduce CONFIG_XFS_WARN Running a CONFIG_XFS_DEBUG kernel in production environments is not the best idea as it introduces significant overhead, can change the behaviour of algorithms (such as allocation) to improve test coverage, and (most importantly) panic the machine on non-fatal errors. There are many cases where all we want to do is run a kernel with more bounds checking enabled, such as is provided by the ASSERT() statements throughout the code, but without all the potential overhead and drawbacks. This patch converts all the ASSERT statements to evaluate as WARN_ON(1) statements and hence if they fail dump a warning and a stack trace to the log. This has minimal overhead and does not change any algorithms, and will allow us to find strange "out of bounds" problems more easily on production machines. There are a few places where assert statements contain debug only code. These are converted to be debug-or-warn only code so that we still get all the assert checks in the code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-07 18:45:36 -05:00
Linus Torvalds	c8d8566952	xfs: update for v3.10-rc1 For 3.10-rc1 we have a number of bug fixes and cleanups and a currently experimental feature from David Chinner, CRCs protection for metadata. CRCs are enabled by using mkfs.xfs to create a filesystem with the feature bits set. * numerous fixes for speculative preallocation * don't verify buffers on IO errors * rename of random32 to prandom32 * refactoring/rearrangement in xfs_bmap.c * removal of unused m_inode_shrink in struct xfs_mount * fix error handling of xfs_bufs and readahead * quota driven preallocation throttling * fix WARN_ON in xfs_vm_releasepage * add ratelimited printk for different alert levels * fix spurious forced shutdowns due to freed Extent Free Intents * remove some obsolete XLOG_CIL_HARD_SPACE_LIMIT() macros * remove some obsoleted comments * (experimental) CRC support for metadata -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJRgocuAAoJENaLyazVq6ZOXHUP+wbTG7P1cX33AeG9PErEJduU dpwzDmDn1n41pA5AY3/i5sm67qYSpF793LgI95n+wPxOh0LTDdKPqrLSEh5GAQ2c bAaax4UTmwFI8bnnHC2zcexyZX0tKDgfW8pxQe8i8xEh/bJalLFOLq7wTFfhAcQX 8NqI1BXp6bN7arm37rsUXpOS+mNIXxc5UtpKhREQ1zDQ/J+tQ3dGjnUmmMj4CX7F iRKzezT/5YpuWPX0MGgfuUAbSnNDQ9d4tPumTHEuTYuDtVWrea8ZRGMnXs+dEd4l NoFpeo1R0XaGtWx/4jOnWnmt3D+O3/k03jrFLmoZQKSuBW27jDkE7RRIq7OPmEo2 WVhDOO3I3CzoTGWfQ3BZ78dWF6rU/a5baxPmnkla4o4GIxyycgtARvsQWF97aeKO ImISIIBrBoifaElKOA+bDyP57EMe5DHSHAiMXGxuo/+djhTAxn5GugLwbes0u/sS 95DAsGy4PPOKcFJfHJvS0i64+lw0yFmeGqfcQ9GwsXALvl2QmA79O9wB9qN+AaXY AwC7eeC3xWnG86aPtxmnK8vduEFXWdBZ2ZPZjtr2wVo+FC/46pRhUqK1cUyDQxXH jx5CIyxe+8snRs8eGYu4k6lwVbCH6ICzRhNMtOCB6e4c+bXBun5eAoGP6jln2g3F z+CvMq0/WBFJ+86wzJqz =hHSs -----END PGP SIGNATURE----- Merge tag 'for-linus-v3.10-rc1' of git://oss.sgi.com/xfs/xfs Pull xfs update from Ben Myers: "For 3.10-rc1 we have a number of bug fixes and cleanups and a currently experimental feature from David Chinner, CRCs protection for metadata. CRCs are enabled by using mkfs.xfs to create a filesystem with the feature bits set. - numerous fixes for speculative preallocation - don't verify buffers on IO errors - rename of random32 to prandom32 - refactoring/rearrangement in xfs_bmap.c - removal of unused m_inode_shrink in struct xfs_mount - fix error handling of xfs_bufs and readahead - quota driven preallocation throttling - fix WARN_ON in xfs_vm_releasepage - add ratelimited printk for different alert levels - fix spurious forced shutdowns due to freed Extent Free Intents - remove some obsolete XLOG_CIL_HARD_SPACE_LIMIT() macros - remove some obsoleted comments - (experimental) CRC support for metadata" * tag 'for-linus-v3.10-rc1' of git://oss.sgi.com/xfs/xfs: (46 commits) xfs: fix da node magic number mismatches xfs: Remote attr validation fixes and optimisations xfs: Teach dquot recovery about CONFIG_XFS_QUOTA xfs: add metadata CRC documentation xfs: implement extended feature masks xfs: add CRC checks to the superblock xfs: buffer type overruns blf_flags field xfs: add buffer types to directory and attribute buffers xfs: add CRC protection to remote attributes xfs: split remote attribute code out xfs: add CRCs to attr leaf blocks xfs: add CRCs to dir2/da node blocks xfs: shortform directory offsets change for dir3 format xfs: add CRC checking to dir2 leaf blocks xfs: add CRC checking to dir2 data blocks xfs: add CRC checking to dir2 free blocks xfs: add CRC checks to block format directory blocks xfs: add CRC checks to remote symlinks xfs: split out symlink code into it's own file. xfs: add version 3 inode format with CRCs ...	2013-05-02 14:49:33 -07:00
Dave Chinner	cab09a81fb	xfs: fix da node magic number mismatches Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-01 14:48:30 -05:00
Dave Chinner	946217ba28	xfs: Remote attr validation fixes and optimisations - optimise the calcuation for the number of blocks in a remote xattr. - check attribute length against MAX_XATTR_SIZE, not MAXPATHLEN - whitespace fixes Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-05-01 14:13:55 -05:00
Dave Chinner	123887e843	xfs: Teach dquot recovery about CONFIG_XFS_QUOTA Fix a build error when CONFIG_XFS_QUOTA=n: fs/built-in.o: In function `xlog_recovery_validate_buf_type': /home/dave/src/build/x86-64/xfsdev/fs/xfs/xfs_log_recover.c:1948: undefined reference to `xfs_dquot_buf_ops' Reported-by: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-30 13:07:37 -05:00
Dave Chinner	e721f504cf	xfs: implement extended feature masks The version 5 superblock has extended feature masks for compatible, incompatible and read-only compatible feature sets. Implement the masking and mount-time checking for these feature masks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 13:05:18 -05:00
Dave Chinner	04a1e6c5b2	xfs: add CRC checks to the superblock With the addition of CRCs, there is such a wide and varied change to the on disk format that it makes sense to bump the superblock version number rather than try to use feature bits for all the new functionality. This commit introduces all the new superblock fields needed for all the new functionality: feature masks similar to ext4, separate project quota inodes, a LSN field for recovery and the CRC field. This commit does not bump the superblock version number, however. That will be done as a separate commit at the end of the series after all the new functionality is present so we switch it all on in one commit. This means that we can slowly introduce the changes without them being active and hence maintain bisectability of the tree. This patch is based on a patch originally written by myself back from SGI days, which was subsequently modified by Christoph Hellwig. There is relatively little of that patch remaining, but the history of the patch still should be acknowledged here. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 13:03:12 -05:00
Dave Chinner	61fe135c1d	xfs: buffer type overruns blf_flags field The buffer type passed to log recvoery in the buffer log item overruns the blf_flags field. I had assumed that flags field was a 32 bit value, and it turns out it is a unisgned short. Therefore having 19 flags doesn't really work. Convert the buffer type field to numeric value, and use the top 5 bits of the flags field for it. We currently have 17 types of buffers, so using 5 bits gives us plenty of room for expansion in future.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 13:01:58 -05:00
Dave Chinner	d75afeb3d3	xfs: add buffer types to directory and attribute buffers Add buffer types to the buffer log items so that log recovery can validate the buffers and calculate CRCs correctly after the buffers are recovered. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 13:01:06 -05:00
Dave Chinner	d2e448d5fd	xfs: add CRC protection to remote attributes There are two ways of doing this - the first is to add a CRC to the remote attribute entry in the attribute block. The second is to treat them similar to the remote symlink, where each fragment has it's own header and identifies fragment location in the attribute. The problem with the CRC in the remote attr entry is that we cannot identify the owner of the metadata from the metadata blocks themselves, or where the blocks fit into the remote attribute. The down side to this approach is that we never know when the attribute has been read from disk or not and so we have to verify it every time it is read, and we must calculate it during the create transaction and log it. We do not log CRCs for any other metadata, and so this creates a unique set of coherency problems that, in general, are best avoided. Adding an identifying header to each allocated block allows us to identify each fragment and where in the attribute it is located. It enables us to rebuild the remote attribute from just the raw blocks containing the attribute. It also provides us to do per-block CRCs verification at IO time rather than during the transaction context that creates it or every time it is read into a user buffer. Hence it avoids all the problems that an external, logged CRC has, and provides all the benefits of self identifying metadata. The only complexity is that we have to add a header per fragment, and we don't know how many fragments will be needed prior to allocations. If we take the symlink example, the header is 56 bytes and hence for a 4k block size filesystem, in the worst case 16 headers requires 1 extra block for the 64k attribute data. For 512 byte filesystems the worst case is an extra block for every 9 fragments (i.e. 16 extra blocks in the worse case). This will be very rare and so it's not really a major concern. Because allocation is done in two steps - the first finds a hole large enough in the attribute file, the second does the allocation - we only need to find a hole big enough for a worst case allocation. We only need to allocate enough extra blocks for number of headers required by the fragments, and we can calculate that as we go.... Hence it really only makes sense to use the same model as for symlinks - it doesn't add that much complexity, does not require an attribute tree format change, and does not require logging calculated CRC values. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 12:58:53 -05:00
Dave Chinner	95920cd6ce	xfs: split remote attribute code out Adding CRC support to remote attributes adds a significant amount of remote attribute specific code. Split the existing remote attribute code out into it's own file so that all the relevant remote attribute code is in a single, easy to find place. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 12:49:32 -05:00
Dave Chinner	517c22207b	xfs: add CRCs to attr leaf blocks Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 12:45:01 -05:00
Dave Chinner	f5ea110044	xfs: add CRCs to dir2/da node blocks Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 12:33:38 -05:00
Dave Chinner	6b2647a12a	xfs: shortform directory offsets change for dir3 format Because the header size for the CRC enabled directory blocks is larger, the offset of the first entry into a directory block is different to the dir2 format. The shortform directory stores the dirent's offset so that it doesn't change when moving from shortform to block form and back again, and hence it needs to take into account the different header sizes to maintain the correct offsets. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 12:24:32 -05:00
Dave Chinner	24df33b45e	xfs: add CRC checking to dir2 leaf blocks This addition follows the same pattern as the dir2 block CRCs. Seeing as both LEAF1 and LEAFN types need to changed at the same time, this is a pretty large amount of change. leaf block headers need to be abstracted away from the on-disk structures (struct xfs_dir3_icleaf_hdr), as do the base leaf entry locations. This header abstract allows the in-core header and leaf entry location to be passed around instead of the leaf block itself. This saves a lot of converting individual variables from on-disk format to host format where they are used, so there's a good chance that the compiler will be able to produce much more optimal code as it's not having to byteswap variables all over the place. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 12:19:53 -05:00
Dave Chinner	33363feed1	xfs: add CRC checking to dir2 data blocks This addition follows the same pattern as the dir2 block CRCs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 12:00:00 -05:00
Dave Chinner	cbc8adf897	xfs: add CRC checking to dir2 free blocks This addition follows the same pattern as the dir2 block CRCs, but with a few differences. The main difference is that the free block header is different between the v2 and v3 formats, so an "in-core" free block header has been added and _todisk/_from_disk functions used to abstract the differences in structure format from the code. This is similar to the on-disk superblock versus the in-core superblock setup. The in-core strucutre is populated when the buffer is read from disk, all the in memory checks and modifications are done on the in-core version of the structure which is written back to the buffer before the buffer is logged. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 11:58:16 -05:00
Dave Chinner	f5f3d9b016	xfs: add CRC checks to block format directory blocks Now that directory buffers are made from a single struct xfs_buf, we can add CRC calculation and checking callbacks. While there, add all the fields to the on disk structures for future functionality such as d_type support, uuids, block numbers, owner inode, etc. To distinguish between the different on disk formats, change the magic numbers for the new format directory blocks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 11:51:56 -05:00
Dave Chinner	f948dd76dd	xfs: add CRC checks to remote symlinks Add a header to the remote symlink block, containing location and owner information, as well as CRCs and LSN fields. This requires verifiers to be added to the remote symlink buffers for CRC enabled filesystems. This also fixes a bug reading multiple block symlinks, where the second block overwrites the first block when copying out the link name. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-27 11:49:28 -05:00
Dave Chinner	19de7351a8	xfs: split out symlink code into it's own file. The symlink code is about to get more complicated when CRCs are added for remote symlink blocks. The symlink management code is mostly self contained, so move it to it's own files so that all the new code and the existing symlink code will not be intermingled with other unrelated code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-21 15:38:04 -05:00
Christoph Hellwig	93848a999c	xfs: add version 3 inode format with CRCs Add a new inode version with a larger core. The primary objective is to allow for a crc of the inode, and location information (uuid and ino) to verify it was written in the right place. We also extend it by: a creation time (for Samba); a changecount (for NFSv4); a flush sequence (in LSN format for recovery); an additional inode flags field; and some additional padding. These additional fields are not implemented yet, but already laid out in the structure. [dchinner@redhat.com] Added LSN and flags field, some factoring and rework to capture all the necessary information in the crc calculation. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-21 15:03:33 -05:00
Christoph Hellwig	3fe58f30b4	xfs: add CRC checks for quota blocks Use the reserved space in struct xfs_dqblk to store a UUID and a crc for the quota blocks. [dchinner@redhat.com] Add a LSN field and update for current verifier infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-21 14:58:22 -05:00
Dave Chinner	983d09ffe3	xfs: add CRC checks to the AGI Same set of changes made to the AGF need to be made to the AGI. This patch has a similar history to the AGF, hence a similar sign-off chain. Signed-off-by: Dave Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dgc@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-21 14:57:43 -05:00
Christoph Hellwig	77c95bba01	xfs: add CRC checks to the AGFL Add CRC checks, location information and a magic number to the AGFL. Previously the AGFL was just a block containing nothing but the free block pointers. The new AGFL has a real header with the usual boilerplate instead, so that we can verify it's not corrupted and written into the right place. [dchinner@redhat.com] Added LSN field, reworked significantly to fit into new verifier structure and growfs structure, enabled full verifier functionality now there is a header to verify and we can guarantee an initialised AGFL. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-21 14:55:34 -05:00
Dave Chinner	4e0e6040c4	xfs: add CRC checks to the AGF The AGF already has some self identifying fields (e.g. the sequence number) so we only need to add the uuid to it to identify the filesystem it belongs to. The location is fixed based on the sequence number, so there's no need to add a block number, either. Hence the only additional fields are the CRC and LSN fields. These are unlogged, so place some space between the end of the logged fields and them so that future expansion of the AGF for logged fields can be placed adjacent to the existing logged fields and hence not complicate the field-derived range based logging we currently have. Based originally on a patch from myself, modified further by Christoph Hellwig and then modified again to fit into the verifier structure with additional fields by myself. The multiple signed-off-by tags indicate the age and history of this patch. Signed-off-by: Dave Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-21 14:54:46 -05:00
Christoph Hellwig	ee1a47ab0e	xfs: add support for large btree blocks Add support for larger btree blocks that contains a CRC32C checksum, a filesystem uuid and block number for detecting filesystem consistency and out of place writes. [dchinner@redhat.com] Also include an owner field to allow reverse mappings to be implemented for improved repairability and a LSN field to so that log recovery can easily determine the last modification that made it to disk for each buffer. [dchinner@redhat.com] Add buffer log format flags to indicate the type of buffer to recovery so that we don't have to do blind magic number tests to determine what the buffer is. [dchinner@redhat.com] Modified to fit into the verifier structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-21 14:53:46 -05:00
Dave Chinner	a2050646f6	xfs: increase hexdump output in xfs_corruption_error Currently xfs_corruption_error() dumps the first 16 bytes of the buffer that is passed to it when a corruption occurs. This is not large enough to see the entire state of the header of the block that was determined to be corrupt. increase the output to 64 bytes to capture the majority of all headers in all types of metadata blocks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-21 14:48:41 -05:00
Jeff Liu	7fe3258c50	xfs: Update xfs_log_commit_cil() comments xfs_log_commit_iclog() function has been removed by commits `93b8a585`: xfs: remove the deprecated nodelaylog option Beginning from Linux 3.3, only delayed logging is supported so that we call xfs_log_commit_cil() at xfs_trans_commit() only, remove the useless comments so. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-16 13:20:03 -05:00
Jeff Liu	d4fd0e92fb	xfs: Remove the obsolete XLOG_CIL_HARD_SPACE_LIMIT() macros There is no more users of this Macro, so it's time to kill it dead. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-16 13:18:33 -05:00
Al Viro	8d71db4f08	lift sb_start_write/sb_end_write out of ->aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-04-09 14:12:55 -04:00
Dave Chinner	666d644cd7	xfs: don't free EFIs before the EFDs are committed Filesystems are occasionally being shut down with this error: xfs_trans_ail_delete_bulk: attempting to delete a log item that is not in the AIL. It was diagnosed to be related to the EFI/EFD commit order when the EFI and EFD are in different checkpoints and the EFD is committed before the EFI here: http://oss.sgi.com/archives/xfs/2013-01/msg00082.html The real problem is that a single bit cannot fully describe the states that the EFI/EFD processing can be in. These completion states are: EFI EFI in AIL EFD Result committed/unpinned Yes committed OK committed/pinned No committed Shutdown uncommitted No committed Shutdown Note that the "result" field is what should happen, not what does happen. The current logic is broken and handles the first two cases correctly by luck. That is, the code will free the EFI if the XFS_EFI_COMMITTED bit is not set, rather than if it is set. The inverted logic "works" because if both EFI and EFD are committed, then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED bit, and the second frees the EFI item. Hence as long as xfs_efi_item_committed() has been called, everything appears to be fine. It is the third case where the logic fails - where xfs_efd_item_committed() is called before xfs_efi_item_committed(), and that results in the EFI being freed before it has been committed. That is the bug that triggered the shutdown, and hence keeping track of whether the EFI has been committed or not is insufficient to correctly order the EFI/EFD operations w.r.t. the AIL. What we really want is this: the EFI is always placed into the AIL before the last reference goes away. The only way to guarantee that is that the EFI is not freed until after it has been unpinned and the EFD has been committed. That is, restructure the logic so that the only case that can occur is the first case. This can be done easily by replacing the XFS_EFI_COMMITTED with an EFI reference count. The EFI is initialised with it's own count, and that is not released until it is unpinned. However, there is a complication to this method - the high level EFI/EFD code in xfs_bmap_finish() does not hold direct references to the EFI structure, and runs a transaction commit between the EFI and EFD processing. Hence the EFI can be freed even before the EFD is created using such a method. Further, log recovery uses the AIL for tracking EFI/EFDs that need to be recovered, but it uses the AIL differently to the EFI transaction commit. Hence log recovery never pins or unpins EFIs, so we can't drop the EFI reference count indirectly to free the EFI. However, this doesn't prevent us from using a reference count here. There is a 1:1 relationship between EFIs and EFDs, so when we initialise the EFI we can take a reference count for the EFD as well. This solves the xfs_bmap_finish() issue - the EFI will never be freed until the EFD is processed. In terms of log recovery, during the committing of the EFD we can look for the XFS_EFI_RECOVERED bit being set and drop the EFI reference as well, thereby ensuring everything works correctly there as well. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-05 13:25:35 -05:00
Rich Johnston	3d6e036193	xfs: Add ratelimited printk for different alert levels Ratelimited printk will be useful in printing xfs messages which are otherwise not required to be printed always due to their high rate (to prevent kernel ring buffer from overflowing), while at the same time required to be printed. Signed-off-by: Raghavendra D Prabhu <rprabhu@wnohang.net> Reviewed-by: Rich Johnston <rjohnston@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-04-03 13:20:39 -05:00
Jan Kara	ff9a28f6c2	xfs: Fix WARN_ON(delalloc) in xfs_vm_releasepage() When a dirty page is truncated from a file but reclaim gets to it before truncate_inode_pages(), we hit WARN_ON(delalloc) in xfs_vm_releasepage(). This is because reclaim tries to write the page, xfs_vm_writepage() just bails out (leaving page clean) and thus reclaim thinks it can continue and calls xfs_vm_releasepage() on page with dirty buffers. Fix the issue by redirtying the page in xfs_vm_writepage(). This makes reclaim stop reclaiming the page and also logically it keeps page in a more consistent state where page with dirty buffers has PageDirty set. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-22 16:12:37 -05:00
Brian Foster	19cb7e3854	xfs: xfs_iomap_prealloc_size() tracepoint Add a tracepoint to provide some feedback on preallocation size calculation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-22 16:07:56 -05:00
Brian Foster	76a4202a38	xfs: add quota-driven speculative preallocation throttling Introduce the need_throttle() and calc_throttle() functions to independently check whether throttling is required for a particular dquot and if so, calculate the associated throttling metrics based on the state of the quota. We use the same general algorithm to calculate the throttle shift as for global free space with the exception of using three stages rather than five. Update xfs_iomap_prealloc_size() to use the smallest available prealloc size based on each of the constraints and apply the maximum shift to obtain the throttled preallocation size. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-22 16:07:21 -05:00
Brian Foster	b136645116	xfs: xfs_dquot prealloc throttling watermarks and low free space Enable tracking of high and low watermarks for preallocation throttling of files under quota restrictions. These values are calculated when the quota limit is read from disk or modified and cached for later use by the throttling algorithm. The high watermark specifies when preallocation is disabled, the low watermark specifies when throttling is enabled and the low free space data structure contains precalculated low free space limits to serve as input to determine the level of throttling required. Note that the low free space data structure is based on the existing global low free space data structure with the exception of using three stages (5%, 3% and 1%) rather than five to reduce the impact of xfs_dquot memory overhead. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-22 16:06:30 -05:00
Brian Foster	4b6eae2e6a	xfs: pass xfs_dquot to xfs_qm_adjust_dqlimits() instead of xfs_disk_dquot_t Modify xfs_qm_adjust_dqlimits() to take the xfs_dquot as a parameter instead of just the xfs_disk_dquot_t so we can update in-memory fields if necessary. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-22 16:05:52 -05:00
Brian Foster	c9bdbdc074	xfs: push rounddown_pow_of_two() to after prealloc throttle The round down occurs towards the beginning of the function. Push it down after throttling has occurred. This is to support adding further transformations to 'alloc_blocks' that might not preserve power-of-two alignment (and thus could lead to rounding down multiple times). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-22 16:05:00 -05:00
Brian Foster	3c58b5f809	xfs: reorganize xfs_iomap_prealloc_size to remove indentation The majority of xfs_iomap_prealloc_size() executes within the check for lack of default I/O size. Reverse the logic to remove the extra indentation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-22 16:04:23 -05:00
Linus Torvalds	10b38669d6	- Fix for a potential infinite loop which was introduced in `4d559a3bcb` - Fix for the return type of xfs_iomap_eof_prealloc_initial_size from `a1e16c2666` - Fix for a failed buffer readahead causing subsequent callers to fail incorrectly -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJRSOIAAAoJENaLyazVq6ZODqQP/2m1iZVIA9CXFf5hS2QZgkc2 MHq+QaQ1aaZlAIRCnZO4XrWoLw4tH7AmsHA7dVJVz/ZhVrJg4ahfdSS6qR5EGWFb I5uE8LD8ZhpIiW6mBytJ7g9ST6xnaeean2sMwa0BcVK3uF84nO/uBopntZVrVlZE sMuklZe8GfxDpF6SBxVGG+5+OeLXzFmf+s+xoCYN410uuzYoT8/jveFP6a5ARcmH xEcOJA2+3o2z4/fsdx/Euf6LnDMSyOsAFUJCtnmBdKUA5w9DrJJqGpDDPEkg9h6d /DTPYXEWx6+w4xoMnIf09oEdCSamBVTWcRFXtftN03VNrbRNtyVwAc8HUaSNmt0p I3P/b5NJ5guH7uK72jp61N2RP7D5KOqwkwR58Y1SJWuwcgatYuB3NM5UeUyJBILj ViZ4DsKGE6BCl8T3hwkN+mxSxB+o7O8AypjWdEviBXbVIG9CwOxr1IEatl3eyV5T 8QsNFb0LJcWzl1+F/uUYe1Goeqxvzupt7omUaRONdMnac3uFIk0ARrdxXFgawIJ9 lgeftBCmMkqqLZUACSfmfCYNwyupz3E6bYB7Azwx01qg7CzTPUfIL2SxqDYp2dup /s+R7HL4HOJ0FCzjCZxHHO/1jsWgu265dJdpaQw/UcIe2IuEFGr558deHEM62bDW rWCVHj5eY5NRGyzSwzqB =41Vk -----END PGP SIGNATURE----- Merge tag 'for-linus-v3.9-rc4' of git://oss.sgi.com/xfs/xfs Pull XFS fixes from Ben Myers: - Fix for a potential infinite loop which was introduced in commit `4d559a3bcb` ("xfs: limit speculative prealloc near ENOSPC thresholds") - Fix for the return type of xfs_iomap_eof_prealloc_initial_size from commit `a1e16c2666` ("xfs: limit speculative prealloc size on sparse files") - Fix for a failed buffer readahead causing subsequent callers to fail incorrectly * tag 'for-linus-v3.9-rc4' of git://oss.sgi.com/xfs/xfs: xfs: ensure we capture IO errors correctly xfs: fix xfs_iomap_eof_prealloc_initial_size type xfs: fix potential infinite loop in xfs_iomap_prealloc_size()	2013-03-19 15:17:40 -07:00
Dave Chinner	e001873853	xfs: ensure we capture IO errors correctly Failed buffer readahead can leave the buffer in the cache marked with an error. Most callers that then issue a subsequent read on the buffer do not zero the b_error field out, and so we may incorectly detect an error during IO completion due to the stale error value left on the buffer. Avoid this problem by zeroing the error before IO submission. This ensures that the only IO errors that are detected those captured from are those captured from bio submission or completion. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `c163f9a176`)	2013-03-18 13:39:10 -05:00
Mark Tinguely	3325beed46	xfs: fix xfs_iomap_eof_prealloc_initial_size type Fix the return type of xfs_iomap_eof_prealloc_initial_size() to xfs_fsblock_t to reflect the fact that the return value may be an unsigned 64 bits if XFS_BIG_BLKNOS is defined. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `e8108cedb1`)	2013-03-18 13:38:50 -05:00
Brian Foster	83cdadd8b0	xfs: fix potential infinite loop in xfs_iomap_prealloc_size() If freesp == 0, we could end up in an infinite loop while squashing the preallocation. Break the loop when we've killed the prealloc entirely. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `e78c420bfc`)	2013-03-18 13:30:38 -05:00
Christoph Hellwig	56cea2d088	xfs: take inode version into account in XFS_LITINO Add a version argument to XFS_LITINO so that it can return different values depending on the inode version. This is required for the upcoming v3 inodes with a larger fixed layout dinode. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-14 16:19:14 -05:00
Dave Chinner	c163f9a176	xfs: ensure we capture IO errors correctly Failed buffer readahead can leave the buffer in the cache marked with an error. Most callers that then issue a subsequent read on the buffer do not zero the b_error field out, and so we may incorectly detect an error during IO completion due to the stale error value left on the buffer. Avoid this problem by zeroing the error before IO submission. This ensures that the only IO errors that are detected those captured from are those captured from bio submission or completion. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-14 15:56:53 -05:00
Jeff Liu	d8ddfe81c7	xfs: Remove obsoleted m_inode_shrink from xfs_mount structure Looks the old m_inode_shrink is obsoleted as we perform inodes reclaim per AG via m_reclaim_workqueue, this patch remove it from the xfs_mount structure if so. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-14 15:55:32 -05:00
Dave Chinner	9e5987a779	xfs: rearrange some code in xfs_bmap for better locality xfs_bmap.c is a big file, and some of the related code is spread all throughout the file requiring function prototypes for static function and jumping all through the file to follow a single call path. Rearrange the code so that: a) related functionality is grouped together; and b) functions are grouped in call dependency order While the diffstat is large, there are no code changes in the patch; it is just moving the functionality around and removing the function prototypes at the top of the file. The resulting layout of the code is as follows (top of file to bottom): - miscellaneous helper functions - extent tree block counting routines - debug/sanity checking code - bmap free list manipulation functions - inode fork format manipulation functions - internal/external extent tree seach functions - extent tree manipulation functions used during allocation - functions used during extent read/allocate/removal operations (i.e. xfs_bmapi_write, xfs_bmapi_read, xfs_bunmapi and xfs_getbmap) This means that following logic paths through the bmapi code is much simpler - most of the code relevant to a specific operation is now clustered together rather than spread all over the file.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-07 12:35:22 -06:00
Akinobu Mita	ecb3403de1	xfs: rename random32() to prandom_u32() Use more preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: <bpm@sgi.com> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: xfs@oss.sgi.com Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-07 12:33:57 -06:00
Dave Chinner	d5929de833	xfs: don't verify buffers after IO errors When we read a buffer, we might get an error from the underlying block device and not the real data. Hence if we get an IO error, we shouldn't run the verifier but instead just pass the IO error straight through. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-07 12:31:02 -06:00
Mark Tinguely	e8108cedb1	xfs: fix xfs_iomap_eof_prealloc_initial_size type Fix the return type of xfs_iomap_eof_prealloc_initial_size() to xfs_fsblock_t to reflect the fact that the return value may be an unsigned 64 bits if XFS_BIG_BLKNOS is defined. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-07 12:29:59 -06:00
Brian Foster	e114b5fce6	xfs: increase prealloc size to double that of the previous extent The updated speculative preallocation algorithm for handling sparse files can becomes less effective in situations with a high number of concurrent, sequential writers. The number of writers and amount of available RAM affect the writeback bandwidth slicing algorithm, which in turn affects the block allocation pattern of XFS. For example, running 32 sequential writers on a system with 32GB RAM, preallocs become fixed at a value of around 128MB (instead of steadily increasing to the 8GB maximum as sequential writes proceed). Update the speculative prealloc heuristic to base the size of the next prealloc on double the size of the preceding extent. This preserves the original aggressive speculative preallocation behavior and continues to accomodate sparse files at a slight cost of increasing the size of preallocated data regions following holes of sparse files. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-07 12:28:25 -06:00
Brian Foster	e78c420bfc	xfs: fix potential infinite loop in xfs_iomap_prealloc_size() If freesp == 0, we could end up in an infinite loop while squashing the preallocation. Break the loop when we've killed the prealloc entirely. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-03-07 12:21:39 -06:00
Eric W. Biederman	7f78e03513	fs: Limit sys_mount to only request filesystem modules. Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Reported-by: Kees Cook <keescook@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-03-03 19:36:31 -08:00
Sasha Levin	b67bfe0d42	hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S \| hlist_for_each_entry_continue(a, - b, c) S \| hlist_for_each_entry_from(a, - b, c) S \| hlist_for_each_entry_rcu(a, - b, c, d) S \| hlist_for_each_entry_rcu_bh(a, - b, c, d) S \| hlist_for_each_entry_continue_rcu_bh(a, - b, c) S \| for_each_busy_worker(a, c, - b, d) S \| ax25_uid_for_each(a, - b, c) S \| ax25_for_each(a, - b, c) S \| inet_bind_bucket_for_each(a, - b, c) S \| sctp_for_each_hentry(a, - b, c) S \| sk_for_each(a, - b, c) S \| sk_for_each_rcu(a, - b, c) S \| sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S \| sk_for_each_safe(a, - b, c, d) S \| sk_for_each_bound(a, - b, c) S \| hlist_for_each_entry_safe(a, - b, c, d, e) S \| hlist_for_each_entry_continue_rcu(a, - b, c) S \| nr_neigh_for_each(a, - b, c) S \| nr_neigh_for_each_safe(a, - b, c, d) S \| nr_node_for_each(a, - b, c) S \| nr_node_for_each_safe(a, - b, c, d) S \| - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S \| - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S \| for_each_host(a, - b, c) S \| for_each_host_safe(a, - b, c, d) S \| for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:24 -08:00
Linus Torvalds	d895cb1af1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle and ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...	2013-02-26 20:16:07 -08:00
Namjae Jeon	94e07a7590	fs: encode_fh: return FILEID_INVALID if invalid fid_type This patch is a follow up on below patch: [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type commit: `216b6cbdcb` Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Vivek Trivedi <t.vivek@samsung.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Sage Weil <sage@inktank.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-26 02:46:10 -05:00
Al Viro	496ad9aa8e	new helper: file_inode(file) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-22 23:31:31 -05:00
Linus Torvalds	06991c28f3	Driver core patches for 3.9-rc1 Here is the big driver core merge for 3.9-rc1 There are two major series here, both of which touch lots of drivers all over the kernel, and will cause you some merge conflicts: - add a new function called devm_ioremap_resource() to properly be able to check return values. - remove CONFIG_EXPERIMENTAL If you need me to provide a merged tree to handle these resolutions, please let me know. Other than those patches, there's not much here, some minor fixes and updates. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iEYEABECAAYFAlEmV0cACgkQMUfUDdst+yncCQCfbmnQZju7kzWXk6PjdFuKspT9 weAAoMCzcAtEzzc4LXuUxxG/sXBVBCjW =yWAQ -----END PGP SIGNATURE----- Merge tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core patches from Greg Kroah-Hartman: "Here is the big driver core merge for 3.9-rc1 There are two major series here, both of which touch lots of drivers all over the kernel, and will cause you some merge conflicts: - add a new function called devm_ioremap_resource() to properly be able to check return values. - remove CONFIG_EXPERIMENTAL Other than those patches, there's not much here, some minor fixes and updates" Fix up trivial conflicts * tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits) base: memory: fix soft/hard_offline_page permissions drivercore: Fix ordering between deferred_probe and exiting initcalls backlight: fix class_find_device() arguments TTY: mark tty_get_device call with the proper const values driver-core: constify data for class_find_device() firmware: Ignore abort check when no user-helper is used firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER firmware: Make user-mode helper optional firmware: Refactoring for splitting user-mode helper code Driver core: treat unregistered bus_types as having no devices watchdog: Convert to devm_ioremap_resource() thermal: Convert to devm_ioremap_resource() spi: Convert to devm_ioremap_resource() power: Convert to devm_ioremap_resource() mtd: Convert to devm_ioremap_resource() mmc: Convert to devm_ioremap_resource() mfd: Convert to devm_ioremap_resource() media: Convert to devm_ioremap_resource() iommu: Convert to devm_ioremap_resource() drm: Convert to devm_ioremap_resource() ...	2013-02-21 12:05:51 -08:00
Dave Chinner	1e82379b01	xfs: xfs_bmap_add_attrfork_local is too generic When we are converting local data to an extent format as a result of adding an attribute, the type of data contained in the local fork determines the behaviour that needs to occur. xfs_bmap_add_attrfork_local() already handles the directory data case specially by using S_ISDIR() and calling out to xfs_dir2_sf_to_block(), but with verifiers we now need to handle each different type of metadata specially and different metadata formats require different verifiers (and eventually block header initialisation). There is only a single place that we add and attribute fork to the inode, but that is in the attribute code and it knows nothing about the specific contents of the data fork. It is only the case of local data that is the issue here, so adding code to hadnle this case in the attribute specific code is wrong. Hence we are really stuck trying to detect the data fork contents in xfs_bmap_add_attrfork_local() and performing the correct callout there. Luckily the current cases can be determined by S_IS* macros, and we can push the work off to data specific callouts, but each of those callouts does a lot of work in common with xfs_bmap_local_to_extents(). The only reason that this fails for symlinks right now is is that xfs_bmap_local_to_extents() assumes the data fork contains extent data, and so attaches a a bmap extent data verifier to the buffer and simply copies the data fork information straight into it. To fix this, allow us to pass a "formatting" callback into xfs_bmap_local_to_extents() which is responsible for setting the buffer type, initialising it and copying the data fork contents over to the new buffer. This allows callers to specify how they want to format the new buffer (which is necessary for the upcoming CRC enabled metadata blocks) and hence make xfs_bmap_local_to_extents() useful for any type of data fork content. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-14 17:35:51 -06:00
Brian Foster	fa5566e4ff	xfs: remove log force from xfs_buf_trylock() The trylock log force invoked via xfs_buf_item_push() can attempt to acquire xa_lock, thus leading to a recursion bug when called with xa_lock held. This log force was originally added to xfs_buf_trylock() to address xfsaild stalls due to pinned and stale buffers. Since the addition of this behavior, the log item pushing code had been reworked to detect and track pinned items to inform xfsaild to issue a log force itself when necessary. As such, the log force on trylock failure is redundant and safe to remove. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-14 17:24:53 -06:00
Brian Foster	5337fe9b10	xfs: recheck buffer pinned status after push trylock failure The buffer pinned check and trylock sequence in xfs_buf_item_push() can race with an active transaction on marking the buffer pinned. This can result in the buffer becoming pinned and stale after the initial check and the trylock failure, but before the check in xfs_buf_trylock() that issues a log force. If the log force is issued from this context, a spinlock recursion occurs on xa_lock. Prepare xfs_buf_item_push() to handle the race by detecting a pinned buffer after the trylock failure so xfsaild issues a log force from a safe context. This, along with various previous fixes, renders the log force in xfs_buf_trylock() redundant. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-14 17:23:42 -06:00
Dave Chinner	a1e16c2666	xfs: limit speculative prealloc size on sparse files Speculative preallocation based on the current file size works well for contiguous files, but is sub-optimal for sparse files where the EOF preallocation can fill holes and result in large amounts of zeros being written when it is not necessary. The algorithm is modified to prevent EOF speculative preallocation from triggering larger allocations on IO patterns of truncate--to-zero-seek-write-seek-write-.... which results in non-sparse files for large files. This, unfortunately, is the way cp now behaves when copying sparse files and so needs to be fixed. What this code does is that it looks at the existing extent adjacent to the current EOF and if it determines that it is a hole we disable speculative preallocation altogether. To avoid the next write from doing a large prealloc, it takes the size of subsequent preallocations from the current size of the existing EOF extent. IOWs, if you leave a hole in the file, it resets preallocation behaviour to the same as if it was a zero size file. Example new behaviour: $ xfs_io -f -c "pwrite 0 31m" \ -c "pwrite 33m 1m" \ -c "pwrite 128m 1m" \ -c "fiemap -v" /mnt/scratch/blah wrote 32505856/32505856 bytes at offset 0 31 MiB, 7936 ops; 0.0000 sec (1.608 GiB/sec and 421432.7439 ops/sec) wrote 1048576/1048576 bytes at offset 34603008 1 MiB, 256 ops; 0.0000 sec (1.462 GiB/sec and 383233.5329 ops/sec) wrote 1048576/1048576 bytes at offset 134217728 1 MiB, 256 ops; 0.0000 sec (1.719 GiB/sec and 450704.2254 ops/sec) /mnt/scratch/blah: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..65535]: 96..65631 65536 0x0 1: [65536..67583]: hole 2048 2: [67584..69631]: 67680..69727 2048 0x0 3: [69632..262143]: hole 192512 4: [262144..264191]: 262240..264287 2048 0x1 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-14 17:21:32 -06:00
Alex Elder	311f08acde	xfs: memory barrier before wake_up_bit() In xfs_ifunlock() there is a call to wake_up_bit() after clearing the flush lock on the xfs inode. This is not guaranteed to be safe, as noted in the comments above wake_up_bit() beginning with: In order for this to function properly, as it uses waitqueue_active() internally, some kind of memory barrier must be done prior to calling this. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-07 09:39:48 -06:00
Jeff Liu	a21cd50367	xfs: refactor space log reservation for XFS_TRANS_ATTR_SET Currently, we calculate the attribute set transaction log space reservation at runtime in two parts: 1) XFS_ATTRSET_LOG_RES() which is calcuated out at mount time. 2) ((ext * (mp)->m_sb.sb_sectsize) + \ (ext * XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))) + \ (128 * (ext + (ext * XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)))))) which is calculated out at runtime since it depend on the given extent length in blocks. This patch renamed XFS_ATTRSET_LOG_RES(mp) to XFS_ATTRSETM_LOG_RES(mp) to indicate that it is figured out at mount time. Introduce XFS_ATTRSETRT_LOG_RES(mp) which would be used to calculate out the unit of the log space reservation for one block. In this way, the total runtime space for the given extent length can be figured out by: XFS_ATTRSETM_LOG_RES(mp) + XFS_ATTRSETRT_LOG_RES(mp) * ext Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:56:31 -06:00
Jeff Liu	762c585b18	xfs: make use of XFS_SB_LOG_RES() at xfs_fs_log_dummy() Make use of XFS_SB_LOG_RES() at xfs_fs_log_dummy(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:55:59 -06:00
Jeff Liu	5166ab0655	xfs: make use of XFS_SB_LOG_RES() at xfs_mount_log_sb() Make use of XFS_SB_LOG_RES() at xfs_mount_log_sb(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:55:08 -06:00
Jeff Liu	e457274b60	xfs: make use of XFS_SB_LOG_RES() at xfs_log_sbcount() Make use of XFS_SB_LOG_RES() at xfs_log_sbcount(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:47:18 -06:00
Jeff Liu	a7bd794a0f	xfs: introduce XFS_SB_LOG_RES() for transactions that modify sb on disk Introduce a new transaction space reservation XFS_SB_LOG_RES() for those transactions that need to modify the superblock on disk. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:46:35 -06:00
Jeff Liu	762d7ba657	xfs: calculate XFS_TRANS_QM_QUOTAOFF_END space log reservation at mount time Convert the calculation for end of quotaoff log space reservation from runtime to mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:45:50 -06:00
Jeff Liu	a1bd955754	xfs: calculate XFS_TRANS_QM_QUOTAOFF space log reservation at mount time Convert the calculation of quota off transaction log space reservation from runtime to mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:44:29 -06:00
Jeff Liu	4800104438	xfs: calculate XFS_TRANS_QM_DQALLOC space log reservation at mount time The disk quota allocation log space reservation is calcuated at runtime, this patch does it at mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:43:51 -06:00
Jeff Liu	f0f2df94fa	xfs: calcuate XFS_TRANS_QM_SETQLIM space log reservation at mount time For adjusting quota limits transactions, we calculate out the log space reservation at runtime, this patch does it at mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:43:11 -06:00
Jeff Liu	f910a8c620	xfs: calculate xfs_qm_write_sb_changes() space log reservation at mount time For the transaction that write the incore superblock changes of quota flags to disk, it would reserve the same log space to clear/reset quota flags transaction, hence we can use XFS_TRANS_SBCHANGE_LOG_RES() for it as well. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:42:32 -06:00
Jeff Liu	b0c10b983a	xfs: calculate XFS_TRANS_QM_SBCHANGE space log reservation at mount time The transaction log space for clearing/reseting the quota flags is calculated out at runtime, this patch can figure it out at mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:40:17 -06:00
Jeff Liu	5b292ae3a9	xfs: make use of xfs_calc_buf_res() in xfs_trans.c Refining the existing reservations with xfs_calc_buf_res() in xfs_trans.c Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:39:29 -06:00
Jeff Liu	4f3b57832b	xfs: add a helper to figure out the space log reservation per item Add a new helper xfs_calc_buf_res() to calcuate out the transaction space reservations per item. xfs_buf_log_overhead() is used to figure out the extra space for struct xfs_buf_log_format that gets written into the log for every buffer as well as a log opheader, i.e. struct xlog_op_header. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:35:06 -06:00
Torsten Kaiser	2729423cf2	xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages() Commit `fb59581404` removed xfs_flushinval_pages() and changed its callers to use filemap_write_and_wait() and truncate_pagecache_range() directly. But in xfs_swap_extents() this change accidental switched the argument for 'tip' to 'ip'. This patch switches it back to 'tip' Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 13:50:10 -06:00
Jan Kara	ced55f38d6	xfs: Fix possible use-after-free with AIO Running AIO is pinning inode in memory using file reference. Once AIO is completed using aio_complete(), file reference is put and inode can be freed from memory. So we have to be sure that calling aio_complete() is the last thing we do with the inode. CC: xfs@oss.sgi.com CC: Ben Myers <bpm@sgi.com> CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-26 09:43:58 -06:00
Dave Chinner	3b19034d4f	xfs: fix shutdown hang on invalid inode during create When the new inode verify in xfs_iread() fails, the create transaction is aborted and a shutdown occurs. The subsequent unmount then hangs in xfs_wait_buftarg() on a buffer that has an elevated hold count. Debug showed that it was an AGI buffer getting stuck: [ 22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck The trace of this buffer leading up to the shutdown (trimmed for brevity) looks like: xfs_buf_init: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map xfs_buf_get: bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map xfs_buf_read: bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map xfs_buf_iorequest: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest xfs_buf_iowait: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_ioerror: bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io xfs_buf_iodone: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init xfs_trans_read_buf: bno 0x2 len 0x200 hold 2 recur 0 refcount 1 xfs_trans_brelse: bno 0x2 len 0x200 hold 2 recur 0 refcount 1 xfs_buf_item_relse: bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse xfs_buf_unlock: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse xfs_buf_rele: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse xfs_buf_trylock: bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find xfs_buf_find: bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map xfs_buf_get: bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map xfs_buf_read: bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map xfs_buf_hold: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init xfs_trans_read_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1 xfs_trans_log_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1 xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED xfs_buf_unlock: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock xfs_buf_rele: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock And that is the AGI buffer from cold cache read into memory to transaction abort. You can see at transaction abort the bli is dirty and only has a single reference. The item is not pinned, and it's not in the AIL. Hence the only reference to it is this transaction. The problem is that the xfs_buf_item_unlock() call is dropping the last reference to the xfs_buf_log_item attached to the buffer (which holds a reference to the buffer), but it is not freeing the xfs_buf_log_item. Hence nothing will ever release the buffer, and the unmount hangs waiting for this reference to go away. The fix is simple - xfs_buf_item_unlock needs to detect the last reference going away in this case and free the xfs_buf_log_item to release the reference it holds on the buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-26 09:34:38 -06:00
Dave Chinner	4d559a3bcb	xfs: limit speculative prealloc near ENOSPC thresholds There is a window on small filesytsems where specualtive preallocation can be larger than that ENOSPC throttling thresholds, resulting in specualtive preallocation trying to reserve more space than there is space available. This causes immediate ENOSPC to be triggered, prealloc to be turned off and flushing to occur. One the next write (i.e. next 4k page), we do exactly the same thing, and so effective drive into synchronous 4k writes by triggering ENOSPC flushing on every page while in the window between the prealloc size and the ENOSPC prealloc throttle threshold. Fix this by checking to see if the prealloc size would consume all free space, and throttle it appropriately to avoid premature ENOSPC... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-24 11:08:55 -06:00
Dave Chinner	10616b806d	xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end When _xfs_buf_find is passed an out of range address, it will fail to find a relevant struct xfs_perag and oops with a null dereference. This can happen when trying to walk a filesystem with a metadata inode that has a partially corrupted extent map (i.e. the block number returned is corrupt, but is otherwise intact) and we try to read from the corrupted block address. In this case, just fail the lookup. If it is readahead being issued, it will simply not be done, but if it is real read that fails we will get an error being reported. Ideally this case should result in an EFSCORRUPTED error being reported, but we cannot return an error through xfs_buf_read() or xfs_buf_get() so this lookup failure may result in ENOMEM or EIO errors being reported instead. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-24 11:06:41 -06:00
Ben Myers	003fd6c8be	xfs: fix fs/xfs/xfs_log.c:1740:39: error: 'B_TRUE' undeclared Commit `667a9291c5` "xfs: Remove boolean_t typedef completely." didn't. Remove a stray B_TRUE that breaks CONFIG_XFS_DEBUG=y. Signed-off-by: Ben Myers <bpm@sgi.com> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com>	2013-01-18 15:11:57 -06:00
Greg Kroah-Hartman	ed408f7c0f	Merge 3.9-rc4 into driver-core-next This is to fix up a build problem with a wireless driver due to the dynamic-debug patches in this branch. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-17 19:48:18 -08:00
Brian Foster	9e96fe6df4	xfs: pull up stack_switch check into xfs_bmapi_write The stack_switch check currently occurs in __xfs_bmapi_allocate, which means the stack switch only occurs when xfs_bmapi_allocate() is called in a loop. Pull the check up before the loop in xfs_bmapi_write() such that the first iteration of the loop has consistent behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-17 17:53:37 -06:00
Thiago Farina	667a9291c5	xfs: Remove boolean_t typedef completely. Since we are using C99 we have one builtin defined in include/linux/types.h, use that instead. v2: you missed one in fs/xfs/xfs_qm_bhv.c, cleaned up. -bpm Signed-off-by: Thiago Farina <tfarina@chromium.org> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-17 17:32:57 -06:00
Eric Sandeen	aeb4f20a02	xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic `9802182` changed the return value from EWRONGFS (aka EINVAL) to EFSCORRUPTED which doesn't seem to be handled properly by the root filesystem probe. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Tested-by: Sergei Trofimovich <slyfox@gentoo.org> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 17:33:53 -06:00
Eric Sandeen	37f13561de	xfs: recalculate leaf entry pointer after compacting a dir2 block Dave Jones hit this assert when doing a compile on recent git, with CONFIG_XFS_DEBUG enabled: XFS: Assertion failed: (char )dup - (char )hdr == be16_to_cpu(xfs_dir2_data_unused_tag_p(dup)), file: fs/xfs/xfs_dir2_data.c, line: 828 Upon further digging, the tag found by xfs_dir2_data_unused_tag_p(dup) contained "2" and not the proper offset, and I found that this value was changed after the memmoves under "Use a stale leaf for our new entry." in xfs_dir2_block_addname(), i.e. memmove(&blp[mid + 1], &blp[mid], (highstale - mid) sizeof(*blp)); overwrote it. What has happened is that the previous call to xfs_dir2_block_compact() has rearranged things; it changes btp->count as well as the blp array. So after we make that call, we must recalculate the proper pointer to the leaf entries by making another call to xfs_dir2_block_leaf_p(). Dave provided a metadump image which led to a simple reproducer (create a particular filename in the affected directory) and this resolves the testcase as well as the bug on his live system. Thanks also to dchinner for looking at this one with me. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Tested-by: Dave Jones <davej@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 16:08:55 -06:00
Brian Foster	ab7eac2200	xfs: remove int casts from debug dquot soft limit timer asserts The int casts here make it easy to trigger an assert with a large soft limit. For example, set a >4TB soft limit on an empty volume to reproduce a (0 > -x) comparison due to an overflow of d_blk_softlimit. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 16:08:40 -06:00
Mark Tinguely	91e4bac0b7	xfs: fix the multi-segment log buffer format Per Dave Chinner suggestion, this patch: 1) Corrects the detection of whether a multi-segment buffer is still tracking data. 2) Clears all the buffer log formats for a multi-segment buffer. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 16:08:08 -06:00
Mark Tinguely	2d0e9df579	xfs: fix segment in xfs_buf_item_format_segment Not every segment in a multi-segment buffer is dirty in a transaction and they will not be outputted. The assert in xfs_buf_item_format_segment() that checks for the at least one chunk of data in the segment to be used is not necessary true for multi-segmented buffers. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 16:07:56 -06:00
Mark Tinguely	0f22f9d0cd	xfs: rename bli_format to avoid confusion with bli_formats Rename the bli_format structure to __bli_format to avoid accidently confusing them with the bli_formats pointer. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 16:07:37 -06:00
Mark Tinguely	d44d9bc68e	xfs: use b_maps[] for discontiguous buffers Commits starting at `77c1a08` introduced a multiple segment support to xfs_buf. xfs_trans_buf_item_match() could not find a multi-segment buffer in the transaction because it was looking at the single segment block number rather than the multi-segment b_maps[0].bm.bn. This results on a recursive buffer lock that can never be satisfied. This patch: 1) Changed the remaining b_map accesses to be b_maps[0] accesses. 2) Renames the single segment b_map structure to __b_map to avoid future confusion. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 16:07:11 -06:00
Abhijit Pawar	a17164e54b	fs/xfs remove obsolete simple_strto<foo> This patch replaces usages of obsolete simple_strtoul with kstrtoint in xfs_args and suffix_strtoul. Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-13 14:42:07 -06:00
Eric Sandeen	d4608632ec	xfs: recalculate leaf entry pointer after compacting a dir2 block Dave Jones hit this assert when doing a compile on recent git, with CONFIG_XFS_DEBUG enabled: XFS: Assertion failed: (char )dup - (char )hdr == be16_to_cpu(xfs_dir2_data_unused_tag_p(dup)), file: fs/xfs/xfs_dir2_data.c, line: 828 Upon further digging, the tag found by xfs_dir2_data_unused_tag_p(dup) contained "2" and not the proper offset, and I found that this value was changed after the memmoves under "Use a stale leaf for our new entry." in xfs_dir2_block_addname(), i.e. memmove(&blp[mid + 1], &blp[mid], (highstale - mid) sizeof(*blp)); overwrote it. What has happened is that the previous call to xfs_dir2_block_compact() has rearranged things; it changes btp->count as well as the blp array. So after we make that call, we must recalculate the proper pointer to the leaf entries by making another call to xfs_dir2_block_leaf_p(). Dave provided a metadump image which led to a simple reproducer (create a particular filename in the affected directory) and this resolves the testcase as well as the bug on his live system. Thanks also to dchinner for looking at this one with me. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Tested-by: Dave Jones <davej@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-13 14:36:17 -06:00
Kees Cook	d9777b8de4	fs/xfs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: Ben Myers <bpm@sgi.com> CC: Alex Elder <elder@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Ben Myers <bpm@sgi.com>	2013-01-11 11:39:04 -08:00
Eric Sandeen	83a9ba0057	xfs: don't zero structure members after a memset(0) Commit `408cc4e97a` added memset(0, ...) to allocation args structures, so there is no need to explicitly set any of the fields to 0 after that. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-03 16:00:07 -06:00
Brian Foster	f755503206	xfs: remove int casts from debug dquot soft limit timer asserts The int casts here make it easy to trigger an assert with a large soft limit. For example, set a >4TB soft limit on an empty volume to reproduce a (0 > -x) comparison due to an overflow of d_blk_softlimit. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-03 15:58:39 -06:00
Mark Tinguely	ec47eb6b0b	xfs remove the XFS_TRANS_DEBUG routines Remove the XFS_TRANS_DEBUG routines. They are no longer appropriate and have not been used in years Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-12-17 16:29:00 -06:00
Mark Tinguely	c883d0c400	xfs: fix the multi-segment log buffer format Per Dave Chinner suggestion, this patch: 1) Corrects the detection of whether a multi-segment buffer is still tracking data. 2) Clears all the buffer log formats for a multi-segment buffer. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-12-17 16:28:14 -06:00
Mark Tinguely	820a554f2f	xfs: fix segment in xfs_buf_item_format_segment Not every segment in a multi-segment buffer is dirty in a transaction and they will not be outputted. The assert in xfs_buf_item_format_segment() that checks for the at least one chunk of data in the segment to be used is not necessary true for multi-segmented buffers. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-12-17 16:26:01 -06:00
Mark Tinguely	b94381737e	xfs: rename bli_format to avoid confusion with bli_formats Rename the bli_format structure to __bli_format to avoid accidently confusing them with the bli_formats pointer. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-12-17 16:25:30 -06:00
Mark Tinguely	f4b42421d8	xfs: use b_maps[] for discontiguous buffers Commits starting at `77c1a08` introduced a multiple segment support to xfs_buf. xfs_trans_buf_item_match() could not find a multi-segment buffer in the transaction because it was looking at the single segment block number rather than the multi-segment b_maps[0].bm.bn. This results on a recursive buffer lock that can never be satisfied. This patch: 1) Changed the remaining b_map accesses to be b_maps[0] accesses. 2) Renames the single segment b_map structure to __b_map to avoid future confusion. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-12-17 16:19:56 -06:00
Dave Chinner	f9668a09e3	xfs: fix sparse reported log CRC endian issue Not a bug as such, just warning noise from the xlog_cksum() returning a __be32 type when it should be returning a __le32 type. On Wed, Nov 28, 2012 at 08:30:59AM -0500, Christoph Hellwig wrote: > But why are we storing the crc field little endian while all other on > disk formats are big endian? (And yes I realize it might as well have > been me who did that back in the idea, but I still have no idea why) Because the CRC always returns the calcuation LE format, even on BE systems. So rather than always having to byte swap it everywhere and have all the force casts and anootations for sparse, it seems simpler to just make it a __le32 everywhere.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-12-03 12:10:59 -06:00
Dave Chinner	b870553cde	xfs: fix stray dquot unlock when reclaiming dquots When we fail to get a dquot lock during reclaim, we jump to an error handler that unlocks the dquot. This is wrong as we didn't lock the dquot, and unlocking it means who-ever is holding the lock has had it silently taken away, and hence it results in a lock imbalance. Found by inspection while modifying the code for the numa-lru patchset. This fixes a random hang I've been seeing on xfstest 232 for the past several months. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-29 14:24:03 -06:00
Dave Chinner	437a255aa2	xfs: fix direct IO nested transaction deadlock. The direct IO path can do a nested transaction reservation when writing past the EOF. The first transaction is the append transaction for setting the filesize at IO completion, but we can also need a transaction for allocation of blocks. If the log is low on space due to reservations and small log, the append transaction can be granted after wating for space as the only active transaction in the system. This then attempts a reservation for an allocation, which there isn't space in the log for, and the reservation sleeps. The result is that there is nothing left in the system to wake up all the processes waiting for log space to come free. The stack trace that shows this deadlock is relatively innocuous: xlog_grant_head_wait xlog_grant_head_check xfs_log_reserve xfs_trans_reserve xfs_iomap_write_direct __xfs_get_blocks xfs_get_blocks_direct do_blockdev_direct_IO __blockdev_direct_IO xfs_vm_direct_IO generic_file_direct_write xfs_file_dio_aio_writ xfs_file_aio_write do_sync_write vfs_write This was discovered on a filesystem with a log of only 10MB, and a log stripe unit of 256k whih increased the base reservations by 512k. Hence a allocation transaction requires 1.2MB of log space to be available instead of only 260k, and so greatly increased the chance that there wouldn't be enough log space available for the nested transaction to succeed. The key to reproducing it is this mkfs command: mkfs.xfs -f -d agcount=16,su=256k,sw=12 -l su=256k,size=2560b $SCRATCH_DEV The test case was a 1000 fsstress processes running with random freeze and unfreezes every few seconds. Thanks to Eryu Guan (eguan@redhat.com) for writing the test that found this on a system with a somewhat unique default configuration.... cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andrew Dahl <adahl@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-29 14:22:56 -06:00
Dave Chinner	ef9d873344	xfs: byte range granularity for XFS_IOC_ZERO_RANGE XFS_IOC_ZERO_RANGE simply does not work properly for non page cache aligned ranges. Neither test 242 or 290 exercise this correctly, so the behaviour is completely busted even though the tests pass. Fix it to support full byte range granularity as was originally intended for this ioctl. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-29 14:21:46 -06:00
Dave Chinner	7c4cebe8e0	xfs: inode allocation should use unmapped buffers. Inode buffers do not need to be mapped as inodes are read or written directly from/to the pages underlying the buffer. This fixes a regression introduced by commit `611c994` ("xfs: make XBF_MAPPED the default behaviour"). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-26 16:01:31 -06:00
Christoph Hellwig	0e446be448	xfs: add CRC checks to the log Implement CRCs for the log buffers. We re-use a field in struct xlog_rec_header that was used for a weak checksum of the log buffer payload in debug builds before. The new checksumming uses the crc32c checksum we will use elsewhere in XFS, and also protects the record header and addition cycle data. Due to this there are some interesting changes in xlog_sync, as we need to do the cycle wrapping for the split buffer case much earlier, as we would touch the buffer after generating the checksum otherwise. The CRC calculation is always enabled, even for non-CRC filesystems, as adding this CRC does not change the log format. On non-CRC filesystems, only issue an alert if a CRC mismatch is found and allow recovery to continue - this will act as an indicator that log recovery problems are a result of log corruption. On CRC enabled filesystems, however, log recovery will fail. Note that existing debug kernels will write a simple checksum value to the log, so the first time this is run on a filesystem taht was last used on a debug kernel it will through CRC mismatch warning errors. These can be ignored. Initially based on a patch from Dave Chinner, then modified significantly by Christoph Hellwig. Modified again by Dave Chinner to get to this version. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-19 20:18:41 -06:00
Christoph Hellwig	bc02e8693d	xfs: add CRC infrastructure - add a mount feature bit for CRC enabled filesystems - add some helpers for generating and verifying the CRCs - add a copy_uuid helper The checksumming helpers are loosely based on similar ones in sctp, all other bits come from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-19 20:11:24 -06:00
Dave Chinner	1813dd6405	xfs: convert buffer verifiers to an ops structure. To separate the verifiers from iodone functions and associate read and write verifiers at the same time, introduce a buffer verifier operations structure to the xfs_buf. This avoids the need for assigning the write verifier, clearing the iodone function and re-running ioend processing in the read verifier, and gets rid of the nasty "b_pre_io" name for the write verifier function pointer. If we ever need to, it will also be easier to add further content specific callbacks to a buffer with an ops structure in place. We also avoid needing to export verifier functions, instead we can simply export the ops structures for those that are needed outside the function they are defined in. This patch also fixes a directory block readahead verifier issue it exposed. This patch also adds ops callbacks to the inode/alloc btree blocks initialised by growfs. These will need more work before they will work with CRCs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:35:12 -06:00
Dave Chinner	b0f539de9f	xfs: connect up write verifiers to new buffers Metadata buffers that are read from disk have write verifiers already attached to them, but newly allocated buffers do not. Add appropriate write verifiers to all new metadata buffers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:35:09 -06:00
Dave Chinner	612cfbfe17	xfs: add pre-write metadata buffer verifier callbacks These verifiers are essentially the same code as the read verifiers, but do not require ioend processing. Hence factor the read verifier functions and add a new write verifier wrapper that is used as the callback. This is done as one large patch for all verifiers rather than one patch per verifier as the change is largely mechanical. This includes hooking up the write verifier via the read verifier function. Hooking up the write verifier for buffers obtained via xfs_trans_get_buf() will be done in a separate patch as that touches code in many different places rather than just the verifier functions. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:35:02 -06:00
Dave Chinner	cfb0285222	xfs: add buffer pre-write callback Add a callback to the buffer write path to enable verification of the buffer and CRC calculation prior to issuing the write to the underlying storage. If the callback function detects some kind of failure or error condition, it must mark the buffer with an error so that the caller can take appropriate action. In the case of xfs_buf_ioapply(), a corrupt metadta buffer willt rigger a shutdown of the filesystem, because something is clearly wrong and we can't allow corrupt metadata to be written to disk. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:35:00 -06:00
Dave Chinner	da6958c873	xfs: Add verifiers to dir2 data readahead. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:57 -06:00
Dave Chinner	d9392a4bb7	xfs: add xfs_da_node verification Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:55 -06:00
Dave Chinner	ad14c33ac8	xfs: factor and verify attr leaf reads Some reads are not converted yet because it isn't obvious ahead of time what the format of the block is going to be. Need to determine how to tell if the first block in the tree is a node or leaf format block. That will be done in later patches. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:52 -06:00
Dave Chinner	e6f7667c4e	xfs: factor dir2 leaf read Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:48 -06:00
Dave Chinner	e481357264	xfs: factor out dir2 data block reading And add a verifier callback function while there. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:45 -06:00
Dave Chinner	2025207ca6	xfs: factor dir2 free block reading Also factor out the updating of the free block when removing entries from leaf blocks, and add a verifier callback for reads. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:43 -06:00
Dave Chinner	82025d7f79	xfs: verify dir2 block format buffers Add a dir2 block format read verifier. To fully verify every block when read, call xfs_dir2_data_check() on them. Change xfs_dir2_data_check() to do runtime checking, convert ASSERT() checks to XFS_WANT_CORRUPTED_RETURN(), which will trigger an ASSERT failure on debug kernels, but on production kernels will dump an error to dmesg and return EFSCORRUPTED to the caller. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:41 -06:00
Dave Chinner	20f7e9f372	xfs: factor dir2 block read operations In preparation for verifying dir2 block format buffers, factor the read operations out of the block operations (lookup, addname, getdents) and some of the additional logic to make it easier to understand an dmodify the code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:39 -06:00
Dave Chinner	4bb20a83a2	xfs: add verifier callback to directory read code Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:36 -06:00
Dave Chinner	c631919870	xfs: verify dquot blocks as they are read from disk Add a dquot buffer verify callback function and pass it into the buffer read functions. This checks all the dquots in a buffer, but cannot completely verify the dquot ids are correct. Also, errors cannot be repaired, so an additional function is added to repair bad dquots in the buffer if such an error is detected in a context where repair is allowed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:33 -06:00
Dave Chinner	3d3e6f64e2	xfs: verify btree blocks as they are read from disk Add an btree block verify callback function and pass it into the buffer read functions. Because each different btree block type requires different verification, add a function to the ops structure that is called from the generic code. Also, propagate the verification callback functions through the readahead functions, and into the external bmap and bulkstat inode readahead code that uses the generic btree buffer read functions. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:31 -06:00
Dave Chinner	af133e8606	xfs: verify inode buffers as they are read from disk Add an inode buffer verify callback function and pass it into the buffer read functions. Inodes are special in that the verbose checks will be done when reading the inode, but we still need to sanity check the buffer when that is first read. Always verify the magic numbers in all inodes in the buffer, rather than jus ton debug kernels. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:19 -06:00
Dave Chinner	bb80c6d79a	xfs: verify AGFL blocks as they are read from disk Add an AGFL block verify callback function and pass it into the buffer read functions. While this commit adds verification code to the AGFL, it cannot be used reliably until the CRC format change comes along as mkfs does not initialise the full AGFL. Hence it can be full of garbage at the first mount and will fail verification right now. CRC enabled filesystems won't have this problem, so leave the code that has already been written ifdef'd out until the proper time. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:14 -06:00
Dave Chinner	3702ce6ed7	xfs: verify AGI blocks as they are read from disk Add an AGI block verify callback function and pass it into the buffer read functions. Remove the now redundant verification code that is currently in use. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:12 -06:00
Dave Chinner	5d5f527d13	xfs: verify AGF blocks as they are read from disk Add an AGF block verify callback function and pass it into the buffer read functions. This replaces the existing verification that is done after the read completes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:10 -06:00
Dave Chinner	98021821a5	xfs: verify superblocks as they are read from disk Add a superblock verify callback function and pass it into the buffer read functions. Remove the now redundant verification code that is currently in use. Adding verification shows that secondary superblocks never have their "sb_inprogress" flag cleared by mkfs.xfs, so when validating the secondary superblocks during a grow operation we have to avoid checking this field. Even if we fix mkfs, we will still have to ignore this field for verification purposes unless a version of mkfs that does not have this bug was used. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:07 -06:00
Dave Chinner	eab4e63368	xfs: uncached buffer reads need to return an error With verification being done as an IO completion callback, different errors can be returned from a read. Uncached reads only return a buffer or NULL on failure, which means the verification error cannot be returned to the caller. Split the error handling for these reads into two - a failure to get a buffer will still return NULL, but a read error will return a referenced buffer with b_error set rather than NULL. The caller is responsible for checking the error state of the buffer returned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:05 -06:00
Dave Chinner	c3f8fc73ac	xfs: make buffer read verication an IO completion function Add a verifier function callback capability to the buffer read interfaces. This will be used by the callers to supply a function that verifies the contents of the buffer when it is read from disk. This patch does not provide callback functions, but simply modifies the interfaces to allow them to be called. The reason for adding this to the read interfaces is that it is very difficult to tell fom the outside is a buffer was just read from disk or whether we just pulled it out of cache. Supplying a callbck allows the buffer cache to use it's internal knowledge of the buffer to execute it only when the buffer is read from disk. It is intended that the verifier functions will mark the buffer with an EFSCORRUPTED error when verification fails. This allows the reading context to distinguish a verification error from an IO error, and potentially take further actions on the buffer (e.g. attempt repair) based on the error reported. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-15 21:34:02 -06:00
Dave Chinner	fb59581404	xfs: remove xfs_flushinval_pages It's just a simple wrapper around VFS functionality, and is actually bugging in that it doesn't remove mappings before invalidating the page cache. Remove it and replace it with the correct VFS functionality. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Andrew Dahl <adahl@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-14 15:15:08 -06:00
Dave Chinner	4bc1ea6b8d	xfs: remove xfs_flush_pages It is a complex wrapper around VFS functions, but there are VFS functions that provide exactly the same functionality. Call the VFS functions directly and remove the unnecessary indirection and complexity. We don't need to care about clearing the XFS_ITRUNCATED flag, as that is done during .writepages. Hence is cleared by the VFS writeback path if there is anything to write back during the flush. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Andrew Dahl <adahl@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-14 15:12:45 -06:00
Dave Chinner	95eacf0f71	xfs: remove xfs_wait_on_pages() It's just a simple wrapper around a VFS function that is only called by another function in xfs_fs_subr.c. Remove it and call the VFS function directly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Andrew Dahl <adahl@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-14 15:12:20 -06:00
Andrew Dahl	d6638ae244	xfs: reverse the check on XFS_IOC_ZERO_RANGE Reversing the check on XFS_IOC_ZERO_RANGE. Range should be zeroed if the start is less than or equal to the end. Signed-off-by: Andrew Dahl <adahl@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-14 15:11:52 -06:00
Dave Chinner	f5b8911b67	xfs: remove xfs_tosspages It's a buggy, unnecessary wrapper that is duplicating truncate_pagecache_range(). When replacing the call in xfs_change_file_space(), also ensure that the length being allocated/freed is always positive before making any changes. These checks are done in the lower extent manipulation functions, too, but we need to do them before any page cache operations. Reported-by: Andrew Dahl <adahl@sgi.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-By: Andrew Dahl <adahl@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-14 15:11:19 -06:00
Dave Chinner	de497688da	xfs: make growfs initialise the AGFL header For verification purposes, AGFLs need to be initialised to a known set of values. For upcoming CRC changes, they are also headers that need to be initialised. Currently, growfs does neither for the AGFLs - it ignores them completely. Add initialisation of the AGFL to be full of invalid block numbers (NULLAGBLOCK) to put the infrastructure in place needed for CRC support. Includes a comment clarification from Jeff Liu. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by Rich Johnston <rjohnston@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-13 16:40:59 -06:00
Dave Chinner	fd23683c3b	xfs: growfs: use uncached buffers for new headers When writing the new AG headers to disk, we can't attach write verifiers because they have a dependency on the struct xfs-perag being attached to the buffer to be fully initialised and growfs can't fully initialise them until later in the process. The simplest way to avoid this problem is to use uncached buffers for writing the new headers. These buffers don't have the xfs-perag attached to them, so it's simple to detect in the write verifier and be able to skip the checks that need the xfs-perag. This enables us to attach the appropriate buffer ops to the buffer and hence calculate CRCs on the way to disk. IT also means that the buffer is torn down immediately, and so the first access to the AG headers will re-read the header from disk and perform full verification of the buffer. This way we also can catch corruptions due to problems that went undetected in growfs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by Rich Johnston <rjohnston@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-13 16:40:43 -06:00
Dave Chinner	b64f3a390d	xfs: use btree block initialisation functions in growfs Factor xfs_btree_init_block() to be independent of the btree cursor, and use the function to initialise btree blocks in the growfs code. This makes adding support for different format btree blocks simple. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by Rich Johnston <rjohnston@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-13 16:40:27 -06:00
Dave Chinner	ee73259b40	xfs: add more attribute tree trace points. Added when debugging recent attribute tree problems to more finely trace code execution through the maze of twisty passages that makes up the attr code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-13 14:47:00 -06:00
Dave Chinner	37eb17e604	xfs: drop buffer io reference when a bad bio is built Error handling in xfs_buf_ioapply_map() does not handle IO reference counts correctly. We increment the b_io_remaining count before building the bio, but then fail to decrement it in the failure case. This leads to the buffer never running IO completion and releasing the reference that the IO holds, so at unmount we can leak the buffer. This leak is captured by this assert failure during unmount: XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273 This is not a new bug - the b_io_remaining accounting has had this problem for a long, long time - it's just very hard to get a zero length bio being built by this code... Further, the buffer IO error can be overwritten on a multi-segment buffer by subsequent bio completions for partial sections of the buffer. Hence we should only set the buffer error status if the buffer is not already carrying an error status. This ensures that a partial IO error on a multi-segment buffer will not be lost. This part of the problem is a regression, however. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-13 14:45:57 -06:00
Dave Chinner	7bf7f35219	xfs: fix broken error handling in xfs_vm_writepage When we shut down the filesystem, it might first be detected in writeback when we are allocating a inode size transaction. This happens after we have moved all the pages into the writeback state and unlocked them. Unfortunately, if we fail to set up the transaction we then abort writeback and try to invalidate the current page. This then triggers are BUG() in block_invalidatepage() because we are trying to invalidate an unlocked page. Fixing this is a bit of a chicken and egg problem - we can't allocate the transaction until we've clustered all the pages into the IO and we know the size of it (i.e. whether the last block of the IO is beyond the current EOF or not). However, we don't want to hold pages locked for long periods of time, especially while we lock other pages to cluster them into the write. To fix this, we need to make a clear delineation in writeback where errors can only be handled by IO completion processing. That is, once we have marked a page for writeback and unlocked it, we have to report errors via IO completion because we've already started the IO. We may not have submitted any IO, but we've changed the page state to indicate that it is under IO so we must now use the IO completion path to report errors. To do this, add an error field to xfs_submit_ioend() to pass it the error that occurred during the building on the ioend chain. When this is non-zero, mark each ioend with the error and call xfs_finish_ioend() directly rather than building bios. This will immediately push the ioends through completion processing with the error that has occurred. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-13 14:45:45 -06:00
Dave Chinner	07428d7f0c	xfs: fix attr tree double split corruption In certain circumstances, a double split of an attribute tree is needed to insert or replace an attribute. In rare situations, this can go wrong, leaving the attribute tree corrupted. In this case, the attr being replaced is the last attr in a leaf node, and the replacement is larger so doesn't fit in the same leaf node. When we have the initial condition of a node format attribute btree with two leaves at index 1 and 2. Call them L1 and L2. The leaf L1 is completely full, there is not a single byte of free space in it. L2 is mostly empty. The attribute being replaced - call it X - is the last attribute in L1. The way an attribute replace is executed is that the replacement attribute - call it Y - is first inserted into the tree, but has an INCOMPLETE flag set on it so that list traversals ignore it. Once this transaction is committed, a second transaction it run to atomically mark Y as COMPLETE and X as INCOMPLETE, so that a traversal will now find Y and skip X. Once that transaction is committed, attribute X is then removed. So, the initial condition is: +--------+ +--------+ \| L1 \| \| L2 \| \| fwd: 2 \|---->\| fwd: 0 \| \| bwd: 0 \|<----\| bwd: 1 \| \| fsp: 0 \| \| fsp: N \| \|--------\| \|--------\| \| attr A \| \| attr 1 \| \|--------\| \|--------\| \| attr B \| \| attr 2 \| \|--------\| \|--------\| .......... .......... \|--------\| \|--------\| \| attr X \| \| attr n \| +--------+ +--------+ So now we go to replace X, and see that L1:fsp = 0 - it is full so we can't insert Y in the same leaf. So we record the the location of attribute X so we can track it for later use, then we split L1 into L1 and L3 and reblance across the two leafs. We end with: +--------+ +--------+ +--------+ \| L1 \| \| L3 \| \| L2 \| \| fwd: 3 \|---->\| fwd: 2 \|---->\| fwd: 0 \| \| bwd: 0 \|<----\| bwd: 1 \|<----\| bwd: 3 \| \| fsp: M \| \| fsp: J \| \| fsp: N \| \|--------\| \|--------\| \|--------\| \| attr A \| \| attr X \| \| attr 1 \| \|--------\| +--------+ \|--------\| \| attr B \| \| attr 2 \| \|--------\| \|--------\| .......... .......... \|--------\| \|--------\| \| attr W \| \| attr n \| +--------+ +--------+ And we track that the original attribute is now at L3:0. We then try to insert Y into L1 again, and find that there isn't enough room because the new attribute is larger than the old one. Hence we have to split again to make room for Y. We end up with this: +--------+ +--------+ +--------+ +--------+ \| L1 \| \| L4 \| \| L3 \| \| L2 \| \| fwd: 4 \|---->\| fwd: 3 \|---->\| fwd: 2 \|---->\| fwd: 0 \| \| bwd: 0 \|<----\| bwd: 1 \|<----\| bwd: 4 \|<----\| bwd: 3 \| \| fsp: M \| \| fsp: J \| \| fsp: J \| \| fsp: N \| \|--------\| \|--------\| \|--------\| \|--------\| \| attr A \| \| attr Y \| \| attr X \| \| attr 1 \| \|--------\| + INCOMP + +--------+ \|--------\| \| attr B \| +--------+ \| attr 2 \| \|--------\| \|--------\| .......... .......... \|--------\| \|--------\| \| attr W \| \| attr n \| +--------+ +--------+ And now we have the new (incomplete) attribute @ L4:0, and the original attribute at L3:0. At this point, the first transaction is committed, and we move to the flipping of the flags. This is where we are supposed to end up with this: +--------+ +--------+ +--------+ +--------+ \| L1 \| \| L4 \| \| L3 \| \| L2 \| \| fwd: 4 \|---->\| fwd: 3 \|---->\| fwd: 2 \|---->\| fwd: 0 \| \| bwd: 0 \|<----\| bwd: 1 \|<----\| bwd: 4 \|<----\| bwd: 3 \| \| fsp: M \| \| fsp: J \| \| fsp: J \| \| fsp: N \| \|--------\| \|--------\| \|--------\| \|--------\| \| attr A \| \| attr Y \| \| attr X \| \| attr 1 \| \|--------\| +--------+ + INCOMP + \|--------\| \| attr B \| +--------+ \| attr 2 \| \|--------\| \|--------\| .......... .......... \|--------\| \|--------\| \| attr W \| \| attr n \| +--------+ +--------+ But that doesn't happen properly - the attribute tracking indexes are not pointing to the right locations. What we end up with is both the old attribute to be removed pointing at L4:0 and the new attribute at L4:1. On a debug kernel, this assert fails like so: XFS: Assertion failed: args->index2 < be16_to_cpu(leaf2->hdr.count), file: fs/xfs/xfs_attr_leaf.c, line: 2725 because the new attribute location does not exist. On a production kernel, this goes unnoticed and the code proceeds ahead merrily and removes L4 because it thinks that is the block that is no longer needed. This leaves the hash index node pointing to entries L1, L4 and L2, but only blocks L1, L3 and L2 to exist. Further, the leaf level sibling list is L1 <-> L4 <-> L2, but L4 is now free space, and so everything is busted. This corruption is caused by the removal of the old attribute triggering a join - it joins everything correctly but then frees the wrong block. xfs_repair will report something like: bad sibling back pointer for block 4 in attribute fork for inode 131 problem with attribute contents in inode 131 would clear attr fork bad nblocks 8 for inode 131, would reset to 3 bad anextents 4 for inode 131, would reset to 0 The problem lies in the assignment of the old/new blocks for tracking purposes when the double leaf split occurs. The first split tries to place the new attribute inside the current leaf (i.e. "inleaf == true") and moves the old attribute (X) to the new block. This sets up the old block/index to L1:X, and newly allocated block to L3:0. It then moves attr X to the new block and tries to insert attr Y at the old index. That fails, so it splits again. With the second split, the rebalance ends up placing the new attr in the second new block - L4:0 - and this is where the code goes wrong. What is does is it sets both the new and old block index to the second new block. Hence it inserts attr Y at the right place (L4:0) but overwrites the current location of the attr to replace that is held in the new block index (currently L3:0). It over writes it with L4:1 - the index we later assert fail on. Hopefully this table will show this in a foramt that is a bit easier to understand: Split old attr index new attr index vanilla patched vanilla patched before 1st L1:26 L1:26 N/A N/A after 1st L3:0 L3:0 L1:26 L1:26 after 2nd L4:0 L3:0 L4:1 L4:0 ^^^^ ^^^^ wrong wrong The fix is surprisingly simple, for all this analysis - just stop the rebalance on the out-of leaf case from overwriting the new attr index - it's already correct for the double split case. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-13 14:45:29 -06:00
Brian Foster	579b62faa5	xfs: add background scanning to clear eofblocks inodes Create a new mount workqueue and delayed_work to enable background scanning and freeing of eofblocks inodes. The scanner kicks in once speculative preallocation occurs and stops requeueing itself when no eofblocks inodes exist. The scan interval is based on the new 'speculative_prealloc_lifetime' tunable (default to 5m). The background scanner performs unfiltered, best effort scans (which skips inodes under lock contention or with a dirty cache mapping). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-08 15:34:59 -06:00
Brian Foster	00ca79a04b	xfs: add minimum file size filtering to eofblocks scan Support minimum file size filtering in the eofblocks scan. The caller must set the XFS_EOF_FLAGS_MINFILESIZE flags bit and minimum file size value in bytes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-08 15:32:29 -06:00
Brian Foster	1b5560488d	xfs: support multiple inode id filtering in eofblocks scan Enhance the eofblocks scan code to filter based on multiply specified inode id values. When multiple inode id values are specified, only inodes that match all id values are selected. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-08 15:31:13 -06:00
Brian Foster	3e3f9f5863	xfs: add inode id filtering to eofblocks scan Support inode ID filtering in the eofblocks scan. The caller must set the associated XFS_EOF_FLAGS_*ID bit and ID field. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-08 15:29:14 -06:00
Brian Foster	8ca149de80	xfs: add XFS_IOC_FREE_EOFBLOCKS ioctl The XFS_IOC_FREE_EOFBLOCKS ioctl allows users to invoke an EOFBLOCKS scan. The xfs_eofblocks structure is defined to support the command parameters (scan mode). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-08 15:27:49 -06:00
Brian Foster	41176a68e3	xfs: create function to scan and clear EOFBLOCKS inodes xfs_inodes_free_eofblocks() implements scanning functionality for EOFBLOCKS inodes. It uses the AG iterator to walk the tagged inodes and free post-EOF blocks via the xfs_inode_free_eofblocks() execute function. The scan can be invoked in best-effort mode or wait (force) mode. A best-effort scan (default) handles all inodes that do not have a dirty cache and we successfully acquire the io lock via trylock. In wait mode, we continue to cycle through an AG until all inodes are handled. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-08 15:25:40 -06:00
Brian Foster	40165e2761	xfs: make xfs_free_eofblocks() non-static, return EAGAIN on trylock failure Turn xfs_free_eofblocks() into a non-static function, return EAGAIN to indicate trylock failure and make sure this error is not propagated in xfs_release(). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-08 15:24:26 -06:00
Brian Foster	72b53efa4a	xfs: create helper to check whether to free eofblocks on inode This check is used in multiple places to determine whether we should check for (and potentially free) post EOF blocks on an inode. Add a helper to consolidate the check. Note that when we remove an inode from the cache (xfs_inactive()), we are required to trim post-EOF blocks even if the inode is marked preallocated or append-only to maintain correct space accounting. The 'force' parameter to xfs_can_free_eofblocks() specifies whether we should ignore the prealloc/append-only status of the inode. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-08 15:22:34 -06:00
Brian Foster	a454f7428f	xfs: support a tag-based inode_ag_iterator Genericize xfs_inode_ag_walk() to support an optional radix tree tag and args argument for the execute function. Create a new wrapper called xfs_inode_ag_iterator_tag() that performs a tag based walk of perag's and inodes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-08 15:20:38 -06:00
Brian Foster	27b5286792	xfs: add EOFBLOCKS inode tagging/untagging Add the XFS_ICI_EOFBLOCKS_TAG inode tag to identify inodes with speculatively preallocated blocks beyond EOF. An inode is tagged when speculative preallocation occurs and untagged either via truncate down or when post-EOF blocks are freed via release or reclaim. The tag management is intentionally not aggressive to prefer simplicity over the complexity of handling all the corner cases under which post-EOF blocks could be freed (i.e., forward truncation, fallocate, write error conditions, etc.). This means that a tagged inode may or may not have post-EOF blocks after a period of time. The tag is eventually cleared when the inode is released or reclaimed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-08 14:20:44 -06:00
Eric Sandeen	69a58a43f7	xfs: report projid32bit feature in geometry call When xfs gained the projid32bit feature, it was never added to the FSGEOMETRY ioctl feature flags, so it's not queryable without this patch. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-07 15:27:27 -06:00
Dave Chinner	009507b052	xfs: fix reading of wrapped log data Commit `4439647` ("xfs: reset buffer pointers before freeing them") in 3.0-rc1 introduced a regression when recovering log buffers that wrapped around the end of log. The second part of the log buffer at the start of the physical log was being read into the header buffer rather than the data buffer, and hence recovery was seeing garbage in the data buffer when it got to the region of the log buffer that was incorrectly read. Cc: <stable@vger.kernel.org> # 3.0.x, 3.2.x, 3.4.x 3.6.x Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-07 15:27:17 -06:00
Dave Chinner	137fff09b7	xfs: fix buffer shudown reference count mismatch When we shut down the filesystem, we have to unpin and free all the buffers currently active in the CIL. To do this we unpin and remove them in one operation as a result of a failed iclogbuf write. For buffers, we do this removal via a simultated IO completion of after marking the buffer stale. At the time we do this, we have two references to the buffer - the active LRU reference and the buf log item. The LRU reference is removed by marking the buffer stale, and the active CIL reference is by the xfs_buf_iodone() callback that is run by xfs_buf_do_callbacks() during ioend processing (via the bp->b_iodone callback). However, ioend processing requires one more reference - that of the IO that it is completing. We don't have this reference, so we free the buffer prematurely and use it after it is freed. For buffers marked with XBF_ASYNC, this leads to assert failures in xfs_buf_rele() on debug kernels because the b_hold count is zero. Fix this by making sure we take the necessary IO reference before starting IO completion processing on the stale buffer, and set the XBF_ASYNC flag to ensure that IO completion processing removes all the active references from the buffer to ensure it is fully torn down. Cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-07 15:26:53 -06:00
Dave Chinner	b6aff29f3a	xfs: don't vmap inode cluster buffers during free Inode buffers do not need to be mapped as inodes are read or written directly from/to the pages underlying the buffer. This fixes a regression introduced by commit `611c994` ("xfs: make XBF_MAPPED the default behaviour"). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-07 14:15:48 -06:00
Dave Chinner	4c05f9ad4d	xfs: invalidate allocbt blocks moved to the free list When we free a block from the alloc btree tree, we move it to the freelist held in the AGFL and mark it busy in the busy extent tree. This typically happens when we merge btree blocks. Once the transaction is committed and checkpointed, the block can remain on the free list for an indefinite amount of time. Now, this isn't the end of the world at this point - if the free list is shortened, the buffer is invalidated in the transaction that moves it back to free space. If the buffer is allocated as metadata from the free list, then all the modifications getted logged, and we have no issues, either. And if it gets allocated as userdata direct from the freelist, it gets invalidated and so will never get written. However, during the time it sits on the free list, pressure on the log can cause the AIL to be pushed and the buffer that covers the block gets pushed for write. IOWs, we end up writing a freed metadata block to disk. Again, this isn't the end of the world because we know from the above we are only writing to free space. The problem, however, is for validation callbacks. If the block was on old btree root block, then the level of the block is going to be higher than the current tree root, and so will fail validation. There may be other inconsistencies in the block as well, and currently we don't care because the block is in free space. Shutting down the filesystem because a freed block doesn't pass write validation, OTOH, is rather unfriendly. So, make sure we always invalidate buffers as they move from the free space trees to the free list so that we guarantee they never get written to disk while on the free list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Phil White <pwhite@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-07 14:13:30 -06:00
Carlos Maiolino	cd856db69c	xfs: Update inode alloc comments I found some out of date comments while studying the inode allocation code, so I believe it's worth to have these comments updated. It basically rewrites the comment regarding to "call_again" variable, which is not used anymore, but instead, callers of xfs_ialloc() decides if it needs to be called again relying only if ialloc_context is NULL or not. Also did some small changes in another comment that I thought to be pertinent to the current behaviour of these functions and some alignment on both comments. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-02 17:13:12 -05:00
Dave Chinner	531c3bdc86	xfs: silence uninitialised f.file warning. Uninitialised variable build warning introduced by `2903ff0` ("switch simple cases of fget_light to fdget"), gcc is not smart enough to work out that the variable is not used uninitialised, and the commit removed the initialisation at declaration that the old variable had. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-02 16:18:45 -05:00
Dave Chinner	1375cb65e8	xfs: growfs: don't read garbage for new secondary superblocks When updating new secondary superblocks in a growfs operation, the superblock buffer is read from the newly grown region of the underlying device. This is not guaranteed to be zero, so violates the underlying assumption that the unused parts of superblocks are zero filled. Get a new buffer for these secondary superblocks to ensure that the unused regions are zero filled correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-02 15:42:35 -05:00
Dave Chinner	e04426b920	xfs: move allocation stack switch up to xfs_bmapi_allocate Switching stacks are xfs_alloc_vextent can cause deadlocks when we run out of worker threads on the allocation workqueue. This can occur because xfs_bmap_btalloc can make multiple calls to xfs_alloc_vextent() and even if xfs_alloc_vextent() fails it can return with the AGF locked in the current allocation transaction. If we then need to make another allocation, and all the allocation worker contexts are exhausted because the are blocked waiting for the AGF lock, holder of the AGF cannot get it's xfs-alloc_vextent work completed to release the AGF. Hence allocation effectively deadlocks. To avoid this, move the stack switch one layer up to xfs_bmapi_allocate() so that all of the allocation attempts in a single switched stack transaction occur in a single worker context. This avoids the problem of an allocation being blocked waiting for a worker thread whilst holding the AGF. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-18 17:42:48 -05:00
Dave Chinner	2455881c0b	xfs: introduce XFS_BMAPI_STACK_SWITCH Certain allocation paths through xfs_bmapi_write() are in situations where we have limited stack available. These are almost always in the buffered IO writeback path when convertion delayed allocation extents to real extents. The current stack switch occurs for userdata allocations, which means we also do stack switches for preallocation, direct IO and unwritten extent conversion, even those these call chains have never been implicated in a stack overrun. Hence, let's target just the single stack overun offended for stack switches. To do that, introduce a XFS_BMAPI_STACK_SWITCH flag that the caller can pass xfs_bmapi_write() to indicate it should switch stacks if it needs to do allocation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-18 17:41:56 -05:00
Mark Tinguely	a00416844b	xfs: zero allocation_args on the kernel stack Zero the kernel stack space that makes up the xfs_alloc_arg structures. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-18 17:34:16 -05:00
Dave Chinner	d35e88faa3	xfs: only update the last_sync_lsn when a transaction completes The log write code stamps each iclog with the current tail LSN in the iclog header so that recovery knows where to find the tail of thelog once it has found the head. Normally this is taken from the first item on the AIL - the log item that corresponds to the oldest active item in the log. The problem is that when the AIL is empty, the tail lsn is dervied from the the l_last_sync_lsn, which is the LSN of the last iclog to be written to the log. In most cases this doesn't happen, because the AIL is rarely empty on an active filesystem. However, when it does, it opens up an interesting case when the transaction being committed to the iclog spans multiple iclogs. That is, the first iclog is stamped with the l_last_sync_lsn, and IO is issued. Then the next iclog is setup, the changes copied into the iclog (takes some time), and then the l_last_sync_lsn is stamped into the header and IO is issued. This is still the same transaction, so the tail lsn of both iclogs must be the same for log recovery to find the entire transaction to be able to replay it. The problem arises in that the iclog buffer IO completion updates the l_last_sync_lsn with it's own LSN. Therefore, If the first iclog completes it's IO before the second iclog is filled and has the tail lsn stamped in it, it will stamp the LSN of the first iclog into it's tail lsn field. If the system fails at this point, log recovery will not see a complete transaction, so the transaction will no be replayed. The fix is simple - the l_last_sync_lsn is updated when a iclog buffer IO completes, and this is incorrect. The l_last_sync_lsn shoul dbe updated when a transaction is completed by a iclog buffer IO. That is, only iclog buffers that have transaction commit callbacks attached to them should update the l_last_sync_lsn. This means that the last_sync_lsn will only move forward when a commit record it written, not in the middle of a large transaction that is rolling through multiple iclog buffers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 13:43:35 -05:00
Dave Chinner	33479e0542	xfs: remove xfs_iget.c The inode cache functions remaining in xfs_iget.c can be moved to xfs_icache.c along with the other inode cache functions. This removes all functionality from xfs_iget.c, so the file can simply be removed. This move results in various functions now only having the scope of a single file (e.g. xfs_inode_free()), so clean up all the definitions and exported prototypes in xfs_icache.[ch] and xfs_inode.h appropriately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 13:42:25 -05:00
Dave Chinner	fa96acadf1	xfs: move inode locking functions to xfs_inode.c xfs_ilock() and friends really aren't related to the inode cache in any way, so move them to xfs_inode.c with all the other inode related functionality. While doing this move, move the xfs_ilock() tracepoints to before the lock is taken so that when a hang on a lock occurs we have events to indicate which process and what inode we were trying to lock when the hang occurred. This is much better than the current silence we get on a hang... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 13:40:54 -05:00
Dave Chinner	6d8b79cfca	xfs: rename xfs_sync.[ch] to xfs_icache.[ch] xfs_sync.c now only contains inode reclaim functions and inode cache iteration functions. It is not related to sync operations anymore. Rename to xfs_icache.c to reflect it's contents and prepare for consolidation with the other inode cache file that exists (xfs_iget.c). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 13:40:09 -05:00
Dave Chinner	c75921a72a	xfs: xfs_quiesce_attr() should quiesce the log like unmount xfs_quiesce_attr() is supposed to leave the log empty with an unmount record written. Right now it does not wait for the AIL to be emptied before writing the unmount record, not does it wait for metadata IO completion, either. Fix it to use the same method and code as xfs_log_unmount(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 13:39:14 -05:00
Dave Chinner	c7eea6f7ad	xfs: move xfs_quiesce_attr() into xfs_super.c Both callers of xfs_quiesce_attr() are in xfs_super.c, and there's nothing really sync-specific about this functionality so it doesn't really matter where it lives. Move it to benext to it's callers, so all the remount/sync_fs code is in the one place. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 13:30:20 -05:00
Dave Chinner	34061f5c42	xfs: xfs_sync_fsdata is redundant Why do we need to write the superblock to disk once we've written all the data? We don't actually - the reasons for doing this are lost in the mists of time, and go back to the way Irix used to drive VFS flushing. On linux, this code is only called from two contexts: remount and .sync_fs. In the remount case, the call is followed by a metadata sync, which unpins and writes the superblock. In the sync_fs case, we only need to force the log to disk to ensure that the superblock is correctly on disk, so we don't actually need to write it. Hence the functionality is either redundant or superfluous and thus can be removed. Seeing as xfs_quiesce_data is essentially now just a log force, remove it as well and fold the code back into the two callers. Neither of them need the log covering check, either, as that is redundant for the remount case, and unnecessary for the .sync_fs case. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 12:28:47 -05:00
Dave Chinner	5889608df3	xfs: syncd workqueue is no more With the syncd functions moved to the log and/or removed, the syncd workqueue is the only remaining bit left. It is used by the log covering/ail pushing work, as well as by the inode reclaim work. Given how cheap workqueues are these days, give the log and inode reclaim work their own work queues and kill the syncd work queue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 12:19:27 -05:00
Dave Chinner	9aa05000f2	xfs: xfs_sync_data is redundant. We don't do any data writeback from XFS any more - the VFS is completely responsible for that, including for freeze. We can replace the remaining caller with a VFS level function that achieves the same thing, but without conflicting with current writeback work. This means we can remove the flush_work and xfs_flush_inodes() - the VFS functionality completely replaces the internal flush queue for doing this writeback work in a separate context to avoid stack overruns. This does have one complication - it cannot be called with page locks held. Hence move the flushing of delalloc space when ENOSPC occurs back up into xfs_file_aio_buffered_write when we don't hold any locks that will stall writeback. Unfortunately, writeback_inodes_sb_if_idle() is not sufficient to trigger delalloc conversion fast enough to prevent spurious ENOSPC whent here are hundreds of writers, thousands of small files and GBs of free RAM. Hence we need to use sync_sb_inodes() to block callers while we wait for writeback like the previous xfs_flush_inodes implementation did. That means we have to hold the s_umount lock here, but because this call can nest inside i_mutex (the parent directory in the create case, held by the VFS), we have to use down_read_trylock() to avoid potential deadlocks. In practice, this trylock will succeed on almost every attempt as unmount/remount type operations are exceedingly rare. Note: we always need to pass a count of zero to generic_file_buffered_write() as the previously written byte count. We only do this by accident before this patch by the virtue of ret always being zero when there are no errors. Make this explicit rather than needing to specifically zero ret in the ENOSPC retry case. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 12:01:25 -05:00
Dave Chinner	cf2931db2d	xfs: Bring some sanity to log unmounting When unmounting the filesystem, there are lots of operations that need to be done in a specific order, and they are spread across across a couple of functions. We have to drain the AIL before we write the unmount record, and we have to shut down the background log work before we do either of them. But this is all split haphazardly across xfs_unmountfs() and xfs_log_unmount(). Move all the AIL flushing and log manipulations to xfs_log_unmount() so that the responisbilities of each function is clear and the operations they perform obvious. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 11:57:10 -05:00
Dave Chinner	f661f1e0bf	xfs: sync work is now only periodic log work The only thing the periodic sync work does now is flush the AIL and idle the log. These are really functions of the log code, so move the work to xfs_log.c and rename it appropriately. The only wart that this leaves behind is the xfssyncd_centisecs sysctl, otherwise the xfssyncd is dead. Clean up any comments that related to xfssyncd to reflect it's passing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 11:53:29 -05:00
Dave Chinner	7f7bebefba	xfs: don't run the sync work if the filesystem is read-only If the filesystem is mounted or remounted read-only, stop the sync worker that tries to flush or cover the log if the filesystem is dirty. It's read-only, so it isn't dirty. Restart it on a remount,rw as necessary. This avoids the need for RO checks in the work. Similarly, stop the sync work when the filesystem is frozen, and start it again when the filesysetm is thawed. This avoids the need for special freeze checks in the work. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 11:48:29 -05:00
Dave Chinner	7e18530bef	xfs: rationalise xfs_mount_wq users Instead of starting and stopping background work on the xfs_mount_wq all at the same time, separate them to where they really are needed to start and stop. The xfs_sync_worker, only needs to be started after all the mount processing has completed successfully, while it needs to be stopped before the log is unmounted. The xfs_reclaim_worker is started on demand, and can be stopped before the unmount process does it's own inode reclaim pass. The xfs_flush_inodes work is run on demand, and so we really only need to ensure that it has stopped running before we start processing an unmount, freeze or remount,ro. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 11:25:06 -05:00
Dave Chinner	33c7a2bc48	xfs: xfs_syncd_stop must die xfs_syncd_start and xfs_syncd_stop tie a bunch of unrelated functionailty together that actually have different start and stop requirements. Kill these functions and open code the start/stop methods for each of the background functions. Subsequent patches will move the start/stop functions around to the correct places to avoid races and shutdown issues. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-10-17 11:14:19 -05:00
Hugh Dickins	35c2a7f490	tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(), u64 inum = fid->raw[2]; which is unhelpfully reported as at the end of shmem_alloc_inode(): BUG: unable to handle kernel paging request at ffff880061cd3000 IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Call Trace: [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0 [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0 [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10 [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6 Right, tmpfs is being stupid to access fid->raw[2] before validating that fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may fall at the end of a page, and the next page not be present. But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and could oops in the same way: add the missing fh_len checks to those. Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Sage Weil <sage@inktank.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-09 23:33:55 -04:00
Konstantin Khlebnikov	0b173bc4da	mm: kill vma flag VM_CAN_NONLINEAR Move actual pte filling for non-linear file mappings into the new special vma operation: ->remap_pages(). Filesystems must implement this method to get non-linear mapping support, if it uses filemap_fault() then generic_file_remap_pages() can be used. Now device drivers can implement this method and obtain nonlinear vma support. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:17 +09:00
Linus Torvalds	60c7b4df82	xfs: update for 3.7-rc1 Several enhancements and cleanups: - make inode32 and inode64 remountable options - SEEK_HOLE/SEEK_DATA enhancements - cleanup struct declarations in xfs_mount.h -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJQayQwAAoJENaLyazVq6ZO2H0P/RiWpYlVLP0ZNbxUOvL33WFy xOHwNkhRG23Z0sgDfrUJjDibzOuSmf/XothsChGPE4Mm8JLCjZKzxR77ySP+pbC5 sylrciXl5E2tY3wtW8oh2kQORLRneqA6qXNCrOtYklqwSmB0qx7bf9OdjL/ALiuC SvTAn/PhyVZqGJgVZyEAc63CsbSzK2XmLgyfeQeA7JbSptbg5fDkFYrbNE+zr+yW S1umMsuZW+zwDjzRq0lGkLcYOH3fD0JpMrfqvtQsk3+SIDlq1HSaObD63dhOJVUV LUpOqgXCPEC3nOL3Dh+i8gv5OTRLJW0P2zYuAY++kqYnuRsYU18tcfdd0loTGeHV LulurSIMBJV19Qx1K3C3E8KPwbwNwNlvmpEbgXy00Qet7LTPbofcUXDi3mcV6ozQ SMGQh42VGV4S8wEjAp93wLva7xLf6enV/X6Stuzy/ec8HN8K2tYMyYyGvSxZbjmq 9JTxCulDvbr7EnSDqCkH1inlQ4cXvBSzOi2trDpQbKq0cuYbqlzgL2sloBv+y+CB 4fTWEeeisQ05VhHU8Yp3bsVtiyUjGGcbOgvqM+NHmmh1HuHUmiBnXh7k3RfaLfBi E4RprFT0Yrpo16mt3Fn0GjkhY87DYHBGIehFOlu13zKPc5q+IBTwIjSugxoa23EL uXTtFFHMxR4pUV/An5Wf =G6gd -----END PGP SIGNATURE----- Merge tag 'for-linus-v3.7-rc1' of git://oss.sgi.com/xfs/xfs Pull xfs update from Ben Myers: "Several enhancements and cleanups: - make inode32 and inode64 remountable options - SEEK_HOLE/SEEK_DATA enhancements - cleanup struct declarations in xfs_mount.h" * tag 'for-linus-v3.7-rc1' of git://oss.sgi.com/xfs/xfs: xfs: Make inode32 a remountable option xfs: add inode64->inode32 transition into xfs_set_inode32() xfs: Fix mp->m_maxagi update during inode64 remount xfs: reduce code duplication handling inode32/64 options xfs: make inode64 as the default allocation mode xfs: Fix m_agirotor reset during AG selection Make inode64 a remountable option xfs: stop the sync worker before xfs_unmountfs xfs: xfs_seek_hole() refinement with hole searching from page cache for unwritten extents xfs: xfs_seek_data() refinement with unwritten extents check up from page cache xfs: Introduce a helper routine to probe data or hole offset from page cache xfs: Remove type argument from xfs_seek_data()/xfs_seek_hole() xfs: fix race while discarding buffers [V4] xfs: check for possible overflow in xfs_ioc_trim xfs: unlock the AGI buffer when looping in xfs_dialloc xfs: kill struct declarations in xfs_mount.h xfs: fix uninitialised variable in xfs_rtbuf_get()	2012-10-02 20:42:58 -07:00
Linus Torvalds	aab174f0df	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: - big one - consolidation of descriptor-related logics; almost all of that is moved to fs/file.c (BTW, I'm seriously tempted to rename the result to fd.c. As it is, we have a situation when file_table.c is about handling of struct file and file.c is about handling of descriptor tables; the reasons are historical - file_table.c used to be about a static array of struct file we used to have way back). A lot of stray ends got cleaned up and converted to saner primitives, disgusting mess in android/binder.c is still disgusting, but at least doesn't poke so much in descriptor table guts anymore. A bunch of relatively minor races got fixed in process, plus an ext4 struct file leak. - related thing - fget_light() partially unuglified; see fdget() in there (and yes, it generates the code as good as we used to have). - also related - bits of Cyrill's procfs stuff that got entangled into that work; _not_ all of it, just the initial move to fs/proc/fd.c and switch of fdinfo to seq_file. - Alex's fs/coredump.c spiltoff - the same story, had been easier to take that commit than mess with conflicts. The rest is a separate pile, this was just a mechanical code movement. - a few misc patches all over the place. Not all for this cycle, there'll be more (and quite a few currently sit in akpm's tree)." Fix up trivial conflicts in the android binder driver, and some fairly simple conflicts due to two different changes to the sock_alloc_file() interface ("take descriptor handling from sock_alloc_file() to callers" vs "net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries" adding a dentry name to the socket) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits) MAX_LFS_FILESIZE should be a loff_t compat: fs: Generic compat_sys_sendfile implementation fs: push rcu_barrier() from deactivate_locked_super() to filesystems btrfs: reada_extent doesn't need kref for refcount coredump: move core dump functionality into its own file coredump: prevent double-free on an error path in core dumper usb/gadget: fix misannotations fcntl: fix misannotations ceph: don't abuse d_delete() on failure exits hypfs: ->d_parent is never NULL or negative vfs: delete surplus inode NULL check switch simple cases of fget_light to fdget new helpers: fdget()/fdput() switch o2hb_region_dev_write() to fget_light() proc_map_files_readdir(): don't bother with grabbing files make get_file() return its argument vhost_set_vring(): turn pollstart/pollstop into bool switch prctl_set_mm_exe_file() to fget_light() switch xfs_find_handle() to fget_light() switch xfs_swapext() to fget_light() ...	2012-10-02 20:25:04 -07:00
Kirill A. Shutemov	8c0a853770	fs: push rcu_barrier() from deactivate_locked_super() to filesystems There's no reason to call rcu_barrier() on every deactivate_locked_super(). We only need to make sure that all delayed rcu free inodes are flushed before we destroy related cache. Removing rcu_barrier() from deactivate_locked_super() affects some fast paths. E.g. on my machine exit_group() of a last process in IPC namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-02 21:35:55 -04:00
Linus Torvalds	437589a74b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace changes from Eric Biederman: "This is a mostly modest set of changes to enable basic user namespace support. This allows the code to code to compile with user namespaces enabled and removes the assumption there is only the initial user namespace. Everything is converted except for the most complex of the filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs, nfs, ocfs2 and xfs as those patches need a bit more review. The strategy is to push kuid_t and kgid_t values are far down into subsystems and filesystems as reasonable. Leaving the make_kuid and from_kuid operations to happen at the edge of userspace, as the values come off the disk, and as the values come in from the network. Letting compile type incompatible compile errors (present when user namespaces are enabled) guide me to find the issues. The most tricky areas have been the places where we had an implicit union of uid and gid values and were storing them in an unsigned int. Those places were converted into explicit unions. I made certain to handle those places with simple trivial patches. Out of that work I discovered we have generic interfaces for storing quota by projid. I had never heard of the project identifiers before. Adding full user namespace support for project identifiers accounts for most of the code size growth in my git tree. Ultimately there will be work to relax privlige checks from "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing root in a user names to do those things that today we only forbid to non-root users because it will confuse suid root applications. While I was pushing kuid_t and kgid_t changes deep into the audit code I made a few other cleanups. I capitalized on the fact we process netlink messages in the context of the message sender. I removed usage of NETLINK_CRED, and started directly using current->tty. Some of these patches have also made it into maintainer trees, with no problems from identical code from different trees showing up in linux-next. After reading through all of this code I feel like I might be able to win a game of kernel trivial pursuit." Fix up some fairly trivial conflicts in netfilter uid/git logging code. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits) userns: Convert the ufs filesystem to use kuid/kgid where appropriate userns: Convert the udf filesystem to use kuid/kgid where appropriate userns: Convert ubifs to use kuid/kgid userns: Convert squashfs to use kuid/kgid where appropriate userns: Convert reiserfs to use kuid and kgid where appropriate userns: Convert jfs to use kuid/kgid where appropriate userns: Convert jffs2 to use kuid and kgid where appropriate userns: Convert hpfs to use kuid and kgid where appropriate userns: Convert btrfs to use kuid/kgid where appropriate userns: Convert bfs to use kuid/kgid where appropriate userns: Convert affs to use kuid/kgid wherwe appropriate userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids userns: On ia64 deal with current_uid and current_gid being kuid and kgid userns: On ppc convert current_uid from a kuid before printing. userns: Convert s390 getting uid and gid system calls to use kuid and kgid userns: Convert s390 hypfs to use kuid and kgid where appropriate userns: Convert binder ipc to use kuids userns: Teach security_path_chown to take kuids and kgids userns: Add user namespace support to IMA userns: Convert EVM to deal with kuids and kgids in it's hmac computation ...	2012-10-02 11:11:09 -07:00
Linus Torvalds	033d9959ed	Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue changes from Tejun Heo: "This is workqueue updates for v3.7-rc1. A lot of activities this round including considerable API and behavior cleanups. * delayed_work combines a timer and a work item. The handling of the timer part has always been a bit clunky leading to confusing cancelation API with weird corner-case behaviors. delayed_work is updated to use new IRQ safe timer and cancelation now works as expected. * Another deficiency of delayed_work was lack of the counterpart of mod_timer() which led to cancel+queue combinations or open-coded timer+work usages. mod_delayed_work[_on]() are added. These two delayed_work changes make delayed_work provide interface and behave like timer which is executed with process context. * A work item could be executed concurrently on multiple CPUs, which is rather unintuitive and made flush_work() behavior confusing and half-broken under certain circumstances. This problem doesn't exist for non-reentrant workqueues. While non-reentrancy check isn't free, the overhead is incurred only when a work item bounces across different CPUs and even in simulated pathological scenario the overhead isn't too high. All workqueues are made non-reentrant. This removes the distinction between flush_[delayed_]work() and flush_[delayed_]_work_sync(). The former is now as strong as the latter and the specified work item is guaranteed to have finished execution of any previous queueing on return. * In addition to the various bug fixes, Lai redid and simplified CPU hotplug handling significantly. * Joonsoo introduced system_highpri_wq and used it during CPU hotplug. There are two merge commits - one to pull in IRQ safe timer from tip/timers/core and the other to pull in CPU hotplug fixes from wq/for-3.6-fixes as Lai's hotplug restructuring depended on them." Fixed a number of trivial conflicts, but the more interesting conflicts were silent ones where the deprecated interfaces had been used by new code in the merge window, and thus didn't cause any real data conflicts. Tejun pointed out a few of them, I fixed a couple more. * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits) workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() workqueue: remove @delayed from cwq_dec_nr_in_flight() workqueue: fix possible stall on try_to_grab_pending() of a delayed work item workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue: use __cpuinit instead of __devinit for cpu callbacks workqueue: rename manager_mutex to assoc_mutex workqueue: WORKER_REBIND is no longer necessary for idle rebinding workqueue: WORKER_REBIND is no longer necessary for busy rebinding workqueue: reimplement idle worker rebinding workqueue: deprecate __cancel_delayed_work() workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() workqueue: use mod_delayed_work() instead of __cancel + queue workqueue: use irqsafe timer for delayed_work workqueue: clean up delayed_work initializers and add missing one workqueue: make deferrable delayed_work initializer names consistent workqueue: cosmetic whitespace updates for macro definitions workqueue: deprecate system_nrt[_freezable]_wq workqueue: deprecate flush[_delayed]_work_sync() ...	2012-10-02 09:54:49 -07:00
Al Viro	2903ff019b	switch simple cases of fget_light to fdget Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 22:20:08 -04:00
Al Viro	64e09fa2e1	switch xfs_find_handle() to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:11 -04:00
Al Viro	1ea65c9607	switch xfs_swapext() to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:11 -04:00
Carlos Maiolino	2ea0392983	xfs: Make inode32 a remountable option As inode64 is the default option now, and was also made remountable previously, inode32 can also be remounted on-the-fly when it is needed. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 16:01:28 -05:00
Carlos Maiolino	4056c1d08d	xfs: add inode64->inode32 transition into xfs_set_inode32() To make inode32 a remountable option, xfs_set_inode32() should be able to make a transition from inode64 option, disabling inode allocation on higher AGs. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 15:59:50 -05:00
Carlos Maiolino	4c0837224c	xfs: Fix mp->m_maxagi update during inode64 remount With the changes made on xfs_set_inode64(), to make it behave as xfs_set_inode32() (now leaving to the caller the responsibility to update mp->m_maxagi), we use the return value of xfs_set_inode64() to update mp->m_maxagi during remount. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 15:58:21 -05:00
Carlos Maiolino	2d2194f61f	xfs: reduce code duplication handling inode32/64 options Add xfs_set_inode32() to be used to enable inode32 allocation mode. this will reduce the amount of duplicated code needed to mount/remount a filesystem with inode32 option. This patch also changes xfs_set_inode64() to return the maximum AG number that inodes can be allocated instead of set mp->m_maxagi by itself, so that the behaviour is the same as xfs_set_inode32(). This simplifies code that calls these functions and needs to know the maximum AG that inodes can be allocated in. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 15:56:33 -05:00
Carlos Maiolino	08bf540412	xfs: make inode64 as the default allocation mode since 64-bit inodes can be accessed while using inode32, and these can also be used on 32-bit kernels, there is no reason to still keep inode32 as the default mount option. If the filesystem cannot handle 64bit inode numbers (i.e CONFIG_LBDAF is not enabled and BITS_PER_LONG == 32), XFS_MOUNT_SMALL_INUMS will still be set by default, so inode64 is not an unconditional default value. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 15:54:19 -05:00
Carlos Maiolino	8aea3ff411	xfs: Fix m_agirotor reset during AG selection xfs_ialloc_next_ag() currently resets m_agirotor when it is equal to m_maxagi: if (++mp->m_agirotor == mp->m_maxagi) mp->m_agirotor = 0; But, if for some reason mp->m_maxagi changes to a lower value than current m_agirotor, this condition will never be true, causing m_agirotor to exceed the maximum allowed value (m_maxagi). This implies mainly during lookups for xfs_perag structs in its radix tree, since the agno value used for the lookup is based on m_agirotor. An out-of-range m_agirotor may cause a lookup failure which in case will return NULL. As an example, the value of m_maxagi is decreased during inode64->inode32 remount process, case where I've found this problem. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 15:42:42 -05:00
Carlos Maiolino	c3a58fecdd	Make inode64 a remountable option Actually, there is no reason about why a user must umount and mount a XFS filesystem to enable 'inode64' option. So, this patch makes this a remountable option. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 15:41:39 -05:00
Ben Myers	0ba6e5368c	xfs: stop the sync worker before xfs_unmountfs Cancel work of the xfs_sync_worker before teardown of the log in xfs_unmountfs. This prevents occasional crashes on unmount like so: PID: 21602 TASK: ee9df060 CPU: 0 COMMAND: "kworker/0:3" #0 [c5377d28] crash_kexec at c0292c94 #1 [c5377d80] oops_end at c07090c2 #2 [c5377d98] no_context at c06f614e #3 [c5377dbc] __bad_area_nosemaphore at c06f6281 #4 [c5377df4] bad_area_nosemaphore at c06f629b #5 [c5377e00] do_page_fault at c070b0cb #6 [c5377e7c] error_code (via page_fault) at c070892c EAX: f300c6a8 EBX: f300c6a8 ECX: 000000c0 EDX: 000000c0 EBP: c5377ed0 DS: 007b ESI: 00000000 ES: 007b EDI: 00000001 GS: ffffad20 CS: 0060 EIP: c0481ad0 ERR: ffffffff EFLAGS: 00010246 #7 [c5377eb0] atomic64_read_cx8 at c0481ad0 #8 [c5377ebc] xlog_assign_tail_lsn_locked at f7cc7c6e [xfs] #9 [c5377ed4] xfs_trans_ail_delete_bulk at f7ccd520 [xfs] #10 [c5377f0c] xfs_buf_iodone at f7ccb602 [xfs] #11 [c5377f24] xfs_buf_do_callbacks at f7cca524 [xfs] #12 [c5377f30] xfs_buf_iodone_callbacks at f7cca5da [xfs] #13 [c5377f4c] xfs_buf_iodone_work at f7c718d0 [xfs] #14 [c5377f58] process_one_work at c024ee4c #15 [c5377f98] worker_thread at c024f43d #16 [c5377fbc] kthread at c025326b #17 [c5377fe8] kernel_thread_helper at c070e834 PID: 26653 TASK: e79143b0 CPU: 3 COMMAND: "umount" #0 [cde0fda0] __schedule at c0706595 #1 [cde0fe28] schedule at c0706b89 #2 [cde0fe30] schedule_timeout at c0705600 #3 [cde0fe94] __down_common at c0706098 #4 [cde0fec8] __down at c0706122 #5 [cde0fed0] down at c025936f #6 [cde0fee0] xfs_buf_lock at f7c7131d [xfs] #7 [cde0ff00] xfs_freesb at f7cc2236 [xfs] #8 [cde0ff10] xfs_fs_put_super at f7c80f21 [xfs] #9 [cde0ff1c] generic_shutdown_super at c0333d7a #10 [cde0ff38] kill_block_super at c0333e0f #11 [cde0ff48] deactivate_locked_super at c0334218 #12 [cde0ff58] deactivate_super at c033495d #13 [cde0ff68] mntput_no_expire at c034bc13 #14 [cde0ff7c] sys_umount at c034cc69 #15 [cde0ffa0] sys_oldumount at c034ccd4 #16 [cde0ffb0] system_call at c0707e66 commit `11159a05` added this to xfs_log_unmount and needs to be cleaned up at a later date. Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com>	2012-09-18 16:51:26 -05:00
Eric W. Biederman	431f19744d	userns: Convert quota netlink aka quota_send_warning Modify quota_send_warning to take struct kqid instead a type and identifier pair. When sending netlink broadcasts always convert uids and quota identifiers into the intial user namespace. There is as yet no way to send a netlink broadcast message with different contents to receivers in different namespaces, so for the time being just map all of the identifiers into the initial user namespace which preserves the current behavior. Change the callers of quota_send_warning in gfs2, xfs and dquot to generate a struct kqid to pass to quota send warning. When all of the user namespaces convesions are complete a struct kqid values will be availbe without need for conversion, but a conversion is needed now to avoid needing to convert everything at once. Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-09-18 01:01:40 -07:00
Eric W. Biederman	74a8a10378	userns: Convert qutoactl Update the quotactl user space interface to successfull compile with user namespaces support enabled and to hand off quota identifiers to lower layers of the kernel in struct kqid instead of type and qid pairs. The quota on function is not converted because while it takes a quota type and an id. The id is the on disk quota format to use, which is something completely different. The signature of two struct quotactl_ops methods were changed to take struct kqid argumetns get_dqblk and set_dqblk. The dquot, xfs, and ocfs2 implementations of get_dqblk and set_dqblk are minimally changed so that the code continues to work with the change in parameter type. This is the first in a series of changes to always store quota identifiers in the kernel in struct kqid and only use raw type and qid values when interacting with on disk structures or userspace. Always using struct kqid internally makes it hard to miss places that need conversion to or from the kernel internal values. Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-09-18 01:01:39 -07:00
Eric W. Biederman	5f3a4a28ec	userns: Pass a userns parameter into posix_acl_to_xattr and posix_acl_from_xattr - Pass the user namespace the uid and gid values in the xattr are stored in into posix_acl_from_xattr. - Pass the user namespace kuid and kgid values should be converted into when storing uid and gid values in an xattr in posix_acl_to_xattr. - Modify all callers of posix_acl_from_xattr and posix_acl_to_xattr to pass in &init_user_ns. In the short term this change is not strictly needed but it makes the code clearer. In the longer term this change is necessary to be able to mount filesystems outside of the initial user namespace that natively store posix acls in the linux xattr format. Cc: Theodore Tso <tytso@mit.edu> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-09-18 01:01:35 -07:00
Ben Myers	4026c9fde9	xfs: stop the sync worker before xfs_unmountfs Cancel work of the xfs_sync_worker before teardown of the log in xfs_unmountfs. This prevents occasional crashes on unmount like so: PID: 21602 TASK: ee9df060 CPU: 0 COMMAND: "kworker/0:3" #0 [c5377d28] crash_kexec at c0292c94 #1 [c5377d80] oops_end at c07090c2 #2 [c5377d98] no_context at c06f614e #3 [c5377dbc] __bad_area_nosemaphore at c06f6281 #4 [c5377df4] bad_area_nosemaphore at c06f629b #5 [c5377e00] do_page_fault at c070b0cb #6 [c5377e7c] error_code (via page_fault) at c070892c EAX: f300c6a8 EBX: f300c6a8 ECX: 000000c0 EDX: 000000c0 EBP: c5377ed0 DS: 007b ESI: 00000000 ES: 007b EDI: 00000001 GS: ffffad20 CS: 0060 EIP: c0481ad0 ERR: ffffffff EFLAGS: 00010246 #7 [c5377eb0] atomic64_read_cx8 at c0481ad0 #8 [c5377ebc] xlog_assign_tail_lsn_locked at f7cc7c6e [xfs] #9 [c5377ed4] xfs_trans_ail_delete_bulk at f7ccd520 [xfs] #10 [c5377f0c] xfs_buf_iodone at f7ccb602 [xfs] #11 [c5377f24] xfs_buf_do_callbacks at f7cca524 [xfs] #12 [c5377f30] xfs_buf_iodone_callbacks at f7cca5da [xfs] #13 [c5377f4c] xfs_buf_iodone_work at f7c718d0 [xfs] #14 [c5377f58] process_one_work at c024ee4c #15 [c5377f98] worker_thread at c024f43d #16 [c5377fbc] kthread at c025326b #17 [c5377fe8] kernel_thread_helper at c070e834 PID: 26653 TASK: e79143b0 CPU: 3 COMMAND: "umount" #0 [cde0fda0] __schedule at c0706595 #1 [cde0fe28] schedule at c0706b89 #2 [cde0fe30] schedule_timeout at c0705600 #3 [cde0fe94] __down_common at c0706098 #4 [cde0fec8] __down at c0706122 #5 [cde0fed0] down at c025936f #6 [cde0fee0] xfs_buf_lock at f7c7131d [xfs] #7 [cde0ff00] xfs_freesb at f7cc2236 [xfs] #8 [cde0ff10] xfs_fs_put_super at f7c80f21 [xfs] #9 [cde0ff1c] generic_shutdown_super at c0333d7a #10 [cde0ff38] kill_block_super at c0333e0f #11 [cde0ff48] deactivate_locked_super at c0334218 #12 [cde0ff58] deactivate_super at c033495d #13 [cde0ff68] mntput_no_expire at c034bc13 #14 [cde0ff7c] sys_umount at c034cc69 #15 [cde0ffa0] sys_oldumount at c034ccd4 #16 [cde0ffb0] system_call at c0707e66 commit `11159a05` added this to xfs_log_unmount and needs to be cleaned up at a later date. Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com>	2012-09-17 12:05:22 -05:00
Carlos Maiolino	6fb8a90aa3	xfs: fix race while discarding buffers [V4] While xfs_buftarg_shrink() is freeing buffers from the dispose list (filled with buffers from lru list), there is a possibility to have xfs_buf_stale() racing with it, and removing buffers from dispose list before xfs_buftarg_shrink() does it. This happens because xfs_buftarg_shrink() handle the dispose list without locking and the test condition in xfs_buf_stale() checks for the buffer being in any list: if (!list_empty(&bp->b_lru)) If the buffer happens to be on dispose list, this causes the buffer counter of lru list (btp->bt_lru_nr) to be decremented twice (once in xfs_buftarg_shrink() and another in xfs_buf_stale()) causing a wrong account usage of the lru list. This may cause xfs_buftarg_shrink() to return a wrong value to the memory shrinker shrink_slab(), and such account error may also cause an underflowed value to be returned; since the counter is lower than the current number of items in the lru list, a decrement may happen when the counter is 0, causing an underflow on the counter. The fix uses a new flag field (and a new buffer flag) to serialize buffer handling during the shrink process. The new flag field has been designed to use btp->bt_lru_lock/unlock instead of xfs_buf_lock/unlock mechanism. dchinner, sandeen, aquini and aris also deserve credits for this. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-08-29 15:01:11 -05:00
Jeff Liu	b686d1f79a	xfs: xfs_seek_hole() refinement with hole searching from page cache for unwritten extents xfs_seek_hole() refinement with hole searching from page cache for unwritten extent. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-08-24 13:57:10 -05:00
Jeff Liu	52f1acc8b5	xfs: xfs_seek_data() refinement with unwritten extents check up from page cache xfs_seek_data() refinement with unwritten extents check up from page cache. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-08-24 13:56:29 -05:00
Jeff Liu	d126d43f63	xfs: Introduce a helper routine to probe data or hole offset from page cache Introduce helpers to probe data or hole offset from page cache. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-08-24 13:55:09 -05:00
Jeff Liu	834ab12228	xfs: Remove type argument from xfs_seek_data()/xfs_seek_hole() The type is already indicated by the function naming explicitly, so this argument can be omitted from those calls. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-08-24 13:48:05 -05:00
Carlos Maiolino	e599b3253c	xfs: fix race while discarding buffers [V4] While xfs_buftarg_shrink() is freeing buffers from the dispose list (filled with buffers from lru list), there is a possibility to have xfs_buf_stale() racing with it, and removing buffers from dispose list before xfs_buftarg_shrink() does it. This happens because xfs_buftarg_shrink() handle the dispose list without locking and the test condition in xfs_buf_stale() checks for the buffer being in any list: if (!list_empty(&bp->b_lru)) If the buffer happens to be on dispose list, this causes the buffer counter of lru list (btp->bt_lru_nr) to be decremented twice (once in xfs_buftarg_shrink() and another in xfs_buf_stale()) causing a wrong account usage of the lru list. This may cause xfs_buftarg_shrink() to return a wrong value to the memory shrinker shrink_slab(), and such account error may also cause an underflowed value to be returned; since the counter is lower than the current number of items in the lru list, a decrement may happen when the counter is 0, causing an underflow on the counter. The fix uses a new flag field (and a new buffer flag) to serialize buffer handling during the shrink process. The new flag field has been designed to use btp->bt_lru_lock/unlock instead of xfs_buf_lock/unlock mechanism. dchinner, sandeen, aquini and aris also deserve credits for this. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-08-24 13:46:10 -05:00
Tomas Racek	a672e1be30	xfs: check for possible overflow in xfs_ioc_trim If range.start or range.minlen is bigger than filesystem size, return invalid value error. This fixes possible overflow in BTOBB macro when passed value was nearly ULLONG_MAX. Signed-off-by: Tomas Racek <tracek@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-08-23 14:48:44 -05:00
Christoph Hellwig	7612903099	xfs: unlock the AGI buffer when looping in xfs_dialloc Also update some commens in the area to make the code easier to read. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-08-23 14:48:32 -05:00
Dave Chinner	0b9e3f6d84	xfs: fix uninitialised variable in xfs_rtbuf_get() Results in this assert failure in generic/090: XFS: Assertion failed: *nmap >= 1, file: fs/xfs/xfs_bmap.c, line: 4363 ..... Call Trace: [<ffffffff814680db>] xfs_bmapi_read+0x6b/0x370 [<ffffffff814b64b2>] xfs_rtbuf_get+0x42/0x130 [<ffffffff814b6f09>] xfs_rtget_summary+0x89/0x120 [<ffffffff814b7bfe>] xfs_rtallocate_extent_size+0xce/0x340 [<ffffffff814b89f0>] xfs_rtallocate_extent+0x240/0x290 [<ffffffff81462c1a>] xfs_bmap_rtalloc+0x1ba/0x340 [<ffffffff81463a65>] xfs_bmap_alloc+0x35/0x40 [<ffffffff8146f111>] xfs_bmapi_allocate+0xf1/0x350 [<ffffffff8146f9de>] xfs_bmapi_write+0x66e/0xa60 [<ffffffff8144538a>] xfs_iomap_write_direct+0x22a/0x3f0 [<ffffffff8143707b>] __xfs_get_blocks+0x38b/0x5d0 [<ffffffff814372d4>] xfs_get_blocks_direct+0x14/0x20 [<ffffffff811b0081>] do_blockdev_direct_IO+0xf71/0x1eb0 [<ffffffff811b1015>] __blockdev_direct_IO+0x55/0x60 [<ffffffff814355ca>] xfs_vm_direct_IO+0x11a/0x1e0 [<ffffffff8112d617>] generic_file_direct_write+0xd7/0x1b0 [<ffffffff8143e16c>] xfs_file_dio_aio_write+0x13c/0x320 [<ffffffff8143e6f2>] xfs_file_aio_write+0x1c2/0x1d0 [<ffffffff81174a07>] do_sync_write+0xa7/0xe0 [<ffffffff81175288>] vfs_write+0xa8/0x160 [<ffffffff81175702>] sys_pwrite64+0x92/0xb0 [<ffffffff81b68f69>] system_call_fastpath+0x16/0x1b Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-08-23 14:48:16 -05:00
Tejun Heo	43829731dd	workqueue: deprecate flush[_delayed]_work_sync() flush[_delayed]_work_sync() are now spurious. Mark them deprecated and convert all users to flush[_delayed]_work(). If you're cc'd and wondering what's going on: Now all workqueues are non-reentrant and the regular flushes guarantee that the work item is not pending or running on any CPU on return, so there's no reason to use the sync flushes at all and they're going away. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mattia Dongili <malattia@linux.it> Cc: Kent Yoder <key@linux.vnet.ibm.com> Cc: David Airlie <airlied@linux.ie> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: Bryan Wu <bryan.wu@canonical.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-wireless@vger.kernel.org Cc: Anton Vorontsov <cbou@mail.ru> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Avi Kivity <avi@redhat.com>	2012-08-20 14:51:24 -07:00
Tomas Racek	643bfc061c	xfs: check for possible overflow in xfs_ioc_trim If range.start or range.minlen is bigger than filesystem size, return invalid value error. This fixes possible overflow in BTOBB macro when passed value was nearly ULLONG_MAX. Signed-off-by: Tomas Racek <tracek@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-08-16 16:42:52 -05:00
Christoph Hellwig	c4982110ae	xfs: unlock the AGI buffer when looping in xfs_dialloc Also update some commens in the area to make the code easier to read. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-08-16 16:23:59 -05:00
Alex Elder	1ed845df60	xfs: kill struct declarations in xfs_mount.h I noticed that "struct xfs_mount_args" was still declared in "fs/xfs/xfs_mount.h". That struct doesn't even exist any more (and is obviously not referenced elsewhere in that header file). While in there, delete four other unneeded struct declarations in that file. Doing so highlights that "fs/xfs/xfs_trace.h" was relying indirectly on "xfs_mount.h" to be #included in order to declare "struct xfs_bmbt_irec", so add that declaration to resolve that issue. Signed-off-by: Alex Elder <elder@inktank.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-08-16 13:29:35 -05:00
Dave Chinner	a76cccbeef	xfs: fix uninitialised variable in xfs_rtbuf_get() Results in this assert failure in generic/090: XFS: Assertion failed: *nmap >= 1, file: fs/xfs/xfs_bmap.c, line: 4363 ..... Call Trace: [<ffffffff814680db>] xfs_bmapi_read+0x6b/0x370 [<ffffffff814b64b2>] xfs_rtbuf_get+0x42/0x130 [<ffffffff814b6f09>] xfs_rtget_summary+0x89/0x120 [<ffffffff814b7bfe>] xfs_rtallocate_extent_size+0xce/0x340 [<ffffffff814b89f0>] xfs_rtallocate_extent+0x240/0x290 [<ffffffff81462c1a>] xfs_bmap_rtalloc+0x1ba/0x340 [<ffffffff81463a65>] xfs_bmap_alloc+0x35/0x40 [<ffffffff8146f111>] xfs_bmapi_allocate+0xf1/0x350 [<ffffffff8146f9de>] xfs_bmapi_write+0x66e/0xa60 [<ffffffff8144538a>] xfs_iomap_write_direct+0x22a/0x3f0 [<ffffffff8143707b>] __xfs_get_blocks+0x38b/0x5d0 [<ffffffff814372d4>] xfs_get_blocks_direct+0x14/0x20 [<ffffffff811b0081>] do_blockdev_direct_IO+0xf71/0x1eb0 [<ffffffff811b1015>] __blockdev_direct_IO+0x55/0x60 [<ffffffff814355ca>] xfs_vm_direct_IO+0x11a/0x1e0 [<ffffffff8112d617>] generic_file_direct_write+0xd7/0x1b0 [<ffffffff8143e16c>] xfs_file_dio_aio_write+0x13c/0x320 [<ffffffff8143e6f2>] xfs_file_aio_write+0x1c2/0x1d0 [<ffffffff81174a07>] do_sync_write+0xa7/0xe0 [<ffffffff81175288>] vfs_write+0xa8/0x160 [<ffffffff81175702>] sys_pwrite64+0x92/0xb0 [<ffffffff81b68f69>] system_call_fastpath+0x16/0x1b Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-08-16 12:53:12 -05:00
Linus Torvalds	a0e881b7c1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second vfs pile from Al Viro: "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the deadlock reproduced by xfstests 068), symlink and hardlink restriction patches, plus assorted cleanups and fixes. Note that another fsfreeze deadlock (emergency thaw one) is not dealt with - the series by Fernando conflicts a lot with Jan's, breaks userland ABI (FIFREEZE semantics gets changed) and trades the deadlock for massive vfsmount leak; this is going to be handled next cycle. There probably will be another pull request, but that stuff won't be in it." Fix up trivial conflicts due to unrelated changes next to each other in drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) delousing target_core_file a bit Documentation: Correct s_umount state for freeze_fs/unfreeze_fs fs: Remove old freezing mechanism ext2: Implement freezing btrfs: Convert to new freezing mechanism nilfs2: Convert to new freezing mechanism ntfs: Convert to new freezing mechanism fuse: Convert to new freezing mechanism gfs2: Convert to new freezing mechanism ocfs2: Convert to new freezing mechanism xfs: Convert to new freezing code ext4: Convert to new freezing mechanism fs: Protect write paths by sb_start_write - sb_end_write fs: Skip atime update on frozen filesystem fs: Add freezing handling to mnt_want_write() / mnt_drop_write() fs: Improve filesystem freezing handling switch the protection of percpu_counter list to spinlock nfsd: Push mnt_want_write() outside of i_mutex btrfs: Push mnt_want_write() outside of i_mutex fat: Push mnt_want_write() outside of i_mutex ...	2012-08-01 10:26:23 -07:00
Jan Kara	d9457dc056	xfs: Convert to new freezing code Generic code now blocks all writers from standard write paths. So we add blocking of all writers coming from ioctl (we get a protection of ioctl against racing remount read-only as a bonus) and convert xfs_file_aio_write() to a non-racy freeze protection. We also keep freeze protection on transaction start to block internal filesystem writes such as removal of preallocated blocks. CC: Ben Myers <bpm@sgi.com> CC: Alex Elder <elder@kernel.org> CC: xfs@oss.sgi.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-31 09:45:48 +04:00
Linus Torvalds	37cd9600a9	xfs: update for 3.6-rc1 Numerous cleanups and several bug fixes. Here are some highlights: * Discontiguous directory buffer support * Inode allocator refactoring * Removal of the IO lock in inode reclaim * Implementation of .update_time * Fix for handling of EOF in xfs_vm_writepage * Fix for races in xfsaild, and idle mode is re-enabled * Fix for a crash in xfs_buf completion handlers on unmount. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJQFtCvAAoJENaLyazVq6ZOIuEQAINJXb4SK9oBrdwGmq+Vsqf2 Eh4OmzZmdnSPrxfFGmqvyL9DdUBvGBuidwOcVLMAXGtzbxE9USK9NuKC5zN/hJip 8tIyv/8bqZ0aD4RJlHGN5zKFoQh/9Tag+JsaaqWstO8Ir1tA/5p04hDAz492btfT 49SvnV64sJ1fi7pmaJblMWMMtlWJjD6iOldaHwnKBQ3LKmcgy9sD9DY5HiGOTr1j ecKtucX7B8Q9oFLKHaKEwTYZRRYDNuTbqZmI6hlEcA5hT280jotsGA4q/aXx/gHS lZuBaqVtNFT5WCKm+j/et76tmTfIh0CSbo64ZfgSOESy2BkEVXHg5XJ1gDvPdV+L 6eBlUx3jaiNyFVHxVzFhzwKC/XdaITCd/ixFEogRDmoppDXencTCibLJXHNXxupN BCAyTLCxEJIE9WCeOMmwHA0450bMY4or13NGep57pIvG8GomtdG1WncTRIo84KV5 0W5ocaUTGP7ROsr+KF8U9C7H866OHzVFijA+vvcTy8GtsT/xOCFxuJrqPVb+kgD7 mIKaoK7iH6Kufu433TzsLEcUkF36gq/7NytPKjQhURLpZhxkHG3rq6LC0HXp6uuZ QgX5Y5Gl7SwDovIrndXmQXRnGrzvqHLguZl65+rB1CKggjemkLSdSLhryoNVjLU2 iB7/hvzOUdYFMRRz2mLc =2wkC -----END PGP SIGNATURE----- Merge tag 'for-linus-v3.6-rc1' of git://oss.sgi.com/xfs/xfs Pull xfs update from Ben Myers: "Numerous cleanups and several bug fixes. Here are some highlights: - Discontiguous directory buffer support - Inode allocator refactoring - Removal of the IO lock in inode reclaim - Implementation of .update_time - Fix for handling of EOF in xfs_vm_writepage - Fix for races in xfsaild, and idle mode is re-enabled - Fix for a crash in xfs_buf completion handlers on unmount." Fix up trivial conflicts in fs/xfs/{xfs_buf.c,xfs_log.c,xfs_log_priv.h} due to duplicate patches that had already been merged for 3.5. * tag 'for-linus-v3.6-rc1' of git://oss.sgi.com/xfs/xfs: (44 commits) xfs: wait for the write the superblock on unmount xfs: re-enable xfsaild idle mode and fix associated races xfs: remove iolock lock classes xfs: avoid the iolock in xfs_free_eofblocks for evicted inodes xfs: do not take the iolock in xfs_inactive xfs: remove xfs_inactive_attrs xfs: clean up xfs_inactive xfs: do not read the AGI buffer in xfs_dialloc until nessecary xfs: refactor xfs_ialloc_ag_select xfs: add a short cut to xfs_dialloc for the non-NULL agbp case xfs: remove the alloc_done argument to xfs_dialloc xfs: split xfs_dialloc xfs: remove xfs_ialloc_find_free Prefix IO_XX flags with XFS_IO_XX to avoid namespace colision. xfs: remove xfs_inotobp xfs: merge xfs_itobp into xfs_imap_to_bp xfs: handle EOF correctly in xfs_vm_writepage xfs: implement ->update_time xfs: fix comment typo of struct xfs_da_blkinfo. xfs: do not call xfs_bdstrat_cb in xfs_buf_iodone_callbacks ...	2012-07-30 13:37:53 -07:00
Mark Tinguely	9a57fa8ee7	xfs: wait for the write the superblock on unmount v2: Add the xfs_buf_lock to xfs_quiesce_attr(). Add explaination why xfs_buf_lock() is used to wait for write. xfs_wait_buftarg() does not wait for the completion of the write of the uncached superblock. This write can race with the shutdown of the log and causes a panic if the write does not win the race. During the log write, xfsaild_push() will lock the buffer and set the XBF_ASYNC flag. Because the XBF_FLAG is set, complete() is not performed on the buffer's iowait entry, we cannot call xfs_buf_iowait() to wait for the write to complete. The buffer's lock is held until the write is complete, so we can block on a xfs_buf_lock() request to be notified that the write is complete. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-29 16:34:19 -05:00
Brian Foster	8375f922aa	xfs: re-enable xfsaild idle mode and fix associated races xfsaild idle mode logic currently leads to a couple hangs: 1.) If xfsaild is rescheduled in during an incremental scan (i.e., tout != 0) and the target has been updated since the previous run, we can hit the new target and go into idle mode with a still populated ail. 2.) A wake up is only issued when the target is pushed forward. The wake up can race with xfsaild if it is currently in the process of entering idle mode, causing future wake up events to be lost. These hangs have been reproduced and verified as fixed by running xfstests 273 in a loop on a slightly modified upstream kernel. The kernel is modified to re-enable idle mode as previously implemented (when count == 0) and with a revert of commit `670ce93f`, which includes performance improvements that make this harder to reproduce. The solution, the algorithm for which has been outlined by Dave Chinner, is to modify xfsaild to enter idle mode only when the ail is empty and the push target has not been moved forward since the last push. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-29 16:27:57 -05:00
Christoph Hellwig	4f59af758f	xfs: remove iolock lock classes Content-Disposition: inline; filename=xfs-remove-iolock-classes Now that we never take the iolock during inode reclaim we don't need to play games with lock classes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Rich Johnston <rjohnston@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-29 16:23:51 -05:00
Christoph Hellwig	5a15322da1	xfs: avoid the iolock in xfs_free_eofblocks for evicted inodes Same rational as the last patch - these inodes are not reachable, so don't bother with locking. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Rich Johnston <rjohnston@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-29 16:22:20 -05:00
Christoph Hellwig	0b56185b0d	xfs: do not take the iolock in xfs_inactive An inode that enters xfs_inactive has been removed from all global lists but the inode hash, and can't be recycled in xfs_iget before it has been marked reclaimable. Thus taking the iolock in here is not nessecary at all, and given the amount of lockdep false positives it has triggered already I'd rather remove the locking. The only change outside of xfs_inactive is relaxing an assert in xfs_itruncate_extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Rich Johnston <rjohnston@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-29 16:16:49 -05:00
Christoph Hellwig	fe67be036f	xfs: remove xfs_inactive_attrs Remove this helper as the code flow is a lot more obvious when it gets merged into its only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Rich Johnston <rjohnston@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-29 16:15:33 -05:00
Christoph Hellwig	b373e98daa	xfs: clean up xfs_inactive The code to reserve log space and join the inode to the transaction is common for all cases, so don't duplicate it. Also remove the trivial xfs_inactive_symlink_local helper which can simply be opencode now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Rich Johnston <rjohnston@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-29 16:13:09 -05:00
Christoph Hellwig	be60fe54b2	xfs: do not read the AGI buffer in xfs_dialloc until nessecary Refactor the AG selection loop in xfs_dialloc to operate on the in-memory perag data as much as possible. We only read the AGI buffer once we have selected an AG to allocate inodes now instead of for every AG considered. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-29 16:10:54 -05:00
Christoph Hellwig	55d6af64cb	xfs: refactor xfs_ialloc_ag_select Loop over the in-core perag structures and prefer using pagi_freecount over going out to the AGI buffer where possible. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-29 16:08:13 -05:00
Christoph Hellwig	4bb61069d2	xfs: add a short cut to xfs_dialloc for the non-NULL agbp case In this case we already have selected an AG and know it has free space beause the buffer lock never got released. Jump directly into xfs_dialloc_ag and short cut the AG selection loop. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-29 16:03:23 -05:00
Christoph Hellwig	08358906ed	xfs: remove the alloc_done argument to xfs_dialloc We can simplify check the IO_agbp pointer for being non-NULL instead of passing another argument through two layers of function calls. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-29 16:00:31 -05:00
Christoph Hellwig	f2ecc5e453	xfs: split xfs_dialloc Move the actual allocation once we have selected an allocation group into a separate helper, and make xfs_dialloc a wrapper around it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-29 15:56:49 -05:00
Linus Torvalds	a66d2c8f7e	Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull the big VFS changes from Al Viro: "This one is big and changes quite a few things around VFS. What's in there: - the first of two really major architecture changes - death to open intents. The former is finally there; it was very long in making, but with Miklos getting through really hard and messy final push in fs/namei.c, we finally have it. Unlike his variant, this one doesn't introduce struct opendata; what we have instead is ->atomic_open() taking preallocated struct file * and passing everything via its fields. Instead of returning struct file , it returns -E... on error, 0 on success and 1 in "deal with it yourself" case (e.g. symlink found on server, etc.). See comments before fs/namei.c:atomic_open(). That made a lot of goodies finally possible and quite a few are in that pile: ->lookup(), ->d_revalidate() and ->create() do not get struct nameidata anymore; ->lookup() and ->d_revalidate() get lookup flags instead, ->create() gets "do we want it exclusive" flag. With the introduction of new helper (kern_path_locked()) we are rid of all struct nameidata instances outside of fs/namei.c; it's still visible in namei.h, but not for long. Come the next cycle, declaration will move either to fs/internal.h or to fs/namei.c itself. [me, miklos, hch] - The second major change: behaviour of final fput(). Now we have __fput() done without any locks held by caller and not from deep in call stack. That obviously lifts a lot of constraints on the locking in there. Moreover, it's legal now to call fput() from atomic contexts (which has immediately simplified life for aio.c). We also don't need anti-recursion logics in __scm_destroy() anymore. There is a price, though - the damn thing has become partially asynchronous. For fput() from normal process we are guaranteed that pending __fput() will be done before the caller returns to userland, exits or gets stopped for ptrace. For kernel threads and atomic contexts it's done via schedule_work(), so theoretically we might need a way to make sure it's finished; so far only one such place had been found, but there might be more. There's flush_delayed_fput() (do all pending __fput()) and there's __fput_sync() (fput() analog doing __fput() immediately). I hope we won't need them often; see warnings in fs/file_table.c for details. [me, based on task_work series from Oleg merged last cycle] - sync series from Jan - large part of "death to sync_supers()" work from Artem; the only bits missing here are exofs and ext4 ones. As far as I understand, those are going via the exofs and ext4 trees resp.; once they are in, we can put ->write_super() to the rest, along with the thread calling it. - preparatory bits from unionmount series (from dhowells). - assorted cleanups and fixes all over the place, as usual. This is not the last pile for this cycle; there's at least jlayton's ESTALE work and fsfreeze series (the latter - in dire need of fixes, so I'm not sure it'll make the cut this cycle). I'll probably throw symlink/hardlink restrictions stuff from Kees into the next pile, too. Plus there's a lot of misc patches I hadn't thrown into that one - it's large enough as it is..." * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits) ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file() btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file() switch dentry_open() to struct path, make it grab references itself spufs: shift dget/mntget towards dentry_open() zoran: don't bother with struct file * in zoran_map ecryptfs: don't reinvent the wheels, please - use struct completion don't expose I_NEW inodes via dentry->d_inode tidy up namei.c a bit unobfuscate follow_up() a bit ext3: pass custom EOF to generic_file_llseek_size() ext4: use core vfs llseek code for dir seeks vfs: allow custom EOF in generic_file_llseek code vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes vfs: Remove unnecessary flushing of block devices vfs: Make sys_sync writeout also block device inodes vfs: Create function for iterating over block devices vfs: Reorder operations during sys_sync quota: Move quota syncing to ->sync_fs method quota: Split dquot_quota_sync() to writeback and cache flushing part vfs: Move noop_backing_dev_info check from sync into writeback ...	2012-07-23 12:27:27 -07:00
Al Viro	765927b2d5	switch dentry_open() to struct path, make it grab references itself Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-23 00:01:29 +04:00
Christoph Hellwig	824c313139	xfs: remove xfs_ialloc_find_free This function is entirely trivial and only has one caller, so remove it to simplify the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-22 13:40:10 -05:00
Alain Renaud	0d882a360b	Prefix IO_XX flags with XFS_IO_XX to avoid namespace colision. Add a XFS_ prefix to IO_DIRECT,XFS_IO_DELALLOC, XFS_IO_UNWRITTEN and XFS_IO_OVERWRITE. This to avoid namespace conflict with other modules. Signed-off-by: Alain Renaud <arenaud@sgi.com> Reviewed-by: Rich Johnston <rjohnston@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-22 11:00:55 -05:00
Christoph Hellwig	129dbc9a2d	xfs: remove xfs_inotobp There is no need to keep this helper around, opencoding it in the only caller is just as clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-22 10:55:36 -05:00
Christoph Hellwig	475ee413f3	xfs: merge xfs_itobp into xfs_imap_to_bp All callers of xfs_imap_to_bp want the dinode pointer, so let's calculate it inside xfs_imap_to_bp. Once that is done xfs_itobp becomes a fairly pointless wrapper which can be replaced with direct calls to xfs_imap_to_bp. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-22 10:46:56 -05:00
Christoph Hellwig	6b7a03f03a	xfs: handle EOF correctly in xfs_vm_writepage We need to zero out part of a page which beyond EOF before setting uptodate, otherwise, mapread or write will see non-zero data beyond EOF. Based on the code in fs/buffer.c and the following ext4 commit: ext4: handle EOF correctly in ext4_bio_write_page() And yes, I wish we had a good test case for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-22 10:42:56 -05:00
Christoph Hellwig	69ff282611	xfs: implement ->update_time Use this new method to replace our hacky use of ->dirty_inode. An additional benefit is that we can now propagate errors up the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-22 10:38:32 -05:00
Chen Baozi	96ee34be7a	xfs: fix comment typo of struct xfs_da_blkinfo. Fix trivial typo error that has written "It" to "Is". Signed-off-by: Chen Baozi <baozich@gmail.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-22 10:34:42 -05:00
Al Viro	ebfc3b49a7	don't pass nameidata to ->create() boolean "does it have to be exclusive?" flag is passed instead; Local filesystem should just ignore it - the object is guaranteed not to be there yet. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:34:47 +04:00
Al Viro	00cd8dd3bf	stop passing nameidata to ->lookup() Just the flags; only NFS cares even about that, but there are legitimate uses for such argument. And getting rid of that completely would require splitting ->lookup() into a couple of methods (at least), so let's leave that alone for now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:34:32 +04:00
Christoph Hellwig	1632dcc93f	xfs: do not call xfs_bdstrat_cb in xfs_buf_iodone_callbacks xfs_bdstrat_cb only adds a check for a shutdown filesystem over xfs_buf_iorequest, but xfs_buf_iodone_callbacks just checked for a shut down filesystem a little earlier. In addition the shutdown handling in xfs_bdstrat_cb is not very suitable for this caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-13 13:09:49 -05:00
Christoph Hellwig	40a9b7963d	xfs: prevent recursion in xfs_buf_iorequest If the b_iodone handler is run in calling context in xfs_buf_iorequest we can run into a recursion where xfs_buf_iodone_callbacks keeps calling back into xfs_buf_iorequest because an I/O error happened, which keeps calling back into xfs_buf_iorequest. This chain will usually not take long because the filesystem gets shut down because of log I/O errors, but even over a short time it can cause stack overflows if run on the same context. As a short term workaround make sure we always call the iodone handler in workqueue context. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-13 13:09:39 -05:00
Dave Chinner	aa292847b9	xfs: don't defer metadata allocation to the workqueue Almost all metadata allocations come from shallow stack usage situations. Avoid the overhead of switching the allocation to a workqueue as we are not in danger of running out of stack when making these allocations. Metadata allocations are already marked through the args that are passed down, so this is trivial to do. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reported-by: Mel Gorman <mgorman@suse.de> Tested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-13 13:09:27 -05:00
Dave Chinner	e3a746f5aa	xfs: really fix the cursor leak in xfs_alloc_ag_vextent_near The current cursor is reallocated when retrying the allocation, so the existing cursor needs to be destroyed in both the restart and the failure cases. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-13 13:09:18 -05:00
Christoph Hellwig	a2dcf5df5f	xfs: do not call xfs_bdstrat_cb in xfs_buf_iodone_callbacks xfs_bdstrat_cb only adds a check for a shutdown filesystem over xfs_buf_iorequest, but xfs_buf_iodone_callbacks just checked for a shut down filesystem a little earlier. In addition the shutdown handling in xfs_bdstrat_cb is not very suitable for this caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-13 12:50:54 -05:00
Christoph Hellwig	08023d6dbe	xfs: prevent recursion in xfs_buf_iorequest If the b_iodone handler is run in calling context in xfs_buf_iorequest we can run into a recursion where xfs_buf_iodone_callbacks keeps calling back into xfs_buf_iorequest because an I/O error happened, which keeps calling back into xfs_buf_iorequest. This chain will usually not take long because the filesystem gets shut down because of log I/O errors, but even over a short time it can cause stack overflows if run on the same context. As a short term workaround make sure we always call the iodone handler in workqueue context. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-13 12:50:42 -05:00
Dave Chinner	eb71a12e41	xfs: don't defer metadata allocation to the workqueue Almost all metadata allocations come from shallow stack usage situations. Avoid the overhead of switching the allocation to a workqueue as we are not in danger of running out of stack when making these allocations. Metadata allocations are already marked through the args that are passed down, so this is trivial to do. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reported-by: Mel Gorman <mgorman@suse.de> Tested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-13 12:50:24 -05:00
Dave Chinner	1f432a887e	xfs: really fix the cursor leak in xfs_alloc_ag_vextent_near The current cursor is reallocated when retrying the allocation, so the existing cursor needs to be destroyed in both the restart and the failure cases. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-13 12:47:58 -05:00
Dave Chinner	9b73bd7b61	xfs: factor buffer reading from xfs_dir2_leaf_getdents The buffer reading code in xfs_dir2_leaf_getdents is complex and difficult to follow due to the readahead and all the context is carries. it is also badly indented and so difficult to read. Factor it out into a separate function to make it easier to understand and optimise in future patches. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-01 14:50:08 -05:00
Dave Chinner	1d9025e561	xfs: remove struct xfs_dabuf and infrastructure The struct xfs_dabuf now only tracks a single xfs_buf and all the information it holds can be gained directly from the xfs_buf. Hence we can remove the struct dabuf and pass the xfs_buf around everywhere. Kill the struct dabuf and the associated infrastructure. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-01 14:50:07 -05:00
Dave Chinner	3605431fb9	xfs: use discontiguous xfs_buf support in dabuf wrappers First step in converting the directory code to use native discontiguous buffers and replacing the dabuf construct. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-01 14:50:07 -05:00
Dave Chinner	372cc85ec6	xfs: support discontiguous buffers in the xfs_buf_log_item discontigous buffer in separate buffer format structures. This means log recovery will recover all the changes on a per segment basis without requiring any knowledge of the fact that it was logged from a compound buffer. To do this, we need to be able to determine what buffer segment any given offset into the compound buffer sits over. This enables us to translate the dirty bitmap in the number of separate buffer format structures required. We also need to be able to determine the number of bitmap elements that a given buffer segment has, as this determines the size of the buffer format structure. Hence we need to be able to determine the both the start offset into the buffer and the length of a given segment to be able to calculate this. With this information, we can preallocate, build and format the correct log vector array for each segment in a compound buffer to appear exactly the same as individually logged buffers in the log. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-01 14:50:06 -05:00
Dave Chinner	de2a4f5919	xfs: add discontiguous buffer support to transactions Now that the buffer cache supports discontiguous buffers, add support to the transaction buffer interface for getting and reading buffers. Note that this patch does not convert the buffer item logging to support discontiguous buffers. That will be done as a separate commit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-01 14:50:06 -05:00
Dave Chinner	6dde27077e	xfs: add discontiguous buffer map interface With the internal interfaces supporting discontiguous buffer maps, add external lookup, read and get interfaces so they can start to be used. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-01 14:50:05 -05:00
Dave Chinner	3e85c868a6	xfs: convert internal buffer functions to pass maps While the external interface currently uses separate blockno/length variables, we need to move internal interfaces to passing and parsing vector maps. This will then allow us to add external interfaces to support discontiguous buffer maps as the internal code will already support them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-01 14:50:05 -05:00
Dave Chinner	cbb7baab28	xfs: separate buffer indexing from block map To support discontiguous buffers in the buffer cache, we need to separate the cache index variables from the I/O map. While this is currently a 1:1 mapping, discontiguous buffer support will break this relationship. However, for caching purposes, we can still treat them the same as a contiguous buffer - the block number of the first block and the length of the buffer - as that is still a unique representation. Also, the only way we will ever access the discontiguous regions of buffers is via bulding the complete buffer in the first place, so using the initial block number and entire buffer length is a sane way to index the buffers. Add a block mapping vector construct to the xfs_buf and use it in the places where we are doing IO instead of the current b_bn/b_length variables. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-01 14:50:04 -05:00
Dave Chinner	77c1a08fc9	xfs: struct xfs_buf_log_format isn't variable sized. The struct xfs_buf_log_format wants to think the dirty bitmap is variable sized. In fact, it is variable size on disk simply due to the way we map it from the in-memory structure, but we still just use a fixed size memory allocation for the in-memory structure. Hence it makes no sense to set the function up as a variable sized structure when we already know it's maximum size, and we always allocate it as such. Simplify the structure by making the dirty bitmap a fixed sized array and just using the size of the structure for the allocation size. This will make it much simpler to allocate and manipulate an array of format structures for discontiguous buffer support. The previous struct xfs_buf_log_item size according to /proc/slabinfo was 224 bytes. pahole doesn't give the same size because of the variable size definition. With this modification, pahole reports the same as /proc/slabinfo: /* size: 224, cachelines: 4, members: 6 */ Because the xfs_buf_log_item size is now determined by the maximum supported block size we introduce a dependency on xfs_alloc_btree.h. Avoid this dependency by moving the idefines for the maximum block sizes supported to xfs_types.h with all the other max/min type defines to avoid any new dependencies. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-07-01 14:50:04 -05:00
Mark Tinguely	9a8d2fdbb4	xfs: remove xlog_t typedef Remove the xlog_t type definitions. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-21 14:22:27 -05:00
Mark Tinguely	f7bdf03a99	xfs: rename log structure to xlog Rename the XFS log structure to xlog to help crash distinquish it from the other logs in Linux. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-21 14:21:11 -05:00
Ben Myers	8866fc6fa5	xfs: shutdown xfs_sync_worker before the log Revert commit `1307bbd`, which uses the s_umount semaphore to provide exclusion between xfs_sync_worker and unmount, in favor of shutting down the sync worker before freeing the log in xfs_log_unmount. This is a cleaner way of resolving the race between xfs_sync_worker and unmount than using s_umount. Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2012-06-21 14:20:48 -05:00
Jan Kara	59c84ed0dd	xfs: Fix overallocation in xfs_buf_allocate_memory() Commit `de1cbee` which removed b_file_offset in favor of b_bn introduced a bug causing xfs_buf_allocate_memory() to overestimate the number of necessary pages. The problem is that xfs_buf_alloc() sets b_bn to -1 and thus effectively every buffer is straddling a page boundary which causes xfs_buf_allocate_memory() to allocate two pages and use vmalloc() for access which is unnecessary. Dave says xfs_buf_alloc() doesn't need to set b_bn to -1 anymore since the buffer is inserted into the cache only after being fully initialized now. So just make xfs_buf_alloc() fill in proper block number from the beginning. CC: David Chinner <dchinner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-21 14:20:36 -05:00
Dave Chinner	76d095388b	xfs: fix allocbt cursor leak in xfs_alloc_ag_vextent_near When we fail to find an matching extent near the requested extent specification during a left-right distance search in xfs_alloc_ag_vextent_near, we fail to free the original cursor that we used to look up the XFS_BTNUM_CNT tree and hence leak it. Reported-by: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-21 14:20:20 -05:00
Brian Foster	9a3a5dab63	xfs: check for stale inode before acquiring iflock on push An inode in the AIL can be flush locked and marked stale if a cluster free transaction occurs at the right time. The inode item is then marked as flushing, which causes xfsaild to spin and leaves the filesystem stalled. This is reproduced by running xfstests 273 in a loop for an extended period of time. Check for stale inodes before the flush lock. This marks the inode as pinned, leads to a log flush and allows the filesystem to proceed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-21 14:20:06 -05:00
Mark Tinguely	ad223e6030	xfs: rename log structure to xlog Rename the XFS log structure to xlog to help crash distinquish it from the other logs in Linux. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-21 13:49:39 -05:00
Ben Myers	11159a0500	xfs: shutdown xfs_sync_worker before the log Revert commit `1307bbd`, which uses the s_umount semaphore to provide exclusion between xfs_sync_worker and unmount, in favor of shutting down the sync worker before freeing the log in xfs_log_unmount. This is a cleaner way of resolving the race between xfs_sync_worker and unmount than using s_umount. Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2012-06-21 13:41:04 -05:00
Jan Kara	bcf62ab64d	xfs: Fix overallocation in xfs_buf_allocate_memory() Commit `de1cbee` which removed b_file_offset in favor of b_bn introduced a bug causing xfs_buf_allocate_memory() to overestimate the number of necessary pages. The problem is that xfs_buf_alloc() sets b_bn to -1 and thus effectively every buffer is straddling a page boundary which causes xfs_buf_allocate_memory() to allocate two pages and use vmalloc() for access which is unnecessary. Dave says xfs_buf_alloc() doesn't need to set b_bn to -1 anymore since the buffer is inserted into the cache only after being fully initialized now. So just make xfs_buf_alloc() fill in proper block number from the beginning. CC: David Chinner <dchinner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-21 13:38:25 -05:00
Dave Chinner	079da28c64	xfs: fix allocbt cursor leak in xfs_alloc_ag_vextent_near When we fail to find an matching extent near the requested extent specification during a left-right distance search in xfs_alloc_ag_vextent_near, we fail to free the original cursor that we used to look up the XFS_BTNUM_CNT tree and hence leak it. Reported-by: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-21 12:32:43 -05:00
Brian Foster	76e8f13866	xfs: check for stale inode before acquiring iflock on push An inode in the AIL can be flush locked and marked stale if a cluster free transaction occurs at the right time. The inode item is then marked as flushing, which causes xfsaild to spin and leaves the filesystem stalled. This is reproduced by running xfstests 273 in a loop for an extended period of time. Check for stale inodes before the flush lock. This marks the inode as pinned, leads to a log flush and allows the filesystem to proceed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-21 10:53:43 -05:00
Jeff Liu	3b876c8f2a	xfs: fix debug_object WARN at xfs_alloc_vextent() Fengguang reports: [ 780.529603] XFS (vdd): Ending clean mount [ 781.454590] ODEBUG: object is on stack, but not annotated [ 781.455433] ------------[ cut here ]------------ [ 781.455433] WARNING: at /c/kernel-tests/sound/lib/debugobjects.c:301 __debug_object_init+0x173/0x1f1() [ 781.455433] Hardware name: Bochs [ 781.455433] Modules linked in: [ 781.455433] Pid: 26910, comm: kworker/0:2 Not tainted 3.4.0+ #51 [ 781.455433] Call Trace: [ 781.455433] [<ffffffff8106bc84>] warn_slowpath_common+0x83/0x9b [ 781.455433] [<ffffffff8106bcb6>] warn_slowpath_null+0x1a/0x1c [ 781.455433] [<ffffffff814919a5>] __debug_object_init+0x173/0x1f1 [ 781.455433] [<ffffffff81491c65>] debug_object_init+0x14/0x16 [ 781.455433] [<ffffffff8108842a>] __init_work+0x20/0x22 [ 781.455433] [<ffffffff8134ea56>] xfs_alloc_vextent+0x6c/0xd5 Use INIT_WORK_ONSTACK in xfs_alloc_vextent instead of INIT_WORK. Reported-by: Wu Fengguang <wfg@linux.intel.com> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-20 14:58:24 -05:00
Alain Renaud	66f9311381	xfs: xfs_vm_writepage clear iomap_valid when !buffer_uptodate (REV2) On filesytems with a block size smaller than PAGE_SIZE we currently have a problem with unwritten extents. If a we have multi-block page for which an unwritten extent has been allocated, and only some of the buffers have been written to, and they are not contiguous, we can expose stale data from disk in the blocks between the writes after extent conversion. Example of a page with unwritten and real data. buffer content 0 empty b_state = 0 1 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 2 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 3 empty b_state = 0 4 empty b_state = 0 5 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 6 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 7 empty b_state = 0 Buffers 1, 2, 5, and 6 have been written to, leaving 0, 3, 4, and 7 empty. Currently buffers 1, 2, 5, and 6 are added to a single ioend, and when IO has completed, extent conversion creates a real extent from block 1 through block 6, leaving 0 and 7 unwritten. However buffers 3 and 4 were not written to disk, so stale data is exposed from those blocks on a subsequent read. Fix this by setting iomap_valid = 0 when we find a buffer that is not Uptodate. This ensures that buffers 5 and 6 are not added to the same ioend as buffers 1 and 2. Later these blocks will be converted into two separate real extents, leaving the blocks in between unwritten. Signed-off-by: Alain Renaud <arenaud@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-20 14:57:28 -05:00
Chen Baozi	51c84223af	xfs: fix typo in comment of xfs_dinode_t. There should be "XFS_DFORK_DPTR, XFS_DFORK_APTR, and XFS_DFORK_PTR" instead of "XFS_DFORK_PTR, XFS_DFORK_DPTR, and XFS_DFORK_PTR". Signed-off-by: Chen Baozi <baozich@gmail.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-14 12:28:26 -05:00
Dave Chinner	5276432997	xfs: kill copy and paste segment checks in xfs_file_aio_read The generic segment check code now returns a count of the number of bytes in the iovec, so we don't need to roll our own anymore. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-14 12:28:25 -05:00
Dave Chinner	32972383ca	xfs: make largest supported offset less shouty XFS_MAXIOFFSET() is just a simple macro that resolves to mp->m_maxioffset. It doesn't need to exist, and it just makes the code unnecessarily loud and shouty. Make it quiet and easy to read. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-14 12:28:24 -05:00
Dave Chinner	d2c2819117	xfs: m_maxioffset is redundant The m_maxioffset field in the struct xfs_mount contains the same value as the superblock s_maxbytes field. There is no need to carry two copies of this limit around, so use the VFS superblock version. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-14 12:28:22 -05:00
Jeff Liu	0f2cf9d3d9	xfs: fix debug_object WARN at xfs_alloc_vextent() Fengguang reports: [ 780.529603] XFS (vdd): Ending clean mount [ 781.454590] ODEBUG: object is on stack, but not annotated [ 781.455433] ------------[ cut here ]------------ [ 781.455433] WARNING: at /c/kernel-tests/sound/lib/debugobjects.c:301 __debug_object_init+0x173/0x1f1() [ 781.455433] Hardware name: Bochs [ 781.455433] Modules linked in: [ 781.455433] Pid: 26910, comm: kworker/0:2 Not tainted 3.4.0+ #51 [ 781.455433] Call Trace: [ 781.455433] [<ffffffff8106bc84>] warn_slowpath_common+0x83/0x9b [ 781.455433] [<ffffffff8106bcb6>] warn_slowpath_null+0x1a/0x1c [ 781.455433] [<ffffffff814919a5>] __debug_object_init+0x173/0x1f1 [ 781.455433] [<ffffffff81491c65>] debug_object_init+0x14/0x16 [ 781.455433] [<ffffffff8108842a>] __init_work+0x20/0x22 [ 781.455433] [<ffffffff8134ea56>] xfs_alloc_vextent+0x6c/0xd5 Use INIT_WORK_ONSTACK in xfs_alloc_vextent instead of INIT_WORK. Reported-by: Wu Fengguang <wfg@linux.intel.com> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-14 12:28:21 -05:00
Alain Renaud	7d0fa3ecba	xfs: xfs_vm_writepage clear iomap_valid when !buffer_uptodate (REV2) On filesytems with a block size smaller than PAGE_SIZE we currently have a problem with unwritten extents. If a we have multi-block page for which an unwritten extent has been allocated, and only some of the buffers have been written to, and they are not contiguous, we can expose stale data from disk in the blocks between the writes after extent conversion. Example of a page with unwritten and real data. buffer content 0 empty b_state = 0 1 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 2 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 3 empty b_state = 0 4 empty b_state = 0 5 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 6 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 7 empty b_state = 0 Buffers 1, 2, 5, and 6 have been written to, leaving 0, 3, 4, and 7 empty. Currently buffers 1, 2, 5, and 6 are added to a single ioend, and when IO has completed, extent conversion creates a real extent from block 1 through block 6, leaving 0 and 7 unwritten. However buffers 3 and 4 were not written to disk, so stale data is exposed from those blocks on a subsequent read. Fix this by setting iomap_valid = 0 when we find a buffer that is not Uptodate. This ensures that buffers 5 and 6 are not added to the same ioend as buffers 1 and 2. Later these blocks will be converted into two separate real extents, leaving the blocks in between unwritten. Signed-off-by: Alain Renaud <arenaud@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-06-14 12:28:20 -05:00
Josef Bacik	c3b2da3148	fs: introduce inode operation ->update_time Btrfs has to make sure we have space to allocate new blocks in order to modify the inode, so updating time can fail. We've gotten around this by having our own file_update_time but this is kind of a pain, and Christoph has indicated he would like to make xfs do something different with atime updates. So introduce ->update_time, where we will deal with i_version an a/m/c time updates and indicate which changes need to be made. The normal version just does what it has always done, updates the time and marks the inode dirty, and then filesystems can choose to do something different. I've gone through all of the users of file_update_time and made them check for errors with the exception of the fault code since it's complicated and I wasn't quite sure what to do there, also Jan is going to be pushing the file time updates into page_mkwrite for those who have it so that should satisfy btrfs and make it not a big deal to check the file_update_time() return code in the generic fault path. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-06-01 12:07:25 -04:00
Al Viro	b0b0382bb4	->encode_fh() API change pass inode + parent's inode or NULL instead of dentry + bool saying whether we want the parent or not. NOTE: that needs ceph fix folded in. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-05-29 23:28:33 -04:00
Al Viro	77ba78776e	xfs: switch to proper __bitwise type for KM_... flags Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-05-29 23:28:32 -04:00
Linus Torvalds	90324cc1b1	avoid iput() from flusher thread -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJPw2J/AAoJECvKgwp+S8Ja5jkP/3uMxkhf8XQpXCI3O1QVfaQr uZFfM8sINqIPDVm1dtFjFj7f8Bw9mhE2KAnnJ1rKT8tQwqq9yAse1QPlhCG1ZqoP +AnMDDXHtx7WmQZXhBvS9b+unpZ7Jr6r6pO5XrmTL2kRL3YJPUhZ2+xbTT5belTB KoAu4WqORZRxfXoC76S7U8K+D4NcAGhAOxCClsIjmY+oocCiCag4FZOyzYIFViqc ghUN/+rLQ3fqGGv2yO7Ylx1gUM7sxIwkZQ/h962jFAtxz9czImr2NmRoMliOaOkS tvcnIf+E3u0n/zIjzFvzhxKgHJPP8PkcPMk60d3jKmFngBkqFTzNUeVTP8md7HrV 4DlXisWr+z7YVyWUCFaNcJLmjiWSwQ8DV/clRLobeBf9EJKan5F1PjFgl6PLJM5F Qr1+LHMNaetdulBwMRTyveZTzYqw9RmDnD9dWMo4mX/kTpvtC4jTPVV7hkRD+Qlv 5vTRR+VXL3Q50yClLf0AQMSKTnH2gBuepM/b+7cShLGfsMln8DtUjmbigv+niL63 BibcCIbIlP2uWGnl37VhsC34AT+RKt3lggrBOpn/7XJMq/wKR7IRP/7V9TfYgaUN NBa+wtnLDa1pZEn/X7izdcQP62PzDtmB+ObvYT0Yb40A4+2ud3qF/lB53c1A1ewF /9c4zxxekjHZnn2oooEa =oLXf -----END PGP SIGNATURE----- Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback tree from Wu Fengguang: "Mainly from Jan Kara to avoid iput() in the flusher threads." * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Avoid iput() from flusher thread vfs: Rename end_writeback() to clear_inode() vfs: Move waiting for inode writeback from end_writeback() to evict_inode() writeback: Refactor writeback_single_inode() writeback: Remove wb->list_lock from writeback_single_inode() writeback: Separate inode requeueing after writeback writeback: Move I_DIRTY_PAGES handling writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() writeback: Move clearing of I_SYNC into inode_sync_complete() writeback: initialize global_dirty_limit fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds mm: page-writeback.c: local functions should not be exposed globally	2012-05-28 09:54:45 -07:00
Dave Chinner	14c26c6a05	xfs: add trace points for log forces To enable easy tracing of the location of log forces and the frequency of them via perf, add a pair of trace points to the log force functions. This will help debug where excessive log forces are being issued from by simple perf commands like: # ~/perf/perf top -e xfs:xfs_log_force -G -U Which gives this sort of output: Events: 141 xfs:xfs_log_force - 100.00% [kernel] [k] xfs_log_force - xfs_log_force 87.04% xfsaild kthread kernel_thread_helper - 12.87% xfs_buf_lock _xfs_buf_find xfs_buf_get xfs_trans_get_buf xfs_da_do_buf xfs_da_get_buf xfs_dir2_data_init xfs_dir2_leaf_addname xfs_dir_createname xfs_create xfs_vn_mknod xfs_vn_create vfs_create do_last.isra.41 path_openat do_filp_open do_sys_open sys_open system_call_fastpath Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sig.com>	2012-05-21 10:45:44 -05:00
Peter Watkins	3ba3160374	xfs: fix memory reclaim deadlock on agi buffer Note xfs_iget can be called while holding a locked agi buffer. If it goes into memory reclaim then inode teardown may try to lock the same buffer. Prevent the deadlock by calling radix_tree_preload with GFP_NOFS. Signed-off-by: Peter Watkins <treestem@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-21 10:45:44 -05:00
Dave Chinner	ea562ed6e7	xfs: fix delalloc quota accounting on failure xfstest 270 was causing quota reservations way beyond what was sane (ten to hundreds of TB) for a 4GB filesystem. There's a sign problem in the error handling path of xfs_bmapi_reserve_delalloc() because xfs_trans_unreserve_quota_nblks() simple negates the value passed - which doesn't work for an unsigned variable. This causes reservations of close to 2^32 block instead of removing a reservation of a handful of blocks. Fix the same problem in the other xfs_trans_unreserve_quota_nblks() callers where unsigned integer variables are used, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-21 10:45:43 -05:00
Ben Myers	1307bbd2af	xfs: protect xfs_sync_worker with s_umount semaphore xfs_sync_worker checks the MS_ACTIVE flag in s_flags to avoid doing work during mount and unmount. This flag can be cleared by unmount after the xfs_sync_worker checks it but before the work is completed. The has caused crashes in the completion handler for the dummy transaction commited by xfs_sync_worker: PID: 27544 TASK: ffff88013544e040 CPU: 3 COMMAND: "kworker/3:0" #0 [ffff88016fdff930] machine_kexec at ffffffff810244e9 #1 [ffff88016fdff9a0] crash_kexec at ffffffff8108d053 #2 [ffff88016fdffa70] oops_end at ffffffff813ad1b8 #3 [ffff88016fdffaa0] no_context at ffffffff8102bd48 #4 [ffff88016fdffaf0] __bad_area_nosemaphore at ffffffff8102c04d #5 [ffff88016fdffb40] bad_area_nosemaphore at ffffffff8102c12e #6 [ffff88016fdffb50] do_page_fault at ffffffff813afaee #7 [ffff88016fdffc60] page_fault at ffffffff813ac635 [exception RIP: xlog_get_lowest_lsn+0x30] RIP: ffffffffa04a9910 RSP: ffff88016fdffd10 RFLAGS: 00010246 RAX: ffffc90014e48000 RBX: ffff88014d879980 RCX: ffff88014d879980 RDX: ffff8802214ee4c0 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88016fdffd10 R8: ffff88014d879a80 R9: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802214ee400 R13: ffff88014d879980 R14: 0000000000000000 R15: ffff88022fd96605 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #8 [ffff88016fdffd18] xlog_state_do_callback at ffffffffa04aa186 [xfs] #9 [ffff88016fdffd98] xlog_state_done_syncing at ffffffffa04aa568 [xfs] Protect xfs_sync_worker by using the s_umount semaphore at the read level to provide exclusion with unmount while work is progressing. Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-15 14:35:43 -05:00
Jeff Liu	3fe3e6b182	xfs: introduce SEEK_DATA/SEEK_HOLE support This patch adds lseek(2) SEEK_DATA/SEEK_HOLE functionality to xfs. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:21:05 -05:00
Ben Myers	e700a06c71	xfs: make xfs_extent_busy_trim not static Commit e459df5, 'xfs: move busy extent handling to it's own file' moved some code from xfs_alloc.c into xfs_extent_busy.c for convenience in userspace code merges. One of the functions moved is xfs_extent_busy_trim (formerly xfs_alloc_busy_trim) which is defined STATIC. Unfortunately this function is still used in xfs_alloc.c, and this results in an undefined symbol in xfs.ko. Make xfs_extent_busy_trim not static and add its prototype to xfs_extent_busy.h. Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com>	2012-05-14 16:21:04 -05:00
Dave Chinner	611c99468c	xfs: make XBF_MAPPED the default behaviour Rather than specifying XBF_MAPPED for almost all buffers, introduce XBF_UNMAPPED for the couple of users that use unmapped buffers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:21:03 -05:00
Dave Chinner	d4f3512b08	xfs: flush outstanding buffers on log mount failure When we fail to mount the log in xfs_mountfs(), we tear down all the infrastructure we have already allocated. However, the process of mounting the log may have progressed to the point of reading, caching and modifying buffers in memory. Hence before we can free all the infrastructure, we have to flush and remove all the buffers from memory. Problem first reported by Eric Sandeen, later a different incarnation was reported by Ben Myers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:21:02 -05:00
Dave Chinner	12bcb3f7d4	xfs: Properly exclude IO type flags from buffer flags Recent event tracing during a debugging session showed that flags that define the IO type for a buffer are leaking into the flags on the buffer incorrectly. Fix the flag exclusion mask in xfs_buf_alloc() to avoid problems that may be caused by such leakage. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:21:01 -05:00
Dave Chinner	ad1e95c54e	xfs: clean up xfs_bit.h includes With the removal of xfs_rw.h and other changes over time, xfs_bit.h is being included in many files that don't actually need it. Clean up the includes as necessary. Also move the only-used-once xfs_ialloc_find_free() static inline function out of a header file that is widely included to reduce the number of needless dependencies on xfs_bit.h. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:21:00 -05:00
Dave Chinner	2af51f3a4e	xfs: move xfs_do_force_shutdown() and kill xfs_rw.c xfs_do_force_shutdown now is the only thing in xfs_rw.c. There is no need to keep it in it's own file anymore, so move it to xfs_fsops.c next to xfs_fs_goingdown() and kill xfs_rw.c. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:59 -05:00
Dave Chinner	2a0ec1d9ed	xfs: move xfs_get_extsz_hint() and kill xfs_rw.h The only thing left in xfs_rw.h is a function prototype for an inode function. Move that to xfs_inode.h, and kill xfs_rw.h. Also move the function implementing the prototype from xfs_rw.c to xfs_inode.c so we only have one function left in xfs_rw.c Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:58 -05:00
Dave Chinner	fd50092c08	xfs: move xfs_fsb_to_db to xfs_bmap.h This is the only remaining useful function in xfs_rw.h, so move it to a header file responsible for block mapping functions that the callers already include. Soon we can get rid of xfs_rw.h. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:57 -05:00
Dave Chinner	4ecbfe637c	xfs: clean up busy extent naming Now that the busy extent tracking has been moved out of the allocation files, clean up the namespace it uses to "xfs_extent_busy" rather than a mix of "xfs_busy" and "xfs_alloc_busy". Signed-off-by: Dave Chinner<dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:56 -05:00
Dave Chinner	efc27b5259	xfs: move busy extent handling to it's own file To make it easier to handle userspace code merges, move all the busy extent handling out of the allocation code and into it's own file. The userspace code does not need the busy extent code, so this simplifies the merging of the kernel code into the userspace xfsprogs library. Because the busy extent code has been almost completely rewritten over the past couple of years, also update the copyright on this new file to include the authors that made all those changes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:55 -05:00
Dave Chinner	60a34607b2	xfs: move xfsagino_t to xfs_types.h Untangle the header file includes a bit by moving the definition of xfs_agino_t to xfs_types.h. This removes the dependency that xfs_ag.h has on xfs_inum.h, meaning we don't need to include xfs_inum.h everywhere we include xfs_ag.h. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:54 -05:00
Dave Chinner	bc4010ecb8	xfs: use iolock on XFS_IOC_ALLOCSP calls fsstress has a particular effective way of stopping debug XFS kernels. We keep seeing assert failures due finding delayed allocation extents where there should be none. This shows up when extracting extent maps and we are holding all the locks we should be to prevent races, so this really makes no sense to see these errors. After checking that fsstress does not use mmap, it occurred to me that fsstress uses something that no sane application uses - the XFS_IOC_ALLOCSP ioctl interfaces for preallocation. These interfaces do allocation of blocks beyond EOF without using preallocation, and then call setattr to extend and zero the allocated blocks. THe problem here is this is a buffered write, and hence the allocation is a delayed allocation. Unlike the buffered IO path, the allocation and zeroing are not serialised using the IOLOCK. Hence the ALLOCSP operation can race with operations holding the iolock to prevent buffered IO operations from occurring. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:53 -05:00
Dave Chinner	aa5c158ec9	xfs: kill XBF_DONTBLOCK Just about all callers of xfs_buf_read() and xfs_buf_get() use XBF_DONTBLOCK. This is used to make memory allocation use GFP_NOFS rather than GFP_KERNEL to avoid recursion through memory reclaim back into the filesystem. All the blocking get calls in growfs occur inside a transaction, even though they are no part of the transaction, so all allocation will be GFP_NOFS due to the task flag PF_TRANS being set. The blocking read calls occur during log recovery, so they will probably be unaffected by converting to GFP_NOFS allocations. Hence make XBF_DONTBLOCK behaviour always occur for buffers and kill the flag. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:52 -05:00
Dave Chinner	7ca790a507	xfs: kill xfs_read_buf() xfs_read_buf() is effectively the same as xfs_trans_read_buf() when called outside a transaction context. The error handling is slightly different in that xfs_read_buf stales the errored buffer it gets back, but there is probably good reason for xfs_trans_read_buf() for doing this. Hence update xfs_trans_read_buf() to the same error handling as xfs_read_buf(), and convert all the callers of xfs_read_buf() to use the former function. We can then remove xfs_read_buf(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:51 -05:00
Dave Chinner	a8acad7073	xfs: kill XBF_LOCK Buffers are always returned locked from the lookup routines. Hence we don't need to tell the lookup routines to return locked buffers, on to try and lock them. Remove XBF_LOCK from all the callers and from internal buffer cache usage. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:50 -05:00
Dave Chinner	795cac72e9	xfs: kill xfs_buf_btoc xfs_buf_btoc and friends are simple macros that do basic block to page index conversion and vice versa. These aren't widely used, and we use open coded masking and shifting everywhere else. Hence remove the macros and open code the work they do. Also, use of PAGE_CACHE_{SIZE\|SHIFT\|MASK} for these macros is now incorrect - we are using pages directly and not the page cache, so use PAGE_{SIZE\|MASK\|SHIFT} instead. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:49 -05:00
Dave Chinner	aa0e8833b0	xfs: use blocks for storing the desired IO size Now that we pass block counts everywhere, and index buffers by block number and length in units of blocks, convert the desired IO size into block counts rather than bytes. Convert the code to use block counts, and those that need byte counts get converted at the time of use. Rename the b_desired_count variable to something closer to it's purpose - b_io_length - as it is only used to specify the length of an IO for a subset of the buffer. The only time this is used is for log IO - both writing iclogs and during log recovery. In all other cases, the b_io_length matches b_length, and hence a lot of code confuses the two. e.g. the buf item code uses the io count exclusively when it should be using the buffer length. Fix these apprpriately as they are found. Also, remove the XFS_BUF_{SET_}COUNT() macros that are just wrappers around the desired IO length. They only serve to make the code shouty loud, don't actually add any real value, and are often used incorrectly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:48 -05:00
Dave Chinner	4e94b71b70	xfs: use blocks for counting length of buffers Now that we pass block counts everywhere, and index buffers by block number, track the length of the buffer in units of blocks rather than bytes. Convert the code to use block counts, and those that need byte counts get converted at the time of use. Also, remove the XFS_BUF_{SET_}SIZE() macros that are just wrappers around the buffer length. They only serve to make the code shouty loud and don't actually add any real value. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:47 -05:00
Dave Chinner	de1cbee462	xfs: kill b_file_offset Seeing as we pass block numbers around everywhere in the buffer cache now, it makes no sense to index everything by byte offset. Replace all the byte offset indexing with block number based indexing, and replace all uses of the byte offset with direct conversion from the block index. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:46 -05:00
Dave Chinner	e70b73f84f	xfs: clean up buffer get/read call API The xfs_buf_get/read API is not consistent in the units it uses, and does not use appropriate or consistent units/types for the variables. Convert the API to use disk addresses and block counts for all buffer get and read calls. Use consistent naming for all the functions and their declarations, and convert the internal functions to use disk addresses and block counts to avoid need to convert them from one type to another and back again. Fix all the callers to use disk addresses and block counts. In many cases, this removes an additional conversion from the function call as the callers already have a block count. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:45 -05:00
Dave Chinner	bf813cdddf	xfs: use kmem_zone_zalloc for buffers To replace the alloc/memset pair. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:44 -05:00
Dave Chinner	ead360c50d	xfs: fix incorrect b_offset initialisation Because we no longer use the page cache for buffering, there is no direct block number to page offset relationship anymore. xfs_buf_get_pages is still setting up b_offset as if there was some relationship, and that is leading to incorrectly setting up uncached buffers that don't overwrite b_offset once they've had pages allocated. For cached buffers, the first block of the buffer is always at offset zero into the allocated memory. This is true for sub-page sized buffers, as well as for multiple-page buffers. For uncached buffers, b_offset is only non-zero when we are associating specific memory to the buffers, and that is set correctly by the code setting up the buffer. Hence remove the setting of b_offset in xfs_buf_get_pages, because it is now always the wrong thing to do. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:43 -05:00
Dave Chinner	0e95f19ad9	xfs: check for buffer errors before waiting If we call xfs_buf_iowait() on a buffer that failed dispatch due to an IO error, it will wait forever for an Io that does not exist. This is hndled in xfs_buf_read, but there is other code that calls xfs_buf_iowait directly that doesn't. Rather than make the call sites have to handle checking for dispatch errors and then checking for completion errors, make xfs_buf_iowait() check for dispatch errors on the buffer before waiting. This means we handle both dispatch and completion errors with one set of error handling at the caller sites. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:42 -05:00
Dave Chinner	fe2429b096	xfs: fix buffer lookup race on allocation failure When memory allocation fails to add the page array or tht epages to a buffer during xfs_buf_get(), the buffer is left in the cache in a partially initialised state. There is enough state left for the next lookup on that buffer to find the buffer, and for the buffer to then be used without finishing the initialisation. As a result, when an attempt to do IO on the buffer occurs, it fails with EIO because there are no pages attached to the buffer. We cannot remove the buffer from the cache immediately and free it, because there may already be a racing lookup that is blocked on the buffer lock. Hence the moment we unlock the buffer to then free it, the other user is woken and we have a use-after-free situation. To avoid this race condition altogether, allocate the pages for the buffer before we insert it into the cache. This then means that we don't have an allocation failure case to deal after the buffer is already present in the cache, and hence avoid the problem altogether. In most cases we won't have racing inserts for the same buffer, and so won't increase the memory pressure allocation before insertion may entail. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:41 -05:00
Dave Chinner	aff3a9edb7	xfs: Use preallocation for inodes with extsz hints xfstest 229 exposes a problem with buffered IO, delayed allocation and extent size hints. That is when we do delayed allocation during buffered IO, we reserve space for the extent size hint alignment and allocate the physical space to align the extent, but we do not zero the regions of the extent that aren't written by the write(2) syscall. The result is that we expose stale data in unwritten regions of the extent size hints. There are two ways to fix this. The first is to detect that we are doing unaligned writes, check if there is already a mapping or data over the extent size hint range, and if not zero the page cache first before then doing the real write. This can be very expensive for large extent size hints, especially if the subsequent writes fill then entire extent size before the data is written to disk. The second, and simpler way, is simply to turn off delayed allocation when the extent size hint is set and use preallocation instead. This results in unwritten extents being laid down on disk and so only the written portions will be converted. This matches the behaviour for direct IO, and will also work for the real time device. The disadvantage of this approach is that for small extent size hints we can get file fragmentation, but in general extent size hints are fairly large (e.g. stripe width sized) so this isn't a big deal. Implement the second approach as it is simple and effective. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:40 -05:00
Dave Chinner	3ed9116e8a	xfs: limit specualtive delalloc to maxioffset Speculative delayed allocation beyond EOF near the maximum supported file offset can result in creating delalloc extents beyond mp->m_maxioffset (8EB). These can never be trimmed during xfs_free_eof_blocks() because they are beyond mp->m_maxioffset, and that results in assert failures in xfs_fs_destroy_inode() due to delalloc blocks still being present. xfstests 071 exposes this problem. Limit speculative delalloc to mp->m_maxioffset to avoid this problem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:39 -05:00
Dave Chinner	58e2077064	xfs: don't assert on delalloc regions beyond EOF When we are doing speculative delayed allocation beyond EOF, conversion of the region allocated beyond EOF is dependent on the largest free space extent available. If the largest free extent is smaller than the delalloc range, then after allocation we leave a delalloc extent that starts beyond EOF. This extent cannot ever be converted by flushing data, and so will remain there until either the EOF moves into the extent or it is truncated away. Hence if xfs_getbmap() runs on such an inode and is asked to return extents beyond EOF, it will assert fail on this extent even though there is nothing xfs_getbmap() can do to convert it to a real extent. Hence we should simply report these delalloc extents rather than assert that there should be none. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:38 -05:00
Dave Chinner	81158e0cec	xfs: prevent needless mount warning causing test failures Often mounting small filesystem with small logs will emit a warning such as: XFS (vdb): Invalid block length (0x2000) for buffer during log recovery. This causes tests to randomly fail because this output causes the clean filesystem checks on test completion to think the filesystem is inconsistent. The cause of the error is simply that log recovery is asking for a buffer size that is larger than the log when zeroing the tail. This is because the buffer size is rounded up, and if the right head and tail conditions exist then the buffer size can be larger than the log. Limit the variable size xlog_get_bp() callers to requesting buffers smaller than the log. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:37 -05:00
Dave Chinner	d3bc815afb	xfs: punch new delalloc blocks out of failed writes inside EOF. When a partial write inside EOF fails, it can leave delayed allocation blocks lying around because they don't get punched back out. This leads to assert failures like: XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) \|\| ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 847 when evicting inodes from the cache. This can be trivially triggered by xfstests 083, which takes between 5 and 15 executions on a 512 byte block size filesystem to trip over this. Debugging shows a failed write due to ENOSPC calling xfs_vm_write_failed such as: [ 5012.329024] ino 0xa0026: vwf to 0x17000, sze 0x1c85ae and no action is taken on it. This leaves behind a delayed allocation extent that has no page covering it and no data in it: [ 5015.867162] ino 0xa0026: blks: 0x83 delay blocks 0x1, size 0x2538c0 [ 5015.868293] ext 0: off 0x4a, fsb 0x50306, len 0x1 [ 5015.869095] ext 1: off 0x4b, fsb 0x7899, len 0x6b [ 5015.869900] ext 2: off 0xb6, fsb 0xffffffffe0008, len 0x1 ^^^^^^^^^^^^^^^ [ 5015.871027] ext 3: off 0x36e, fsb 0x7a27, len 0xd [ 5015.872206] ext 4: off 0x4cf, fsb 0x7a1d, len 0xa So the delayed allocation extent is one block long at offset 0x16c00. Tracing shows that a bigger write: xfs_file_buffered_write: size 0x1c85ae offset 0x959d count 0x1ca3f ioflags allocates the block, and then fails with ENOSPC trying to allocate the last block on the page, leading to a failed write with stale delalloc blocks on it. Because we've had an ENOSPC when trying to allocate 0x16e00, it means that we are never goinge to call ->write_end on the page and so the allocated new buffer will not get marked dirty or have the buffer_new state cleared. In other works, what the above write is supposed to end up with is this mapping for the page: +------+------+------+------+------+------+------+------+ UMA UMA UMA UMA UMA UMA UND FAIL where: U = uptodate M = mapped N = new A = allocated D = delalloc FAIL = block we ENOSPC'd on. and the key point being the buffer_new() state for the newly allocated delayed allocation block. Except it doesn't - we're not marking buffers new correctly. That buffer_new() problem goes back to the xfs_iomap removal days, where xfs_iomap() used to return a "new" status for any map with newly allocated blocks, so that __xfs_get_blocks() could call set_buffer_new() on it. We still have the "new" variable and the check for it in the set_buffer_new() logic - except we never set it now! Hence that newly allocated delalloc block doesn't have the new flag set on it, so when the write fails we cannot tell which blocks we are supposed to punch out. WHy do we need the buffer_new flag? Well, that's because we can have this case: +------+------+------+------+------+------+------+------+ UMD UMD UMD UMD UMD UMD UND FAIL where all the UMD buffers contain valid data from a previously successful write() system call. We only want to punch the UND buffer because that's the only one that we added in this write and it was only this write that failed. That implies that even the old buffer_new() logic was wrong - because it would result in all those UMD buffers on the page having set_buffer_new() called on them even though they aren't new. Hence we shoul donly be calling set_buffer_new() for delalloc buffers that were allocated (i.e. were a hole before xfs_iomap_write_delay() was called). So, fix this set_buffer_new logic according to how we need it to work for handling failed writes correctly. Also, restore the new buffer logic handling for blocks allocated via xfs_iomap_write_direct(), because it should still set the buffer_new flag appropriately for newly allocated blocks, too. SO, now we have the buffer_new() being set appropriately in __xfs_get_blocks(), we can detect the exact delalloc ranges that we allocated in a failed write, and hence can now do a walk of the buffers on a page to find them. Except, it's not that easy. When block_write_begin() fails, it unlocks and releases the page that we just had an error on, so we can't use that page to handle errors anymore. We have to get access to the page while it is still locked to walk the buffers. Hence we have to open code block_write_begin() in xfs_vm_write_begin() to be able to insert xfs_vm_write_failed() is the right place. With that, we can pass the page and write range to xfs_vm_write_failed() and walk the buffers on the page, looking for delalloc buffers that are either new or beyond EOF and punch them out. Handling buffers beyond EOF ensures we still handle the existing case that xfs_vm_write_failed() handles. Of special note is the truncate_pagecache() handling - that only should be done for pages outside EOF - pages within EOF can still contain valid, dirty data so we must not punch them out of the cache. That just leaves the xfs_vm_write_end() failure handling. The only failure case here is that we didn't copy the entire range, and generic_write_end() handles that by zeroing the region of the page that wasn't copied, we don't have to punch out blocks within the file because they are guaranteed to contain zeros. Hence we only have to handle the existing "beyond EOF" case and don't need access to the buffers on the page. Hence it remains largely unchanged. Note that xfs_getbmap() can still trip over delalloc blocks beyond EOF that are left there by speculative delayed allocation. Hence this bug fix does not solve all known issues with bmap vs delalloc, but it does fix all the the known accidental occurances of the problem. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:36 -05:00
Dave Chinner	6ffc4db5de	xfs: page type check in writeback only checks last buffer xfs_is_delayed_page() checks to see if a page has buffers matching the given IO type passed in. It does so by walking the buffer heads on the page and checking if the state flags match the IO type. However, the "acceptable" variable that is calculated is overwritten every time a new buffer is checked. Hence if the first buffer on the page is of the right type, this state is lost if the second buffer is not of the correct type. This means that xfs_aops_discard_page() may not discard delalloc regions when it is supposed to, and xfs_convert_page() may not cluster IO as efficiently as possible. This problem only occurs on filesystems with a block size smaller than page size. Also, rename xfs_is_delayed_page() to xfs_check_page_type() to better describe what it is doing - it is not delalloc specific anymore. The problem was first noticed by Peter Watkins. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:35 -05:00
Dave Chinner	4c2d542f2e	xfs: Do background CIL flushes via a workqueue Doing background CIL flushes adds significant latency to whatever async transaction that triggers it. To avoid blocking async transactions on things like waiting for log buffer IO to complete, move the CIL push off into a workqueue. By moving the push work into a workqueue, we remove all the latency that the commit adds from the foreground transaction commit path. This also means that single threaded workloads won't do the CIL push procssing, leaving them more CPU to do more async transactions. To do this, we need to keep track of the sequence number we have pushed work for. This avoids having many transaction commits attempting to schedule work for the same sequence, and ensures that we only ever have one push (background or forced) in progress at a time. It also means that we don't need to take the CIL lock in write mode to check for potential background push races, which reduces lock contention. To avoid potential issues with "smart" IO schedulers, don't use the workqueue for log force triggered flushes. Instead, do them directly so that the log IO is done directly by the process issuing the log force and so doesn't get stuck on IO elevator queue idling incorrectly delaying the log IO from the workqueue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:34 -05:00
Dave Chinner	04913fdd91	xfs: pass shutdown method into xfs_trans_ail_delete_bulk xfs_trans_ail_delete_bulk() can be called from different contexts so if the item is not in the AIL we need different shutdown for each context. Pass in the shutdown method needed so the correct action can be taken. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:33 -05:00
Christoph Hellwig	a8569171ba	xfs: remove some obsolete comments in xfs_trans_ail.c Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:32 -05:00
Christoph Hellwig	43ff2122e6	xfs: on-stack delayed write buffer lists Queue delwri buffers on a local on-stack list instead of a per-buftarg one, and write back the buffers per-process instead of by waking up xfsbufd. This is now easily doable given that we have very few places left that write delwri buffers: - log recovery: Only done at mount time, and already forcing out the buffers synchronously using xfs_flush_buftarg - quotacheck: Same story. - dquot reclaim: Writes out dirty dquots on the LRU under memory pressure. We might want to look into doing more of this via xfsaild, but it's already more optimal than the synchronous inode reclaim that writes each buffer synchronously. - xfsaild: This is the main beneficiary of the change. By keeping a local list of buffers to write we reduce latency of writing out buffers, and more importably we can remove all the delwri list promotions which were hitting the buffer cache hard under sustained metadata loads. The implementation is very straight forward - xfs_buf_delwri_queue now gets a new list_head pointer that it adds the delwri buffers to, and all callers need to eventually submit the list using xfs_buf_delwi_submit or xfs_buf_delwi_submit_nowait. Buffers that already are on a delwri list are skipped in xfs_buf_delwri_queue, assuming they already are on another delwri list. The biggest change to pass down the buffer list was done to the AIL pushing. Now that we operate on buffers the trylock, push and pushbuf log item methods are merged into a single push routine, which tries to lock the item, and if possible add the buffer that needs writeback to the buffer list. This leads to much simpler code than the previous split but requires the individual IOP_PUSH instances to unlock and reacquire the AIL around calls to blocking routines. Given that xfsailds now also handle writing out buffers, the conditions for log forcing and the sleep times needed some small changes. The most important one is that we consider an AIL busy as long we still have buffers to push, and the other one is that we do increment the pushed LSN for buffers that are under flushing at this moment, but still count them towards the stuck items for restart purposes. Without this we could hammer on stuck items without ever forcing the log and not make progress under heavy random delete workloads on fast flash storage devices. [ Dave Chinner: - rebase on previous patches. - improved comments for XBF_DELWRI_Q handling - fix XBF_ASYNC handling in queue submission (test 106 failure) - rename delwri submit function buffer list parameters for clarity - xfs_efd_item_push() should return XFS_ITEM_PINNED ] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:31 -05:00
Christoph Hellwig	960c60af8b	xfs: do not add buffers to the delwri queue until pushed Instead of adding buffers to the delwri list as soon as they are logged, even if they can't be written until commited because they are pinned defer adding them to the delwri list until xfsaild pushes them. This makes the code more similar to other log items and prepares for writing buffers directly from xfsaild. The complication here is that we need to fail buffers that were added but not logged yet in xfs_buf_item_unpin, borrowing code from xfs_bioerror. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:30 -05:00
Christoph Hellwig	fe7257fd4b	xfs: do not write the buffer from xfs_qm_dqflush Instead of writing the buffer directly from inside xfs_qm_dqflush return it to the caller and let the caller decide what to do with the buffer. Also remove the pincount check in xfs_qm_dqflush that all non-blocking callers already implement and the now unused flags parameter and the XFS_DQ_IS_DIRTY check that all callers already perform. [ Dave Chinner: fixed build error cause by missing '{'. ] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:29 -05:00
Christoph Hellwig	4c46819a80	xfs: do not write the buffer from xfs_iflush Instead of writing the buffer directly from inside xfs_iflush return it to the caller and let the caller decide what to do with the buffer. Also remove the pincount check in xfs_iflush that all non-blocking callers already implement and the now unused flags parameter. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:28 -05:00
Christoph Hellwig	8a48088f64	xfs: don't flush inodes from background inode reclaim We already flush dirty inodes throug the AIL regularly, there is no reason to have second thread compete with it and disturb the I/O pattern. We still do write inodes when doing a synchronous reclaim from the shrinker or during unmount for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:28 -05:00
Christoph Hellwig	211e4d434b	xfs: implement freezing by emptying the AIL Now that we write back all metadata either synchronously or through the AIL we can simply implement metadata freezing in terms of emptying the AIL. The implementation for this is fairly simply and straight-forward: A new routine is added that asks the xfsaild to push the AIL to the end and waits for it to complete and send a wakeup. The routine will then loop if the AIL is not actually empty, and continue to do so until the AIL is compeltely empty. We keep an inode reclaim pass in the freeze process to avoid having memory pressure have to reclaim inodes that require dirtying the filesystem to be reclaimed after the freeze has completed. This means we can also treat unmount in the exact same way as freeze. As an upside we can now remove the radix tree based inode writeback and xfs_unmountfs_writesb. [ Dave Chinner: - Cleaned up commit message. - Added inode reclaim passes back into freeze. - Cleaned up wakeup mechanism to avoid the use of a new sleep counter variable. ] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:27 -05:00
Christoph Hellwig	1c30462542	xfs: allow assigning the tail lsn with the AIL lock held Provide a variant of xlog_assign_tail_lsn that has the AIL lock already held. By doing so we do an additional atomic_read + atomic_set under the lock, which comes down to two instructions. Switch xfs_trans_ail_update_bulk and xfs_trans_ail_delete_bulk to the new version to reduce the number of lock roundtrips, and prepare for a new addition that would require a third lock roundtrip in xfs_trans_ail_delete_bulk. This addition is also the reason for slightly rearranging the conditionals and relying on xfs_log_space_wake for checking that the filesystem has been shut down internally. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:26 -05:00
Christoph Hellwig	32ce90a4b7	xfs: remove log item from AIL in xfs_iflush after a shutdown If a filesystem has been forced shutdown we are never going to write inodes to disk, which means the inode items will stay in the AIL until we free the inode. Currently that is not a problem, but a pending change requires us to empty the AIL before shutting down the filesystem. In that case leaving the inode in the AIL is lethal. Make sure to remove the log item from the AIL to allow emptying the AIL on shutdown filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:25 -05:00
Christoph Hellwig	dea9609527	xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown If a filesystem has been forced shutdown we are never going to write dquots to disk, which means the dquot items will stay in the AIL forever. Currently that is not a problem, but a pending chance requires us to empty the AIL before shutting down the filesystem, in which case this behaviour is lethal. Make sure to remove the log item from the AIL to allow emptying the AIL on shutdown filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:24 -05:00
Shaohua Li	7582df516c	xfs: using GFP_NOFS for blkdev_issue_flush Issuing a block device flush request in transaction context using GFP_KERNEL directly can cause deadlocks due to memory reclaim recursion. Use GFP_NOFS to avoid recursion from reclaim context. Signed-off-by: Shaohua Li <shli@fusionio.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:23 -05:00
Dave Chinner	01c84d2dc1	xfs: punch all delalloc blocks beyond EOF on write failure. I've been seeing regular ASSERT failures in xfstests when running fsstress based tests over the past month. xfs_getbmap() has been failing this test: XFS: Assertion failed: ((iflags & BMV_IF_DELALLOC) != 0) \|\| (map[i].br_startblock != DELAYSTARTBLOCK), file: fs/xfs/xfs_bmap.c, line: 5650 where it is encountering a delayed allocation extent after writing all the dirty data to disk and then walking the extent map atomically by holding the XFS_IOLOCK_SHARED to prevent new delayed allocation extents from being created. Test 083 on a 512 byte block size filesystem was used to reproduce the problem, because it only had a 5s run timeand would usually fail every 3-4 runs. This test is exercising ENOSPC behaviour by running fsstress on a nearly full filesystem. The following trace extract shows the final few events on the inode that tripped the assert: xfs_ilock: flags ILOCK_EXCL caller xfs_setfilesize xfs_setfilesize: isize 0x180000 disize 0x12d400 offset 0x17e200 count 7680 file size updated to 0x180000 by IO completion xfs_ilock: flags ILOCK_EXCL caller xfs_iomap_write_delay xfs_iext_insert: state idx 3 offset 3072 block 4503599627239432 count 1 flag 0 caller xfs_bmap_add_extent_hole_delay xfs_get_blocks_alloc: size 0x180000 offset 0x180000 count 512 type startoff 0xc00 startblock -1 blockcount 0x1 xfs_ilock: flags ILOCK_EXCL caller __xfs_get_blocks delalloc write, adding a single block at offset 0x180000 xfs_delalloc_enospc: isize 0x180000 disize 0x180000 offset 0x180200 count 512 ENOSPC trying to allocate a dellalloc block at offset 0x180200 xfs_ilock: flags ILOCK_EXCL caller xfs_iomap_write_delay xfs_get_blocks_alloc: size 0x180000 offset 0x180200 count 512 type startoff 0xc00 startblock -1 blockcount 0x2 And succeeding on retry after flushing dirty inodes. xfs_ilock: flags ILOCK_EXCL caller __xfs_get_blocks xfs_delalloc_enospc: isize 0x180000 disize 0x180000 offset 0x180400 count 512 ENOSPC trying to allocate a dellalloc block at offset 0x180400 xfs_ilock: flags ILOCK_EXCL caller xfs_iomap_write_delay xfs_delalloc_enospc: isize 0x180000 disize 0x180000 offset 0x180400 count 512 And failing the retry, giving a real ENOSPC error. xfs_ilock: flags ILOCK_EXCL caller xfs_vm_write_failed ^^^^^^^^^^^^^^^^^^^ The smoking gun - the write being failed and cleaning up delalloc blocks beyond EOF allocated by the failed write. xfs_getattr: xfs_ilock: flags IOLOCK_SHARED caller xfs_getbmap xfs_ilock: flags ILOCK_SHARED caller xfs_ilock_map_shared And that's where we died almost immediately afterwards. xfs_bmapi_read() found delalloc extent beyond current file in memory file size. Some debug I added to xfs_getbmap() showed the state just before the assert failure: ino 0x80e48: off 0xc00, fsb 0xffffffffffffffff, len 0x1, size 0x180000 start_fsb 0x106, end_fsb 0x638 ino flags 0x2 nex 0xd bmvcnt 0x555, len 0x3c58a6f23c0bf1, start 0xc00 ext 0: off 0x1fc, fsb 0x24782, len 0x254 ext 1: off 0x450, fsb 0x40851, len 0x30 ext 2: off 0x480, fsb 0xd99, len 0x1b8 ext 3: off 0x92f, fsb 0x4099a, len 0x3b ext 4: off 0x96d, fsb 0x41844, len 0x98 ext 5: off 0xbf1, fsb 0x408ab, len 0xf which shows that we found a single delalloc block beyond EOF (first line of output) when we were returning the map for a length somewhere around 10^16 bytes long (second line), and the on-disk extents showed they didn't go past EOF (last lines). Further debug added to xfs_vm_write_failed() showed this happened when punching out delalloc blocks beyond the end of the file after the failed write: [ 132.606693] ino 0x80e48: vwf to 0x181000, sze 0x180000 [ 132.609573] start_fsb 0xc01, end_fsb 0xc08 It punched the range 0xc01 -> 0xc08, but the range we really need to punch is 0xc00 -> 0xc07 (8 blocks from 0xc00) as this testing was run on a 512 byte block size filesystem (8 blocks per page). the punch from is 0xc00. So end_fsb is correct, but start_fsb is wrong as we punch from start_fsb for (end_fsb - start_fsb) blocks. Hence we are not punching the delalloc block beyond EOF in the case. The fix is simple - it's a silly off-by-one mistake in calculating the range. It's especially silly because the macro used to calculate the start_fsb already takes into account the case where the inode size is an exact multiple of the filesystem block size... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:22 -05:00
Dave Chinner	507630b29f	xfs: use shared ilock mode for direct IO writes by default For the direct IO write path, we only really need the ilock to be taken in exclusive mode during IO submission if we need to do extent allocation instead of all the time. Change the block mapping code to take the ilock in shared mode for the initial block mapping, and only retake it exclusively when we actually have to perform extent allocations. We were already dropping the ilock for the transaction allocation, so this doesn't introduce new race windows. Based on an earlier patch from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:21 -05:00
Christoph Hellwig	193aec1050	xfs: push the ilock into xfs_zero_eof Instead of calling xfs_zero_eof with the ilock held only take it internally for the minimall required critical section around xfs_bmapi_read. This also requires changing the calling convention for xfs_zero_last_block slightly. The actual zeroing operation is still serialized by the iolock, which must be taken exclusively over the call to xfs_zero_eof. We could in fact use a shared lock for the xfs_bmapi_read calls as long as the extent list has been read in, but given that we already hold the iolock exclusively there is little reason to micro optimize this further. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:20 -05:00
Christoph Hellwig	f38996f576	xfs: reduce ilock hold times in xfs_setattr_size We do not need the ilock for most checks done in the beginning of xfs_setattr_size. Replace the long critical section before starting the transaction with a smaller one around xfs_zero_eof and an optional one inside xfs_qm_dqattach that isn't entered unless using quotas. While this isn't a big optimization for xfs_setattr_size itself it will allow pushing the ilock into xfs_zero_eof itself later. Signed-off-by: Christoph Hellwig <hch@lst.de>	2012-05-14 16:20:18 -05:00
Christoph Hellwig	467f78992a	xfs: reduce ilock hold times in xfs_file_aio_write_checks We do not need the ilock for generic_write_checks and the i_size_read, which are protected by i_mutex and/or iolock, so reduce the ilock critical section to just the call to xfs_zero_eof. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:17 -05:00
Christoph Hellwig	b4d05e3019	xfs: avoid taking the ilock unnessecarily in xfs_qm_dqattach Check if we actually need to attach a dquot before taking the ilock in xfs_qm_dqattach. This avoid superflous lock roundtrips for the common cases of quota support compiled in but not activated on a filesystem and an inode that already has the dquots attached. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-05-14 16:20:15 -05:00
Jan Kara	dbd5768f87	vfs: Rename end_writeback() to clear_inode() After we moved inode_sync_wait() from end_writeback() it doesn't make sense to call the function end_writeback() anymore. Rename it to clear_inode() which well says what the function really does - set I_CLEAR flag. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>	2012-05-06 13:43:41 +08:00
Dave Chinner	8a00ebe4cf	xfs: Ensure inode reclaim can run during quotacheck Because the mount process can run a quotacheck and consume lots of inodes, we need to be able to run periodic inode reclaim during the mount process. This will prevent running the system out of memory during quota checks. This essentially reverts `2bcf6e97`, but that is safe to do now that the quota sync code that was causing problems during long quotacheck executions is now gone. The reclaim work is currently protected from running during the unmount process by a check against MS_ACTIVE. Unfortunately, this also means that the reclaim work cannot run during mount. The unmount process should stop the reclaim cleanly before freeing anything that the reclaim work depends on, so there is no need to have this guard in place. Also, the inode reclaim work is demand driven, so there is no need to start it immediately during mount. It will be started the moment an inode is queued for reclaim, so qutoacheck will trigger it just fine. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-04-17 11:19:47 -05:00
Jie Liu	da5bf95e3c	xfs: don't fill statvfs with project quota for a directory if it was not enabled. Check if the project quota is running or not before performing xfs_qm_statvfs(), just return if not. Otherwise the ASSERT XFS_IS_QUOTA_RUNNING in xfs_qm_dqget will be popped. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-04-16 16:32:20 -05:00
Linus Torvalds	0195c00244	Disintegrate and delete asm/system.h -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIVAwUAT3NKzROxKuMESys7AQKElw/+JyDxJSlj+g+nymkx8IVVuU8CsEwNLgRk 8KEnRfLhGtkXFLSJYWO6jzGo16F8Uqli1PdMFte/wagSv0285/HZaKlkkBVHdJ/m u40oSjgT013bBh6MQ0Oaf8pFezFUiQB5zPOA9QGaLVGDLXCmgqUgd7exaD5wRIwB ZmyItjZeAVnDfk1R+ZiNYytHAi8A5wSB+eFDCIQYgyulA1Igd1UnRtx+dRKbvc/m rWQ6KWbZHIdvP1ksd8wHHkrlUD2pEeJ8glJLsZUhMm/5oMf/8RmOCvmo8rvE/qwl eDQ1h4cGYlfjobxXZMHqAN9m7Jg2bI946HZjdb7/7oCeO6VW3FwPZ/Ic75p+wp45 HXJTItufERYk6QxShiOKvA+QexnYwY0IT5oRP4DrhdVB/X9cl2MoaZHC+RbYLQy+ /5VNZKi38iK4F9AbFamS7kd0i5QszA/ZzEzKZ6VMuOp3W/fagpn4ZJT1LIA3m4A9 Q0cj24mqeyCfjysu0TMbPtaN+Yjeu1o1OFRvM8XffbZsp5bNzuTDEvviJ2NXw4vK 4qUHulhYSEWcu9YgAZXvEWDEM78FXCkg2v/CrZXH5tyc95kUkMPcgG+QZBB5wElR FaOKpiC/BuNIGEf02IZQ4nfDxE90QwnDeoYeV+FvNj9UEOopJ5z5bMPoTHxm4cCD NypQthI85pc= =G9mT -----END PGP SIGNATURE----- Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system Pull "Disintegrate and delete asm/system.h" from David Howells: "Here are a bunch of patches to disintegrate asm/system.h into a set of separate bits to relieve the problem of circular inclusion dependencies. I've built all the working defconfigs from all the arches that I can and made sure that they don't break. The reason for these patches is that I recently encountered a circular dependency problem that came about when I produced some patches to optimise get_order() by rewriting it to use ilog2(). This uses bitops - and on the SH arch asm/bitops.h drags in asm-generic/get_order.h by a circuituous route involving asm/system.h. The main difficulty seems to be asm/system.h. It holds a number of low level bits with no/few dependencies that are commonly used (eg. memory barriers) and a number of bits with more dependencies that aren't used in many places (eg. switch_to()). These patches break asm/system.h up into the following core pieces: (1) asm/barrier.h Move memory barriers here. This already done for MIPS and Alpha. (2) asm/switch_to.h Move switch_to() and related stuff here. (3) asm/exec.h Move arch_align_stack() here. Other process execution related bits could perhaps go here from asm/processor.h. (4) asm/cmpxchg.h Move xchg() and cmpxchg() here as they're full word atomic ops and frequently used by atomic_xchg() and atomic_cmpxchg(). (5) asm/bug.h Move die() and related bits. (6) asm/auxvec.h Move AT_VECTOR_SIZE_ARCH here. Other arch headers are created as needed on a per-arch basis." Fixed up some conflicts from other header file cleanups and moving code around that has happened in the meantime, so David's testing is somewhat weakened by that. We'll find out anything that got broken and fix it.. * tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits) Delete all instances of asm/system.h Remove all #inclusions of asm/system.h Add #includes needed to permit the removal of asm/system.h Move all declarations of free_initmem() to linux/mm.h Disintegrate asm/system.h for OpenRISC Split arch_align_stack() out from asm-generic/system.h Split the switch_to() wrapper out of asm-generic/system.h Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h Create asm-generic/barrier.h Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h Disintegrate asm/system.h for Xtensa Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt] Disintegrate asm/system.h for Tile Disintegrate asm/system.h for Sparc Disintegrate asm/system.h for SH Disintegrate asm/system.h for Score Disintegrate asm/system.h for S390 Disintegrate asm/system.h for PowerPC Disintegrate asm/system.h for PA-RISC Disintegrate asm/system.h for MN10300 ...	2012-03-28 15:58:21 -07:00
Linus Torvalds	f21ce8f844	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs Pull XFS update (part 2) from Ben Myers: "Fixes for tracing of xfs_name strings, flag handling in open_by_handle, a log space hang with freeze/unfreeze, fstrim offset calculations, a section mismatch with xfs_qm_exit, an oops in xlog_recover_process_iunlinks, and a deadlock in xfs_rtfree_extent. There are also additional trace points for attributes, and the addition of a workqueue for allocation to work around kernel stack size limitations." * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: add lots of attribute trace points xfs: Fix oops on IO error during xlog_recover_process_iunlinks() xfs: fix fstrim offset calculations xfs: Account log unmount transaction correctly xfs: don't cache inodes read through bulkstat xfs: trace xfs_name strings correctly xfs: introduce an allocation workqueue xfs: Fix open flag handling in open_by_handle code xfs: fix deadlock in xfs_rtfree_extent fs: xfs: fix section mismatch in linux-next	2012-03-28 15:23:52 -07:00
David Howells	9ffc93f203	Remove all #inclusions of asm/system.h Remove all #inclusions of asm/system.h preparatory to splitting and killing it. Performed with the following command: perl -p -i -e 's!^#\sinclude\s<asm/system[.]h>.\n!!' `grep -Irl '^#\sinclude\s<asm/system[.]h>' ` Signed-off-by: David Howells <dhowells@redhat.com>	2012-03-28 18:30:03 +01:00
Dave Chinner	5a5881cdee	xfs: add lots of attribute trace points Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-27 17:18:21 -05:00
Jan Kara	d97d32edcd	xfs: Fix oops on IO error during xlog_recover_process_iunlinks() When an IO error happens during inode deletion run from xlog_recover_process_iunlinks() filesystem gets shutdown. Thus any subsequent attempt to read buffers fails. Code in xlog_recover_process_iunlinks() does not count with the fact that read of a buffer which was read a while ago can really fail which results in the oops on agi = XFS_BUF_TO_AGI(agibp); Fix the problem by cleaning up the buffer handling in xlog_recover_process_iunlinks() as suggested by Dave Chinner. We release buffer lock but keep buffer reference to AG buffer. That is enough for buffer to stay pinned in memory and we don't have to call xfs_read_agi() all the time. CC: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-27 16:34:10 -05:00
Dave Chinner	a66d636385	xfs: fix fstrim offset calculations xfs_ioc_fstrim() doesn't treat the incoming offset and length correctly. It treats them as a filesystem block address, rather than a disk address. This is wrong because the range passed in is a linear representation, while the filesystem block address notation is a sparse representation. Hence we cannot convert the range direct to filesystem block units and then use that for calculating the range to trim. While this sounds dangerous, the problem is limited to calculating what AGs need to be trimmed. The code that calcuates the actual ranges to trim gets the right result (i.e. only ever discards free space), even though it uses the wrong ranges to limit what is trimmed. Hence this is not a bug that endangers user data. Fix this by treating the range as a disk address range and use the appropriate functions to convert the range into the desired formats for calculations. Further, fix the first free extent lookup (the longest) to actually find the largest free extent. Currently this lookup uses a <= lookup, which results in finding the extent to the left of the largest because we can never get an exact match on the largest extent. This is due to the fact that while we know it's size, we don't know it's location and so the exact match fails and we move one record to the left to get the next largest extent. Instead, use a >= search so that the lookup returns the largest extent regardless of the fact we don't get an exact match on it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-27 16:07:03 -05:00
Dave Chinner	3948659e30	xfs: Account log unmount transaction correctly There have been a few reports of this warning appearing recently: XFS (dm-4): xlog_space_left: head behind tail tail_cycle = 129, tail_bytes = 20163072 GH cycle = 129, GH bytes = 20162880 The common cause appears to be lots of freeze and unfreeze cycles, and the output from the warnings indicates that we are leaking around 8 bytes of log space per freeze/unfreeze cycle. When we freeze the filesystem, we write an unmount record and that uses xlog_write directly - a special type of transaction, effectively. What it doesn't do, however, is correctly account for the log space it uses. The unmount record writes an 8 byte structure with a special magic number into the log, and the space this consumes is not accounted for in the log ticket tracking the operation. Hence we leak 8 bytes every unmount record that is written. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-26 17:47:24 -05:00
Dave Chinner	5132ba8f2b	xfs: don't cache inodes read through bulkstat When we read inodes via bulkstat, we generally only read them once and then throw them away - they never get used again. If we retain them in cache, then it simply causes the working set of inodes and other cached items to be reclaimed just so the inode cache can grow. Avoid this problem by marking inodes read by bulkstat not to be cached and check this flag in .drop_inode to determine whether the inode should be added to the VFS LRU or not. If the inode lookup hits an already cached inode, then don't set the flag. If the inode lookup hits an inode marked with no cache flag, remove the flag and allow it to be cached once the current reference goes away. Inodes marked as not cached will get cleaned up by the background inode reclaim or via memory pressure, so they will still generate some short term cache pressure. They will, however, be reclaimed much sooner and in preference to cache hot inodes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-26 17:19:08 -05:00
Christoph Hellwig	f616137519	xfs: trace xfs_name strings correctly Strings store in an xfs_name structure are often not NUL terminated, print them using the correct printf specifiers that make use of the string length store in the xfs_name structure. Reported-by: Brian Candler <B.Candler@pobox.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-26 13:58:48 -05:00
Linus Torvalds	49d99a2f9c	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs Pull XFS updates from Ben Myers: "Scalability improvements for dquots, log grant code cleanups, plus bugfixes and cleanups large and small" Fix up various trivial conflicts that were due to some of the earlier patches already having been integrated into v3.3 as bugfixes, and then there were development patches on top of those. Easily merged by just taking the newer version from the pulled branch. * 'for-linus' of git://oss.sgi.com/xfs/xfs: (45 commits) xfs: fallback to vmalloc for large buffers in xfs_getbmap xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get xfs: remove remaining scraps of struct xfs_iomap xfs: fix inode lookup race xfs: clean up minor sparse warnings xfs: remove the global xfs_Gqm structure xfs: remove the per-filesystem list of dquots xfs: use per-filesystem radix trees for dquot lookup xfs: per-filesystem dquot LRU lists xfs: use common code for quota statistics xfs: reimplement fdatasync support xfs: split in-core and on-disk inode log item fields xfs: make xfs_inode_item_size idempotent xfs: log timestamp updates xfs: log file size updates at I/O completion time xfs: log file size updates as part of unwritten extent conversion xfs: do not require an ioend for new EOF calculation xfs: use per-filesystem I/O completion workqueues quota: make Q_XQUOTASYNC a noop xfs: include reservations in quota reporting ...	2012-03-23 09:19:22 -07:00
Dave Chinner	c999a223c2	xfs: introduce an allocation workqueue We currently have significant issues with the amount of stack that allocation in XFS uses, especially in the writeback path. We can easily consume 4k of stack between mapping the page, manipulating the bmap btree and allocating blocks from the free list. Not to mention btree block readahead and other functionality that issues IO in the allocation path. As a result, we can no longer fit allocation in the writeback path in the stack space provided on x86_64. To alleviate this problem, introduce an allocation workqueue and move all allocations to a seperate context. This can be easily added as an interposing layer into xfs_alloc_vextent(), which takes a single argument structure and does not return until the allocation is complete or has failed. To do this, add a work structure and a completion to the allocation args structure. This allows xfs_alloc_vextent to queue the args onto the workqueue and wait for it to be completed by the worker. This can be done completely transparently to the caller. The worker function needs to ensure that it sets and clears the PF_TRANS flag appropriately as it is being run in an active transaction context. Work can also be queued in a memory reclaim context, so a rescuer is needed for the workqueue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-22 16:12:24 -05:00
Dave Chinner	1a1d772433	xfs: Fix open flag handling in open_by_handle code Sparse identified some unsafe handling of open flags in the xfs open by handle ioctl code. Update the code to use the correct access macros to ensure that we handle the open flags correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-22 15:56:52 -05:00
Kamal Dasu	5575acc780	xfs: fix deadlock in xfs_rtfree_extent To fix the deadlock caused by repeatedly calling xfs_rtfree_extent - removed xfs_ilock() and xfs_trans_ijoin() from xfs_rtfree_extent(), instead added asserts that the inode is locked and has an inode_item attached to it. - in xfs_bunmapi() when dealing with an inode with the rt flag call xfs_ilock() and xfs_trans_ijoin() so that the reference count is bumped on the inode and attached it to the transaction before calling into xfs_bmap_del_extent, similar to what we do in xfs_bmap_rtalloc. Signed-off-by: Kamal Dasu <kdasu.kdev@gmail.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-22 15:31:06 -05:00
Gerard Snitselaar	1c2ccc66bc	fs: xfs: fix section mismatch in linux-next xfs_qm_exit() is called in init_xfs_fs(). Signed-off-by: Gerard Snitselaar <dev@snitselaar.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-22 13:48:55 -05:00
Al Viro	48fde701af	switch open-coded instances of d_make_root() to new helper Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-20 21:29:35 -04:00
Al Viro	8de5277879	vfs: check i_nlink limits in vfs_{mkdir,rename_dir,link} New field of struct super_block - ->s_max_links. Maximal allowed value of ->i_nlink or 0; in the latter case all checks still need to be done in ->link/->mkdir/->rename instances. Note that this limit applies both to directoris and to non-directories. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-20 21:29:32 -04:00
Dave Chinner	f074211f60	xfs: fallback to vmalloc for large buffers in xfs_getbmap xfs_getbmap uses for a large buffer for extents, which is kmalloc'd. This can fail after the system has been running for some time as it is a high order allocation. Add a fallback to vmalloc so that it doesn't require contiguous memory and so won't randomly fail on files with large extent lists. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-15 14:54:23 -05:00
Dave Chinner	ad650f5b27	xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get xfsdump uses for a large buffer for extended attributes, which has a kmalloc'd shadow buffer in the kernel. This can fail after the system has been running for some time as it is a high order allocation. Add a fallback to vmalloc so that it doesn't require contiguous memory and so won't randomly fail while xfsdump is running. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-15 14:14:33 -05:00
Dave Chinner	6eb2466036	xfs: remove remaining scraps of struct xfs_iomap Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-15 13:40:16 -05:00
Dave Chinner	f30d500f80	xfs: fix inode lookup race When we get concurrent lookups of the same inode that is not in the per-AG inode cache, there is a race condition that triggers warnings in unlock_new_inode() indicating that we are initialising an inode that isn't in a the correct state for a new inode. When we do an inode lookup via a file handle or a bulkstat, we don't serialise lookups at a higher level through the dentry cache (i.e. pathless lookup), and so we can get concurrent lookups of the same inode. The race condition is between the insertion of the inode into the cache in the case of a cache miss and a concurrently lookup: Thread 1 Thread 2 xfs_iget() xfs_iget_cache_miss() xfs_iread() lock radix tree radix_tree_insert() rcu_read_lock radix_tree_lookup lock inode flags XFS_INEW not set igrab() unlock inode flags rcu_read_unlock use uninitialised inode ..... lock inode flags set XFS_INEW unlock inode flags unlock radix tree xfs_setup_inode() inode flags = I_NEW unlock_new_inode() WARNING as inode flags != I_NEW This can lead to inode corruption, inode list corruption, etc, and is generally a bad thing to occur. Fix this by setting XFS_INEW before inserting the inode into the radix tree. This will ensure any concurrent lookup will find the new inode with XFS_INEW set and that forces the lookup to wait until the XFS_INEW flag is removed before allowing the lookup to succeed. cc: <stable@vger.kernel.org> # for 3.0.x, 3.2.x Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-15 13:16:42 -05:00
Dave Chinner	8d2a5e6ee3	xfs: clean up minor sparse warnings Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-14 13:21:17 -05:00
Christoph Hellwig	a05931ceb0	xfs: remove the global xfs_Gqm structure If we initialize the slab caches for the quota code when XFS is loaded there is no need for a global and reference counted quota manager structure. Drop all this overhead and also fix the error handling during quota initialization. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-14 12:06:32 -05:00
Christoph Hellwig	b84a3a9675	xfs: remove the per-filesystem list of dquots Instead of keeping a separate per-filesystem list of dquots we can walk the radix tree for the two places where we need to iterate all quota structures. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-14 11:53:34 -05:00
Christoph Hellwig	9f920f1164	xfs: use per-filesystem radix trees for dquot lookup Replace the global hash tables for looking up in-memory dquot structures with per-filesystem radix trees to allow scaling to a large number of in-memory dquot structures. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-14 11:09:06 -05:00
Christoph Hellwig	f8739c3ce2	xfs: per-filesystem dquot LRU lists Replace the global dquot lru lists with a per-filesystem one. Note that the shrinker isn't wire up to the per-superblock VFS shrinker infrastructure as would have problems summing up and splitting the counts for inodes and dquots. I don't think this is a major problem as the quota cache isn't as interwinded with the inode cache as the dentry cache is, because an inode that is dropped from the cache will generally release a dquot reference, but most of the time it won't be the last one. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-14 11:09:06 -05:00
Christoph Hellwig	48776fd223	xfs: use common code for quota statistics Switch the quota code over to use the generic XFS statistics infrastructure. While the legacy /proc/fs/xfs/xqm and /proc/fs/xfs/xqmstats interfaces are preserved for now the statistics that still have a meaning with the current code are now also available from /proc/fs/xfs/stats. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-14 11:09:06 -05:00
Christoph Hellwig	8f639ddea0	xfs: reimplement fdatasync support Add an in-memory only flag to say we logged timestamps only, and use it to check if fdatasync can optimize away the log force. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-13 17:18:14 -05:00
Christoph Hellwig	f5d8d5c4bf	xfs: split in-core and on-disk inode log item fields Add a new ili_fields member to the inode log item to isolate the in-memory flags from the ones that actually go to the log. This will allow tracking timestamp-only updates for fdatasync and O_DSYNC in the next patch and prepares for divorcing the on-disk log format from the in-memory log item a little further down the road. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-13 17:08:17 -05:00
Christoph Hellwig	339a5f5dd9	xfs: make xfs_inode_item_size idempotent Move all code messing with the inode log item flags into xfs_inode_item_format to make sure xfs_inode_item_size really only calculates the the number of vectors, but doesn't modify any state of the inode item. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-13 17:05:08 -05:00
Christoph Hellwig	8a9c9980f2	xfs: log timestamp updates Timestamps on regular files are the last metadata that XFS does not update transactionally. Now that we use the delaylog mode exclusively and made the log scode scale extremly well there is no need to bypass that code for timestamp updates. Logging all updates allows to drop a lot of code, and will allow for further performance improvements later on. Note that this patch drops optimized handling of fdatasync - it will be added back in a separate commit. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-13 17:01:15 -05:00
Christoph Hellwig	281627df3e	xfs: log file size updates at I/O completion time Do not use unlogged metadata updates and the VFS dirty bit for updating the file size after writeback. In addition to causing various problems with updates getting delayed for far too long this also drags in the unscalable VFS dirty tracking, and is one of the few remaining unlogged metadata updates. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-13 16:30:49 -05:00
Christoph Hellwig	84803fb782	xfs: log file size updates as part of unwritten extent conversion If we convert and unwritten extent past the current i_size log the size update as part of the extent manipulation transactions instead of doing an unlogged metadata update later. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-05 11:53:16 -06:00
Christoph Hellwig	6923e686f1	xfs: do not require an ioend for new EOF calculation Replace xfs_ioend_new_eof with a new inline xfs_new_eof helper that doesn't require and ioend, and is available also outside of xfs_aops.c. Also make the code a bit more clear by using a normal if statement instead of a slightly misleading MIN(). Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-05 11:19:26 -06:00
Christoph Hellwig	aa6bf01d39	xfs: use per-filesystem I/O completion workqueues The new concurrency managed workqueues are cheap enough that we can create per-filesystem instead of global workqueues. This allows us to remove the trylock or defer scheme on the ilock, which is not helpful once we have outstanding log reservations until finishing a size update. Also allow the default concurrency on this workqueues so that I/O completions blocking on the ilock for one inode do not block process for another inode. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-03-05 11:07:42 -06:00
Christoph Hellwig	8960501191	xfs: include reservations in quota reporting Report all quota usage including the currently pending reservations. This avoids the need to flush delalloc space before gathering quota information, and matches quota enforcement, which already takes the reservations into account. This fixes xfstests 270. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-29 14:09:06 -06:00
Christoph Hellwig	18535a7e01	xfs: merge xfs_qm_export_dquot into xfs_qm_scall_getquota The is no good reason to have these two separate, and for the next change we would need the full struct xfs_dquot in xfs_qm_export_dquot, so better just fold the code now instead of changing it spuriously. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-29 11:57:36 -06:00
Alex Elder	ad637a10f4	xfs: only take the ILOCK in xfs_reclaim_inode() At the end of xfs_reclaim_inode(), the inode is locked in order to we wait for a possible concurrent lookup to complete before the inode is freed. This synchronization step was taking both the ILOCK and the IOLOCK, but the latter was causing lockdep to produce reports of the possibility of deadlock. It turns out that there's no need to acquire the IOLOCK at this point anyway. It may have been required in some earlier version of the code, but there should be no need to take the IOLOCK in xfs_iget(), so there's no (longer) any need to get it here for synchronization. Add an assertion in xfs_iget() as a reminder of this assumption. Dave Chinner diagnosed this on IRC, and Christoph Hellwig suggested no longer including the IOLOCK. I just put together the patch. Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-25 13:55:49 -06:00
Christoph Hellwig	9006fb91cf	xfs: split and cleanup xfs_log_reserve Split the log regrant case out of xfs_log_reserve into a separate function, and merge xlog_grant_log_space and xlog_regrant_write_log_space into their respective callers. Also replace the XFS_LOG_PERM_RESERV flag, which easily got misused before the previous cleanups with a simple boolean parameter. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-22 22:37:04 -06:00
Christoph Hellwig	42ceedb3ca	xfs: share code for grant head availability checks Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-22 22:34:03 -06:00
Christoph Hellwig	e179840d74	xfs: share code for grant head wakeups Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-22 22:31:45 -06:00
Christoph Hellwig	23ee3df349	xfs: share code for grant head waiting Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-22 22:29:39 -06:00
Christoph Hellwig	a79bf2d75b	xfs: add xlog_grant_head_wake_all Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-22 22:26:47 -06:00
Christoph Hellwig	c303c5b8c3	xfs: add xlog_grant_head_init Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-22 22:21:39 -06:00
Christoph Hellwig	28496968a6	xfs: add the xlog_grant_head structure Add a new data structure to allow sharing code between the log grant and regrant code. Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-22 22:19:53 -06:00
Christoph Hellwig	14a7235fba	xfs: remove log space waitqueues The tic->t_wait waitqueues can never have more than a single waiter on them, so we can easily replace them with a task_struct pointer and wake_up_process. Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-22 22:17:00 -06:00
Christoph Hellwig	cfb7cdca0a	xfs: cleanup xfs_log_space_wake Remove the now unused opportunistic parameter, and use the the xlog_writeq_wake and xlog_reserveq_wake helpers now that we don't have to care about the opportunistic wakeups. Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-22 22:17:00 -06:00
Christoph Hellwig	5b03ff1b24	xfs: remove xfs_trans_unlocked_item There is no reason to wake up log space waiters when unlocking inodes or dquots, and the commit log has no explanation for this function either. Given that we now have exact log space wakeups everywhere we can assume the reason for this function was to paper over log space races in earlier XFS versions. Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-22 22:17:00 -06:00
Christoph Hellwig	3af1de753b	xfs: do exact log space wakeups in xlog_ungrant_log_space The only reason that xfs_log_space_wake had to do opportunistic wakeups was that the old xfs_log_move_tail calling convention didn't allow for exact wakeups when not updating the log tail LSN. Since this issue has been fixed we can do exact wakeups now. Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-22 22:17:00 -06:00
Christoph Hellwig	09a423a3d6	xfs: split tail_lsn assignments from log space wakeups Currently xfs_log_move_tail has a tail_lsn argument that is horribly overloaded: it may contain either an actual lsn to assign to the log tail, 0 as a special case to use the last sync LSN, or 1 to indicate that no tail LSN assignment should be performed, and we should opportunisticly wake up at one task waiting for log space even if we did not move the LSN. Remove the tail lsn assigned from xfs_log_move_tail and make the two callers use xlog_assign_tail_lsn instead of the current variant of partially using the code in xfs_log_move_tail and partially opencoding it. Note that means we grow an addition lock roundtrip on the AIL lock for each bulk update or delete, which is still far less than what we had before introducing the bulk operations. If this proves to be a problem we can still add a variant of xlog_assign_tail_lsn that expects the lock to be held already. Also rename the remainder of xfs_log_move_tail to xfs_log_space_wake as that name describes its functionality much better. Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-22 22:17:00 -06:00
Mitsuo Hayasaka	70b5437653	xfs: cleanup quota check on disk blocks and inodes reservations This patch is a cleanup of quota check on disk blocks and inodes reservations, and changes it as follows. (1) add a total_count variable to store the total number of current usages and new reservations for disk blocks and inodes, respectively. (2) make it more readable to check if the local variables softlimit and hardlimit are positive. It has been changed as follows. if (softlimit > 0ULL) -> if (softlimit) if (hardlimit > 0ULL) -> if (hardlimit) This is because they are defined as xfs_qcnt_t which is unsigned. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-22 21:47:52 -06:00
Mitsuo Hayasaka	33e0edafd7	xfs: make inode quota check more general The xfs checks quota when reserving disk blocks and inodes. In the block reservation, it checks if the total number of blocks including current usage and new reservation exceed quota. In the inode reservation, it checks using the total number of inodes including only current usage without new reservation. However, this inode quota check works well since the caller of xfs_trans_dquot() always sets the argument of the number of new inode reservation to 1 or 0 and inode is reserved one by one in current xfs. To make it more general, this patch changes it to the same way as the block quota check. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `c922bbc819`)	2012-02-21 10:13:59 -06:00
Mitsuo Hayasaka	d0a3fe67e3	xfs: change available ranges of softlimit and hardlimit in quota check In general, quota allows us to use disk blocks and inodes up to each limit, that is, they are available if they don't exceed their limitations. Current xfs sets their available ranges to lower than them except disk inode quota check. So, this patch changes the ranges to not beyond them. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `20f12d8ac0`)	2012-02-21 10:13:49 -06:00
Mitsuo Hayasaka	c922bbc819	xfs: make inode quota check more general The xfs checks quota when reserving disk blocks and inodes. In the block reservation, it checks if the total number of blocks including current usage and new reservation exceed quota. In the inode reservation, it checks using the total number of inodes including only current usage without new reservation. However, this inode quota check works well since the caller of xfs_trans_dquot() always sets the argument of the number of new inode reservation to 1 or 0 and inode is reserved one by one in current xfs. To make it more general, this patch changes it to the same way as the block quota check. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-21 10:12:43 -06:00
Mitsuo Hayasaka	20f12d8ac0	xfs: change available ranges of softlimit and hardlimit in quota check In general, quota allows us to use disk blocks and inodes up to each limit, that is, they are available if they don't exceed their limitations. Current xfs sets their available ranges to lower than them except disk inode quota check. So, this patch changes the ranges to not beyond them. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-21 10:12:43 -06:00
Jesper Juhl	f65020a83a	XFS: xfs_trans_add_item() - don't assign in ASSERT() when compare is intended It looks to me like the two ASSERT()s in xfs_trans_add_item() really want to do a compare (==) rather than assignment (=). This patch changes it from the latter to the former. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `05293485a0`)	2012-02-13 17:09:21 -06:00
Jesper Juhl	05293485a0	XFS: xfs_trans_add_item() - don't assign in ASSERT() when compare is intended It looks to me like the two ASSERT()s in xfs_trans_add_item() really want to do a compare (==) rather than assignment (=). This patch changes it from the latter to the former. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-13 17:06:39 -06:00
Christoph Hellwig	92b2e5b31d	xfs: use a normal shrinker for the dquot freelist Stop reusing dquots from the freelist when allocating new ones directly, and implement a shrinker that actually follows the specifications for the interface. The shrinker implementation is still highly suboptimal at this point, but we can gradually work on it. This also fixes an bug in the previous lock ordering, where we would take the hash and dqlist locks inside of the freelist lock against the normal lock ordering. This is only solvable by introducing the dispose list, and thus not when using direct reclaim of unused dquots for new allocations. As a side-effect the quota upper bound and used to free ratio values in /proc/fs/xfs/xqm are set to 0 as these values don't make any sense in the new world order. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `04da0c8196`)	2012-02-10 12:38:09 -06:00
Christoph Hellwig	04da0c8196	xfs: use a normal shrinker for the dquot freelist Stop reusing dquots from the freelist when allocating new ones directly, and implement a shrinker that actually follows the specifications for the interface. The shrinker implementation is still highly suboptimal at this point, but we can gradually work on it. This also fixes an bug in the previous lock ordering, where we would take the hash and dqlist locks inside of the freelist lock against the normal lock ordering. This is only solvable by introducing the dispose list, and thus not when using direct reclaim of unused dquots for new allocations. As a side-effect the quota upper bound and used to free ratio values in /proc/fs/xfs/xqm are set to 0 as these values don't make any sense in the new world order. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-10 12:02:05 -06:00
Chandra Seetharaman	4177af3a8a	Define new macro XFS_ALL_QUOTA_ACTIVE and simply some usage Define new macro XFS_ALL_QUOTA_ACTIVE and simply some usage of quota macros. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-03 11:32:20 -06:00
Chandra Seetharaman	6bd92a239f	Change xfs_sb_from_disk() interface to take a mount pointer Change xfs_sb_from_disk() interface to take a mount pointer instead of a superblock pointer. This is to print mount point specific error messages in future fixes. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-03 11:21:33 -06:00
Chandra Seetharaman	3673141083	Define a new function xfs_inode_dquot() Define a new function xfs_inode_dquot() that takes a inode pointer and a disk quota type and returns the quota pointer for the specified quota type. This simplifies the xfs_qm_dqget() error path significantly. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-03 10:53:39 -06:00
Chandra Seetharaman	6967b964c1	Define a new function xfs_this_quota_on() Create a new function xfs_this_quota_on() that takes a xfs_mount data structure and a disk quota type and returns true if the specified type of quota is ON in the xfs_mount data structure. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-03 09:42:53 -06:00
Amit Sahrawat	b995730845	xfs: kill the unused XFS_BB_FSB_OFFSET macro Removing the macro, as this is no more needed in the code. Tried to find the reference when it was last used - but the usage for this seemed to have been dropped long time ago. Signed-off-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-02-02 17:08:04 -06:00
Mitsuo Hayasaka	021000e59c	xfs: show uuid when mount fails due to duplicate uuid When a system tries to mount a filesystem (FS) using UUID, the xfs returns -EINVAL and shows a message if a FS with the same UUID has been already mounted. It is useful to output the duplicate UUID with it. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-31 13:37:33 -06:00
Mitsuo Hayasaka	4505360376	xfs: pass KM_SLEEP flag to kmem_realloc() in xlog_recover_add_to_cnt_trans() The kmem_realloc() in xfs is given KM_* memory allocation flags. And it allocates memory using kmalloc() after they are converted to gfp_mask flags. In xlog_recover_add_to_cont_trans(), 0u is passed to kmem_realloc(), instead of them. I guess it is preferred to use them, and here memory must be allocated but don't have to be done with GFP_ATOMIC. So, this patch changes it to KM_SLEEP. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-31 12:11:18 -06:00
Jan Kara	9b025eb3a8	xfs: Fix missing xfs_iunlock() on error recovery path in xfs_readlink() Commit `b52a360b` forgot to call xfs_iunlock() when it detected corrupted symplink and bailed out. Fix it by jumping to 'out' instead of doing return. CC: stable@kernel.org CC: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Alex Elder <elder@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-25 11:01:31 -06:00
Christoph Hellwig	d060646436	xfs: cleanup xfs_file_aio_write With all the size field updates out of the way xfs_file_aio_write can be further simplified by pushing all iolock handling into xfs_file_dio_aio_write and xfs_file_buffered_aio_write and using the generic generic_write_sync helper for synchronous writes. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-17 15:12:33 -06:00
Christoph Hellwig	5bf1f26227	xfs: always return with the iolock held from xfs_file_aio_write_checks While xfs_iunlock is fine with 0 lockflags the calling conventions are much cleaner if xfs_file_aio_write_checks never returns without the iolock held. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-17 15:11:07 -06:00
Christoph Hellwig	2813d682e8	xfs: remove the i_new_size field in struct xfs_inode Now that we use the VFS i_size field throughout XFS there is no need for the i_new_size field any more given that the VFS i_size field gets updated in ->write_end before unlocking the page, and thus is always uptodate when writeback could see a page. Removing i_new_size also has the advantage that we will never have to trim back di_size during a failed buffered write, given that it never gets updated past i_size. Note that currently the generic direct I/O code only updates i_size after calling our end_io handler, which requires a small workaround to make sure di_size actually makes it to disk. I hope to fix this properly in the generic code. A downside is that we lose the support for parallel non-overlapping O_DIRECT appending writes that recently was added. I don't think keeping the complex and fragile i_new_size infrastructure for this is a good tradeoff - if we really care about parallel appending writers we should investigate turning the iolock into a range lock, which would also allow for parallel non-overlapping buffered writers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-17 15:10:19 -06:00
Christoph Hellwig	ce7ae151dd	xfs: remove the i_size field in struct xfs_inode There is no fundamental need to keep an in-memory inode size copy in the XFS inode. We already have the on-disk value in the dinode, and the separate in-memory copy that we need for regular files only in the XFS inode. Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the VFS inode i_size field for regular files. Switch code that was directly accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where we are limited to regular files direct access of the VFS inode i_size field. This also allows dropping some fairly complicated code in the write path which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size that is getting updated inside ->write_end. Note that we do not bother resetting the VFS i_size when truncating a file that gets freed to zero as there is no point in doing so because the VFS inode is no longer in use at this point. Just relax the assert in xfs_ifree to only check the on-disk size instead. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-17 15:08:53 -06:00
Christoph Hellwig	f392e6319a	xfs: replace i_pin_wait with a bit waitqueue Replace i_pin_wait, which is only used during synchronous inode flushing with a bit waitqueue. This trades off a much smaller inode against slightly slower wakeup performance, and saves 12 (32-bit) or 20 (64-bit) bytes in the XFS inode. Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-17 15:07:54 -06:00
Christoph Hellwig	474fce0675	xfs: replace i_flock with a sleeping bitlock We almost never block on i_flock, the exception is synchronous inode flushing. Instead of bloating the inode with a 16/24-byte completion that we abuse as a semaphore just implement it as a bitlock that uses a bit waitqueue for the rare sleeping path. This primarily is a tradeoff between a much smaller inode and a faster non-blocking path vs faster wakeups, and we are much better off with the former. A small downside is that we will lose lockdep checking for i_flock, but given that it's always taken inside the ilock that should be acceptable. Note that for example the inode writeback locking is implemented in a very similar way. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-17 15:06:45 -06:00
Christoph Hellwig	49e4c70e52	xfs: make i_flags an unsigned long To be used for bit wakeup i_flags needs to be an unsigned long or we'll run into trouble on big endian systems. Because of the 1-byte i_update field right after it this actually causes a fairly large size increase on its own (4 or 8 bytes), but that increase will be more than offset by the next two patches. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-17 15:03:50 -06:00
Christoph Hellwig	8096b1ebb5	xfs: remove the if_ext_max field in struct xfs_ifork We spent a lot of effort to maintain this field, but it always equals to the fork size divided by the constant size of an extent. The prime use of it is to assert that the two stay in sync. Just divide the fork size by the extent size in the few places that we actually use it and remove the overhead of maintaining it. Also introduce a few helpers to consolidate the places where we actually care about the value. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-17 15:02:28 -06:00
Christoph Hellwig	3d2b3129c2	xfs: remove the unused dm_attrs structure .. and the just as dead bhv_desc forward declaration while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-13 12:11:46 -06:00
Christoph Hellwig	bf322d983e	xfs: cleanup xfs_iomap_eof_align_last_fsb Replace the nasty if, else if, elseif condition with more natural C flow that expressed the logic we want here better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-13 12:11:46 -06:00
Christoph Hellwig	673e8e597c	xfs: remove xfs_itruncate_data This wrapper isn't overly useful, not to say rather confusing. Around the call to xfs_itruncate_extents it does: - add tracing - add a few asserts in debug builds - conditionally update the inode size in two places - log the inode Both the tracing and the inode logging can be moved to xfs_itruncate_extents as they are useful for the attribute fork as well - in fact the attr code already does an equivalent xfs_trans_log_inode call just after calling xfs_itruncate_extents. The conditional size updates are a mess, and there was no reason to do them in two places anyway, as the first one was conditional on the inode having extents - but without extents we xfs_itruncate_extents would be a no-op and the placement wouldn't matter anyway. Instead move the size assignments and the asserts that make sense to the callers that want it. As a side effect of this clean up xfs_setattr_size by introducing variables for the old and new inode size, and moving the size updates into a common place. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-13 12:11:45 -06:00
Linus Torvalds	993ecff81a	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: fix endian conversion issue in discard code	2012-01-09 12:50:15 -08:00
Linus Torvalds	98793265b4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits) Kconfig: acpi: Fix typo in comment. misc latin1 to utf8 conversions devres: Fix a typo in devm_kfree comment btrfs: free-space-cache.c: remove extra semicolon. fat: Spelling s/obsolate/obsolete/g SCSI, pmcraid: Fix spelling error in a pmcraid_err() call tools/power turbostat: update fields in manpage mac80211: drop spelling fix types.h: fix comment spelling for 'architectures' typo fixes: aera -> area, exntension -> extension devices.txt: Fix typo of 'VMware'. sis900: Fix enum typo 'sis900_rx_bufer_status' decompress_bunzip2: remove invalid vi modeline treewide: Fix comment and string typo 'bufer' hyper-v: Update MAINTAINERS treewide: Fix typos in various parts of the kernel, and fix some comments. clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR gpio: Kconfig: drop unknown symbol 'CS5535_GPIO' leds: Kconfig: Fix typo 'D2NET_V2' sound: Kconfig: drop unknown symbol ARCH_CLPS7500 ... Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new kconfig additions, close to removed commented-out old ones)	2012-01-08 13:21:22 -08:00
Linus Torvalds	eb59c505f8	Merge branch 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm * 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits) PM / Hibernate: Implement compat_ioctl for /dev/snapshot PM / Freezer: fix return value of freezable_schedule_timeout_killable() PM / shmobile: Allow the A4R domain to be turned off at run time PM / input / touchscreen: Make st1232 use device PM QoS constraints PM / QoS: Introduce dev_pm_qos_add_ancestor_request() PM / shmobile: Remove the stay_on flag from SH7372's PM domains PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode PM: Drop generic_subsys_pm_ops PM / Sleep: Remove forward-only callbacks from AMBA bus type PM / Sleep: Remove forward-only callbacks from platform bus type PM: Run the driver callback directly if the subsystem one is not there PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412. PM / Sleep: Merge internal functions in generic_ops.c PM / Sleep: Simplify generic system suspend callbacks PM / Hibernate: Remove deprecated hibernation snapshot ioctls PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled() ARM: S3C64XX: Implement basic power domain support PM / shmobile: Use common always on power domain governor ... Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused XBT_FORCE_SLEEP bit	2012-01-08 13:10:57 -08:00
Linus Torvalds	29ad0de279	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (22 commits) xfs: mark the xfssyncd workqueue as non-reentrant xfs: simplify xfs_qm_detach_gdquots xfs: fix acl count validation in xfs_acl_from_disk() xfs: remove unused XBT_FORCE_SLEEP bit xfs: remove XFS_QMOPT_DQSUSER xfs: kill xfs_qm_idtodq xfs: merge xfs_qm_dqinit_core into the only caller xfs: add a xfs_dqhold helper xfs: simplify xfs_qm_dqattach_grouphint xfs: nest qm_dqfrlist_lock inside the dquot qlock xfs: flatten the dquot lock ordering xfs: implement lazy removal for the dquot freelist xfs: remove XFS_DQ_INACTIVE xfs: cleanup xfs_qm_dqlookup xfs: cleanup dquot locking helpers xfs: remove the sync_mode argument to xfs_qm_dqflush_all xfs: remove xfs_qm_sync xfs: make sure to really flush all dquots in xfs_qm_quotacheck xfs: untangle SYNC_WAIT and SYNC_TRYLOCK meanings for xfs_qm_dqflush xfs: remove the lid_size field in struct log_item_desc ... Fix up trivial conflict in fs/xfs/xfs_sync.c	2012-01-08 13:05:29 -08:00
Al Viro	34c80b1d93	vfs: switch ->show_options() to struct dentry * Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-06 23:19:54 -05:00
Al Viro	576b1d67ce	xfs: propagate umode_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:55:00 -05:00
Al Viro	1a67aafb5f	switch ->mknod() to umode_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:54 -05:00
Al Viro	4acdaf27eb	switch ->create() to umode_t vfs_create() ignores everything outside of 16bit subset of its mode argument; switching it to umode_t is obviously equivalent and it's the only caller of the method Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:53 -05:00
Al Viro	18bb1db3e7	switch vfs_mkdir() and ->mkdir() to umode_t vfs_mkdir() gets int, but immediately drops everything that might not fit into umode_t and that's the only caller of ->mkdir()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:53 -05:00
Al Viro	6b520e0565	vfs: fix the stupidity with i_dentry in inode destructors Seeing that just about every destructor got that INIT_LIST_HEAD() copied into it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once(); the cost of taking it into inode_init_always() will be negligible for pipes and sockets and negative for everything else. Not to mention the removal of boilerplate code from ->destroy_inode() instances... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:52:40 -05:00
Al Viro	2a79f17e4a	vfs: mnt_drop_write_file() new helper (wrapper around mnt_drop_write()) to be used in pair with mnt_want_write_file(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:52:40 -05:00
Al Viro	a561be7100	switch a bunch of places to mnt_want_write_file() it's both faster (in case when file has been opened for write) and cleaner. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:52:35 -05:00
Dave Chinner	b1c770c273	xfs: fix endian conversion issue in discard code When finding the longest extent in an AG, we read the value directly out of the AGF buffer without endian conversion. This will give an incorrect length, resulting in FITRIM operations potentially not trimming everything that it should. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-03 11:39:55 -06:00
Christoph Hellwig	be4f1ac828	xfs: log all dirty inodes in xfs_fs_sync_fs Since Linux 2.6.36 the writeback code has introduces various measures for live lock prevention during sync(). Unfortunately some of these are actively harmful for the XFS model, where the inode gets marked dirty for metadata from the data I/O handler. The older_than_this checks that are now more strictly enforced since writeback: avoid livelocking WB_SYNC_ALL writeback by only calling into __writeback_inodes_sb and thus only sampling the current cut off time once. But on a slow enough devices the previous asynchronous sync pass might not have fully completed yet, and thus XFS might mark metadata dirty only after that sampling of the cut off time for the blocking pass already happened. I have not myself reproduced this myself on a real system, but by introducing artificial delay into the XFS I/O completion workqueues it can be reproduced easily. Fix this by iterating over all XFS inodes in ->sync_fs and log all that are dirty. This might log inode that only got redirtied after the previous pass, but given how cheap delayed logging of inodes is it isn't a major concern for performance. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Tested-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-23 16:41:47 -06:00
Christoph Hellwig	0b8fd3033c	xfs: log the inode in ->write_inode calls for kupdate If the writeback code writes back an inode because it has expired we currently use the non-blockin ->write_inode path. This means any inode that is pinned is skipped. With delayed logging and a workload that has very little log traffic otherwise it is very likely that an inode that gets constantly written to is always pinned, and thus we keep refusing to write it. The VM writeback code at that point redirties it and doesn't try to write it again for another 30 seconds. This means under certain scenarious time based metadata writeback never happens. Fix this by calling into xfs_log_inode for kupdate in addition to data integrity syncs, and thus transfer the inode to the log ASAP. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Tested-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-23 16:41:47 -06:00
Rafael J. Wysocki	b00f4dc5ff	Merge branch 'master' into pm-sleep * master: (848 commits) SELinux: Fix RCU deref check warning in sel_netport_insert() binary_sysctl(): fix memory leak mm/vmalloc.c: remove static declaration of va from __get_vm_area_node ipmi_watchdog: restore settings when BMC reset oom: fix integer overflow of points in oom_badness memcg: keep root group unchanged if creation fails nilfs2: potential integer overflow in nilfs_ioctl_clean_segments() nilfs2: unbreak compat ioctl cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask evm: prevent racing during tfm allocation evm: key must be set once during initialization mmc: vub300: fix type of firmware_rom_wait_states module parameter Revert "mmc: enable runtime PM by default" mmc: sdhci: remove "state" argument from sdhci_suspend_host x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT IB/qib: Correct sense on freectxts increment and decrement RDMA/cma: Verify private data length cgroups: fix a css_set not found bug in cgroup_attach_proc oprofile: Fix uninitialized memory access when writing to writing to oprofilefs Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel" ... Conflicts: kernel/cgroup_freezer.c	2011-12-21 21:59:45 +01:00
Christoph Hellwig	40d344ec5e	xfs: mark the xfssyncd workqueue as non-reentrant On a system with lots of memory pressure that is stuck on synchronous inode reclaim the workqueue code will run one instance of the inode reclaim work item on every CPU. which is not what we want. Make sure to mark the xfssyncd workqueue as non-reentrant to make sure there only is one instace of each running globally. Also stop using special paramater for the workqueue; now that we guarantee each fs has only running one of each works at a time there is no need to artificially lower max_active and compensate for that by setting the WQ_CPU_INTENSIVE flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-19 22:17:01 -06:00
Christoph Hellwig	28fb588c9b	xfs: simplify xfs_qm_detach_gdquots There is no reason to drop qi_dqlist_lock around calls to xfs_qm_dqrele because the free list lock now nests inside qi_dqlist_lock and the dquot lock. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-16 15:33:30 -06:00
Xi Wang	093019cf1b	xfs: fix acl count validation in xfs_acl_from_disk() Commit `fa8b18ed` didn't prevent the integer overflow and possible memory corruption. "count" can go negative and bypass the check. Signed-off-by: Xi Wang <xi.wang@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-16 15:17:42 -06:00
Eric Sandeen	687d1c5e8e	xfs: remove unused XBT_FORCE_SLEEP bit XBT_FORCE_SLEEP is no longer ever tested; it is only set and cleared. Remove it. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-16 15:12:33 -06:00
Christoph Hellwig	7ae4440723	xfs: remove XFS_QMOPT_DQSUSER Just read the id 0 dquot from disk directly in xfs_qm_init_quotainfo instead of going through dqget and requiring a special flag to not add the dquot to any lists. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-15 14:38:30 -06:00
Christoph Hellwig	97e7ade506	xfs: kill xfs_qm_idtodq This function doesn't help the code flow, so merge the dquot allocation and transaction handling into xfs_qm_dqread. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-15 14:37:32 -06:00
Christoph Hellwig	49d35a5cf1	xfs: merge xfs_qm_dqinit_core into the only caller Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-15 14:37:32 -06:00
Christoph Hellwig	78e55892d6	xfs: add a xfs_dqhold helper Factor the common pattern of: xfs_dqlock(dqp); XFS_DQHOLD(dqp); xfs_dqunlock(dqp); into a new helper, and remove XFS_DQHOLD now that only one other caller is left. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-15 14:37:32 -06:00
Christoph Hellwig	ab680bb739	xfs: simplify xfs_qm_dqattach_grouphint No need to play games with the qlock now that the freelist lock nests inside it. Also clean up various outdated comments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-15 10:04:31 -06:00
Christoph Hellwig	bf72de3194	xfs: nest qm_dqfrlist_lock inside the dquot qlock Allow xfs_qm_dqput to work without trylock loops by nesting the freelist lock inside the dquot qlock. In turn that requires trylocks in the reclaim path instead, but given it's a classic tradeoff between fast and slow path, and we follow the model of the inode and dentry caches. Document our new lock order now that it has settled. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-14 21:15:42 -06:00
Christoph Hellwig	92678554ab	xfs: flatten the dquot lock ordering Introduce a new XFS_DQ_FREEING flag that tells lookup and mplist walks to skip a dquot that is beeing freed, and use this avoid the trylock on the hash and mplist locks in xfs_qm_dqreclaim_one. Also simplify xfs_dqpurge by moving the inodes to a dispose list after marking them XFS_DQ_FREEING and avoid the locker ordering constraints. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-14 16:32:21 -06:00
Christoph Hellwig	be7ffc38a8	xfs: implement lazy removal for the dquot freelist Do not remove dquots from the freelist when we grab a reference to them in xfs_qm_dqlookup, but leave them on the freelist util scanning notices that they have a reference. This speeds up the lookup fastpath, and greatly simplifies the lock ordering constraints. Note that the same scheme is used by the VFS inode and dentry caches. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-13 16:46:28 -06:00
Christoph Hellwig	80a376bfb7	xfs: remove XFS_DQ_INACTIVE Free dquots when purging them during umount instead of keeping them around on the freelist in a degraded state. The out of order locking in xfs_qm_dqpurge will be removed again later in this series. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-13 14:55:54 -06:00
Christoph Hellwig	497507b9ee	xfs: cleanup xfs_qm_dqlookup Rearrange the code to avoid the conditional locking around the flist_locked variable. This means we lose a (rather pointless) assert, and hold the freelist lock a bit longer for one corner case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-13 11:43:35 -06:00
Christoph Hellwig	800b484ec0	xfs: cleanup dquot locking helpers Mark the trivial lock wrappers as inline, and make the naming consistent for all of them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-12 17:28:20 -06:00
Christoph Hellwig	a7ef9bd79f	xfs: remove the sync_mode argument to xfs_qm_dqflush_all It always is zero, and removing it will make future changes easier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-12 16:46:20 -06:00
Christoph Hellwig	34625c661b	xfs: remove xfs_qm_sync Now that we can't have any dirty dquots around that aren't in the AIL we can get rid of the explicit dquot syncing from xfssyncd and xfs_fs_sync_fs and instead rely on AIL pushing to write out any quota updates. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-12 16:41:44 -06:00
Christoph Hellwig	f2fba558d3	xfs: make sure to really flush all dquots in xfs_qm_quotacheck Make sure we do not skip any dquots when flushing them out after a quotacheck to make sure that we will never have any dirty dquots on a live filesystem. At this point no dquot should be pinnable, but lets be pedantic about it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-12 16:39:22 -06:00
Christoph Hellwig	fdedf28b94	xfs: untangle SYNC_WAIT and SYNC_TRYLOCK meanings for xfs_qm_dqflush Only skip pinned dquots if SYNC_TRYLOCK is specified, and adjust the callers to keep the behaviour unchanged. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-12 16:31:01 -06:00
Christoph Hellwig	b39342134a	xfs: remove the lid_size field in struct log_item_desc Outside the now removed nodelaylog code this field is only used for asserts and can be safely removed now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-08 13:53:30 -06:00
Christoph Hellwig	0244b9603d	xfs: cleanup the transaction commit path a bit Now that the nodelaylog mode is gone we can simplify the transaction commit path a bit by removing the xfs_trans_commit_cil routine. Restoring the process flags is merged into xfs_trans_commit which already does it for the error path, and allocating the log vectors is merged into xlog_cil_format_items, which already fills them with data, thus avoiding one loop over all log items. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-08 13:53:30 -06:00
Christoph Hellwig	93b8a5854f	xfs: remove the deprecated nodelaylog option The delaylog mode has been the default for a long time, and the nodelaylog option has been scheduled for removal in Linux 3.3. Remove it and code only used by it now that we have opened the 3.3 window. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-08 12:30:32 -06:00
Christoph Hellwig	9f9c19ec1a	xfs: fix the logspace waiting algorithm Apply the scheme used in log_regrant_write_log_space to wake up any other threads waiting for log space before the newly added one to log_regrant_write_log_space as well, and factor the code into readable helpers. For each of the queues we have add two helpers: - one to try to wake up all waiting threads. This helper will also be usable by xfs_log_move_tail once we remove the current opportunistic wakeups in it. - one to sleep on t_wait until enough log space is available, loosely modelled after Linux waitqueues. And use them to reimplement the guts of log_regrant_write_log_space and log_regrant_write_log_space. These two function now use one and the same algorithm for waiting on log space instead of subtly different ones before, with an option to completely unify them in the near future. Also move the filesystem shutdown handling to the common caller given that we had to touch it anyway. Based on hard debugging and an earlier patch from Chandra Seetharaman <sekharan@us.ibm.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com> Tested-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-06 14:19:47 -06:00
Christoph Hellwig	c29f7d457a	xfs: fix nfs export of 64-bit inodes numbers on 32-bit kernels The i_ino field in the VFS inode is of type unsigned long and thus can't hold the full 64-bit inode number on 32-bit kernels. We have the full inode number in the XFS inode, so use that one for nfs exports. Note that I've also switched the 32-bit file handles types to it, just to make the code more consistent and copy & paste errors less likely to happen. Reported-by: Guoquan Yang <ygq51@hotmail.com> Reported-by: Hank Peng <pengxihan@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-12-06 10:46:23 -06:00
Paul Bolle	90802ed9c3	treewide: Fix comment and string typo 'bufer' Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-12-06 09:53:40 +01:00
Dave Chinner	a99ebf43f4	xfs: fix allocation length overflow in xfs_bmapi_write() When testing the new xfstests --large-fs option that does very large file preallocations, this assert was tripped deep in xfs_alloc_vextent(): XFS: Assertion failed: args->minlen <= args->maxlen, file: fs/xfs/xfs_alloc.c, line: 2239 The allocation was trying to allocate a zero length extent because the lower 32 bits of the allocation length was zero. The remaining length of the allocation to be done was an exact multiple of 2^32 - the first case I saw was at 496TB remaining to be allocated. This turns out to be an overflow when converting the allocation length (a 64 bit quantity) into the extent length to allocate (a 32 bit quantity), and it requires the length to be allocated an exact multiple of 2^32 blocks to trip the assert. Fix it by limiting the extent lenth to allocate to MAXEXTLEN. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-12-02 16:24:02 -06:00
Justin P. Mattock	42b2aa86c6	treewide: Fix typos in various parts of the kernel, and fix some comments. The below patch fixes some typos in various parts of the kernel, as well as fixes some comments. Please let me know if I missed anything, and I will try to get it changed and resent. Signed-off-by: Justin P. Mattock <justinmattock@gmail.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-12-02 14:57:31 +01:00
Christoph Hellwig	4c393a6059	xfs: fix attr2 vs large data fork assert With Dmitry fsstress updates I've seen very reproducible crashes in xfs_attr_shortform_remove because xfs_attr_shortform_bytesfit claims that the attributes would not fit inline into the inode after removing an attribute. It turns out that we were operating on an inode with lots of delalloc extents, and thus an if_bytes values for the data fork that is larger than biggest possible on-disk storage for it which utterly confuses the code near the end of xfs_attr_shortform_bytesfit. Fix this by always allowing the current attribute fork, like we already do for the attr1 format, given that delalloc conversion will take care for moving either the data or attribute area out of line if it doesn't fit at that point - or making the point moot by merging extents at this point. Also document the function better, and clean up some loose bits. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-11-29 13:03:12 -06:00
Christoph Hellwig	4dd2cb4a28	xfs: force buffer writeback before blocking on the ilock in inode reclaim If we are doing synchronous inode reclaim we block the VM from making progress in memory reclaim. So if we encouter a flush locked inode promote it in the delwri list and wake up xfsbufd to write it out now. Without this we can get hangs of up to 30 seconds during workloads hitting synchronous inode reclaim. The scheme is copied from what we do for dquot reclaims. Reported-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-11-29 12:06:14 -06:00
Christoph Hellwig	fa8b18edd7	xfs: validate acl count This prevents in-memory corruption and possible panics if the on-disk ACL is badly corrupted. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-11-28 22:14:24 -06:00
Tejun Heo	a0acae0e88	freezer: unexport refrigerator() and update try_to_freeze() slightly There is no reason to export two functions for entering the refrigerator. Calling refrigerator() instead of try_to_freeze() doesn't save anything noticeable or removes any race condition. * Rename refrigerator() to __refrigerator() and make it return bool indicating whether it scheduled out for freezing. * Update try_to_freeze() to return bool and relay the return value of __refrigerator() if freezing(). * Convert all refrigerator() users to try_to_freeze(). * Update documentation accordingly. * While at it, add might_sleep() to try_to_freeze(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Samuel Ortiz <samuel@sortiz.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Christoph Hellwig <hch@infradead.org>	2011-11-21 12:32:22 -08:00
Mitsuo Hayasaka	db3e74b582	xfs: use doalloc flag in xfs_qm_dqattach_one() The doalloc arg in xfs_qm_dqattach_one() is a flag that indicates whether a new area to handle quota information will be allocated if needed. Originally, it was passed to xfs_qm_dqget(), but has been removed by the following commit (probably by mistake): commit `8e9b6e7fa4` Author: Christoph Hellwig <hch@lst.de> Date: Sun Feb 8 21:51:42 2009 +0100 xfs: remove the unused XFS_QMOPT_DQLOCK flag As the result, xfs_qm_dqget() called from xfs_qm_dqattach_one() never allocates the new area even if it is needed. This patch gives the doalloc arg to xfs_qm_dqget() in xfs_qm_dqattach_one() to fix this problem. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Alex Elder <aelder@sgi.com> Cc: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2011-11-15 14:45:09 -06:00
Christoph Hellwig	810627d9a6	xfs: fix force shutdown handling in xfs_end_io Ensure ioend->io_error gets propagated back to e.g. AIO completions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-11-08 10:48:23 -06:00
Christoph Hellwig	272e42b215	xfs: constify xfs_item_ops The log item ops aren't nessecarily the biggest exploit vector, but marking them const is easy enough. Also remove the unused xfs_item_ops_t typedef while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-11-08 10:48:23 -06:00
Carlos Maiolino	b52a360b2a	xfs: Fix possible memory corruption in xfs_readlink Fixes a possible memory corruption when the link is larger than MAXPATHLEN and XFS_DEBUG is not enabled. This also remove the S_ISLNK assert, since the inode mode is checked previously in xfs_readlink_by_handle() and via VFS. Updated to address concerns raised by Ben Hutchings about the loose attention paid to 32- vs 64-bit values, and the lack of handling a potentially negative pathlen value: - Changed type of "pathlen" to be xfs_fsize_t, to match that of ip->i_d.di_size - Added checking for a negative pathlen to the too-long pathlen test, and generalized the message that gets reported in that case to reflect the change As a result, if a negative pathlen were encountered, this function would return EFSCORRUPTED (and would fail an assertion for a debug build)--just as would a too-long pathlen. Signed-off-by: Alex Elder <aelder@sgi.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-11-08 10:48:23 -06:00
Miklos Szeredi	bfe8684869	filesystems: add set_nlink() Replace remaining direct i_nlink updates with a new set_nlink() updater function. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-11-02 12:53:43 +01:00
Joe Perches	b9075fa968	treewide: use __printf not __attribute__((format(printf,...))) Standardize the style for compiler based printf format verification. Standardized the location of __printf too. Done via script and a little typing. $ grep -rPl --include=.[ch] -w "__attribute__" \| \ grep -vP "^(tools\|scripts\|include/linux/compiler-gcc.h)" \| \ xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s$\s\(\sformat\s\(\sprintf\s,\s(.+)\s,\s(.+)\s$\s\)\s\)/__printf($1, $2)/g ; print; }' [akpm@linux-foundation.org: revert arch bits] Signed-off-by: Joe Perches <joe@perches.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:54 -07:00
Mel Gorman	94054fa3fc	xfs: warn if direct reclaim tries to writeback pages Direct reclaim should never writeback pages. For now, handle the situation and warn about it. Ultimately, this will be a BUG_ON. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:46 -07:00
Linus Torvalds	5619a69396	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (69 commits) xfs: add AIL pushing tracepoints xfs: put in missed fix for merge problem xfs: do not flush data workqueues in xfs_flush_buftarg xfs: remove XFS_bflush xfs: remove xfs_buf_target_name xfs: use xfs_ioerror_alert in xfs_buf_iodone_callbacks xfs: clean up xfs_ioerror_alert xfs: clean up buffer allocation xfs: remove buffers from the delwri list in xfs_buf_stale xfs: remove XFS_BUF_STALE and XFS_BUF_SUPER_STALE xfs: remove XFS_BUF_SET_VTYPE and XFS_BUF_SET_VTYPE_REF xfs: remove XFS_BUF_FINISH_IOWAIT xfs: remove xfs_get_buftarg_list xfs: fix buffer flushing during unmount xfs: optimize fsync on directories xfs: reduce the number of log forces from tail pushing xfs: Don't allocate new buffers on every call to _xfs_buf_find xfs: simplify xfs_trans_ijoin* again xfs: unlock the inode before log force in xfs_change_file_space xfs: unlock the inode before log force in xfs_fs_nfs_commit_metadata ...	2011-10-28 10:31:42 -07:00
Linus Torvalds	59e5253417	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits) MAINTAINERS: linux-m32r is moderated for non-subscribers linux@lists.openrisc.net is moderated for non-subscribers Drop default from "DM365 codec select" choice parisc: Kconfig: cleanup Kernel page size default Kconfig: remove redundant CONFIG_ prefix on two symbols cris: remove arch/cris/arch-v32/lib/nand_init.S microblaze: add missing CONFIG_ prefixes h8300: drop puzzling Kconfig dependencies MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers tty: drop superfluous dependency in Kconfig ARM: mxc: fix Kconfig typo 'i.MX51' Fix file references in Kconfig files aic7xxx: fix Kconfig references to READMEs Fix file references in drivers/ide/ thinkpad_acpi: Fix printk typo 'bluestooth' bcmring: drop commented out line in Kconfig btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888' doc: raw1394: Trivial typo fix CIFS: Don't free volume_info->UNC until we are entirely done with it. treewide: Correct spelling of successfully in comments ...	2011-10-25 12:11:02 +02:00
Linus Torvalds	36b8d186e6	Merge branch 'next' of git://selinuxproject.org/~jmorris/linux-security * 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits) TOMOYO: Fix incomplete read after seek. Smack: allow to access /smack/access as normal user TOMOYO: Fix unused kernel config option. Smack: fix: invalid length set for the result of /smack/access Smack: compilation fix Smack: fix for /smack/access output, use string instead of byte Smack: domain transition protections (v3) Smack: Provide information for UDS getsockopt(SO_PEERCRED) Smack: Clean up comments Smack: Repair processing of fcntl Smack: Rule list lookup performance Smack: check permissions from user space (v2) TOMOYO: Fix quota and garbage collector. TOMOYO: Remove redundant tasklist_lock. TOMOYO: Fix domain transition failure warning. TOMOYO: Remove tomoyo_policy_memory_lock spinlock. TOMOYO: Simplify garbage collector. TOMOYO: Fix make namespacecheck warnings. target: check hex2bin result encrypted-keys: check hex2bin result ...	2011-10-25 09:45:31 +02:00
Christoph Hellwig	9e4c109ac8	xfs: add AIL pushing tracepoints Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-18 15:12:04 -05:00
Alex Elder	2900b33999	xfs: put in missed fix for merge problem I intended to do this as part of fixing part of the conflict with the merge with Linus' tree, but evidently it didn't get included in the commit. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-10-18 20:00:14 +00:00
Alex Elder	9508534c5f	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux Resolved conflicts: fs/xfs/xfs_trans_priv.h: - deleted struct xfs_ail field xa_flags - kept field xa_log_flush in struct xfs_ail fs/xfs/xfs_trans_ail.c: - in xfsaild_push(), in XFS_ITEM_PUSHBUF case, replaced "flush_log = 1" with "ailp->xa_log_flush++" Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-17 15:42:02 -05:00
Christoph Hellwig	5a93a064d2	xfs: do not flush data workqueues in xfs_flush_buftarg When we call xfs_flush_buftarg (generally from sync or umount) it already is too late to flush the data workqueues, as I/O completion is signalled for them and we are thus already done with the data we would flush here. There are places where flushing them might be useful, but the current sync interface doesn't give us that opportunity. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 22:34:31 -05:00
Christoph Hellwig	a9add83e5a	xfs: remove XFS_bflush Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:11 -05:00
Christoph Hellwig	02b102df15	xfs: remove xfs_buf_target_name The calling convention that returns a pointer to a static buffer is fairly nasty, so just opencode it in the only caller that is left. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:11 -05:00
Christoph Hellwig	b38505b09b	xfs: use xfs_ioerror_alert in xfs_buf_iodone_callbacks Use xfs_ioerror_alert instead of opencoding a very similar error message. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:10 -05:00
Christoph Hellwig	901796afca	xfs: clean up xfs_ioerror_alert Instead of passing the block number and mount structure explicitly get them off the bp and fix make the argument order more natural. Also move it to xfs_buf.c and stop printing the device name given that we already get the fs name as part of xfs_alert, and we know what device is operates on because of the caller that gets printed, finally rename it to xfs_buf_ioerror_alert and pass __func__ as argument where it makes sense. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:10 -05:00
Christoph Hellwig	4347b9d7ad	xfs: clean up buffer allocation Change _xfs_buf_initialize to allocate the buffer directly and rename it to xfs_buf_alloc now that is the only buffer allocation routine. Also remove the xfs_buf_deallocate wrapper around the kmem_zone_free calls for buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:10 -05:00
Christoph Hellwig	af5c4bee49	xfs: remove buffers from the delwri list in xfs_buf_stale For each call to xfs_buf_stale we call xfs_buf_delwri_dequeue either directly before or after it, or are guaranteed by the surrounding conditionals that we are never called on delwri buffers. Simply this situation by moving the call to xfs_buf_delwri_dequeue into xfs_buf_stale. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:10 -05:00
Christoph Hellwig	c867cb6164	xfs: remove XFS_BUF_STALE and XFS_BUF_SUPER_STALE Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:10 -05:00
Christoph Hellwig	38f2323244	xfs: remove XFS_BUF_SET_VTYPE and XFS_BUF_SET_VTYPE_REF Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:09 -05:00
Christoph Hellwig	5fde0326dd	xfs: remove XFS_BUF_FINISH_IOWAIT Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:09 -05:00
Christoph Hellwig	b17b833443	xfs: remove xfs_get_buftarg_list The code is unused and under a config option that doesn't exist, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:09 -05:00
Christoph Hellwig	87c7bec7fc	xfs: fix buffer flushing during unmount The code to flush buffers in the umount code is a bit iffy: we first flush all delwri buffers out, but then might be able to queue up a new one when logging the sb counts. On a normal shutdown that one would get flushed out when doing the synchronous superblock write in xfs_unmountfs_writesb, but we skip that one if the filesystem has been shut down. Fix this by moving the delwri list flushing until just before unmounting the log, and while we're at it also remove the superflous delwri list and buffer lru flusing for the rt and log device that can never have cached or delwri buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Tested-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:09 -05:00
Christoph Hellwig	1da2f2dbf2	xfs: optimize fsync on directories Directories are only updated transactionally, which means fsync only needs to flush the log the inode is currently dirty, but not bother with checking for dirty data, non-transactional updates, and most importanly doesn't have to flush disk caches except as part of a transaction commit. While the first two optimizations can't easily be measured, the latter actually makes a difference when doing lots of fsync that do not actually have to commit the inode, e.g. because an earlier fsync already pushed the log far enough. The new xfs_dir_fsync is identical to xfs_nfs_commit_metadata except for the prototype, but I'm not sure creating a common helper for the two is worth it given how simple the functions are. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:09 -05:00
Dave Chinner	670ce93fef	xfs: reduce the number of log forces from tail pushing The AIL push code will issue a log force on ever single push loop that it exits and has encountered pinned items. It doesn't rescan these pinned items until it revisits the AIL from the start. Hence we only need to force the log once per walk from the start of the AIL to the target LSN. This results in numbers like this: xs_push_ail_flush..... 1456 xs_log_force......... 1485 For an 8-way 50M inode create workload - almost all the log forces are coming from the AIL pushing code. Reduce the number of log forces by only forcing the log if the previous walk found pinned buffers. This reduces the numbers to: xs_push_ail_flush..... 665 xs_log_force......... 682 For the same test. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:09 -05:00
Dave Chinner	3815832a2a	xfs: Don't allocate new buffers on every call to _xfs_buf_find Stats show that for an 8-way unlink @ ~80,000 unlinks/s we are doing ~1 million cache hit lookups to ~3000 buffer creates. That's almost 3 orders of magnitude more cahce hits than misses, so optimising for cache hits is quite important. In the cache hit case, we do not need to allocate a new buffer in case of a cache miss, so we are effectively hitting the allocator for no good reason for vast the majority of calls to _xfs_buf_find. 8-way create workloads are showing similar cache hit/miss ratios. The result is profiles that look like this: samples pcnt function DSO _______ _____ _______________________________ _________________ 1036.00 10.0% _xfs_buf_find [kernel.kallsyms] 582.00 5.6% kmem_cache_alloc [kernel.kallsyms] 519.00 5.0% __memcpy [kernel.kallsyms] 468.00 4.5% __ticket_spin_lock [kernel.kallsyms] 388.00 3.7% kmem_cache_free [kernel.kallsyms] 331.00 3.2% xfs_log_commit_cil [kernel.kallsyms] Further, there is a fair bit of work involved in initialising a new buffer once a cache miss has occurred and we currently do that under the rbtree spinlock. That increases spinlock hold time on what are heavily used trees. To fix this, remove the initialisation of the buffer from _xfs_buf_find() and only allocate the new buffer once we've had a cache miss. Initialise the buffer immediately after allocating it in xfs_buf_get, too, so that is it ready for insert if we get another cache miss after allocation. This minimises lock hold time and avoids unnecessary allocator churn. The resulting profiles look like: samples pcnt function DSO _______ _____ ___________________________ _________________ 8111.00 9.1% _xfs_buf_find [kernel.kallsyms] 4380.00 4.9% __memcpy [kernel.kallsyms] 4341.00 4.8% __ticket_spin_lock [kernel.kallsyms] 3401.00 3.8% kmem_cache_alloc [kernel.kallsyms] 2856.00 3.2% xfs_log_commit_cil [kernel.kallsyms] 2625.00 2.9% __kmalloc [kernel.kallsyms] 2380.00 2.7% kfree [kernel.kallsyms] 2016.00 2.3% kmem_cache_free [kernel.kallsyms] Showing a significant reduction in time spent doing allocation and freeing from slabs (kmem_cache_alloc and kmem_cache_free). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:08 -05:00
Christoph Hellwig	ddc3415aba	xfs: simplify xfs_trans_ijoin* again There is no reason to keep a reference to the inode even if we unlock it during transaction commit because we never drop a reference between the ijoin and commit. Also use this fact to merge xfs_trans_ijoin_ref back into xfs_trans_ijoin - the third argument decides if an unlock is needed now. I'm actually starting to wonder if allowing inodes to be unlocked at transaction commit really is worth the effort. The only real benefit is that they can be unlocked earlier when commiting a synchronous transactions, but that could be solved by doing the log force manually after the unlock, too. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:08 -05:00
Christoph Hellwig	23bb0be1a2	xfs: unlock the inode before log force in xfs_change_file_space Let the transaction commit unlock the inode before it potentially causes a synchronous log force. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:08 -05:00
Christoph Hellwig	8292d88c5c	xfs: unlock the inode before log force in xfs_fs_nfs_commit_metadata Only read the LSN we need to push to with the ilock held, and then release it before we do the log force to improve concurrency. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:08 -05:00
Christoph Hellwig	b103705853	xfs: unlock the inode before log force in xfs_fsync Only read the LSN we need to push to with the ilock held, and then release it before we do the log force to improve concurrency. This also removes the only direct caller of _xfs_trans_commit, thus allowing it to be merged into the plain xfs_trans_commit again. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:08 -05:00
Christoph Hellwig	815cb21662	xfs: XFS_TRANS_SWAPEXT is not a valid flag for xfs_trans_commit XFS_TRANS_SWAPEXT is a transaction type, not a flag for xfs_trans_commit, so don't pass it in xfs_swap_extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:08 -05:00
Lukas Czerner	c029a50d51	xfs: fix possible overflow in xfs_ioc_trim() In xfs_ioc_trim it is possible that computing the last allocation group to discard might overflow for big start & len values, because the result might be bigger then xfs_agnumber_t which is 32 bit long. Fix this by not allowing the start and end block of the range to be beyond the end of the file system. Note that if the start is beyond the end of the file system we have to return -EINVAL, but in the "end" case we have to truncate it to the fs size. Also introduce "end" variable, rather than using start+len which which might be more confusing to get right as this bug shows. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:07 -05:00
Christoph Hellwig	d952e2f812	xfs: cleanup xfs_bmap.h Convert all function prototypes to the short form used elsewhere, and remove duplicates of comments already placed at the function body. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:07 -05:00
Christoph Hellwig	b0eab14e74	xfs: dont ignore error code from xfs_bmbt_update Fix a case in xfs_bmap_add_extent_unwritten_real where we aren't passing the returned error on. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:07 -05:00
Christoph Hellwig	c653424985	xfs: pass bmalloca to xfs_bmap_add_extent_hole_real All the parameters passed to xfs_bmap_add_extent_hole_real() are in the xfs_bmalloca structure now. Just pass the bmalloca parameter to the function instead of 8 separate parameters. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:07 -05:00
Christoph Hellwig	572a4cf04a	xfs: pass bmalloca to xfs_bmap_add_extent_delay_real All the parameters passed to xfs_bmap_add_extent_delay_real() are in the xfs_bmalloca structure now. Just pass the bmalloca parameter to the function instead of 8 separate parameters. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:07 -05:00
Christoph Hellwig	c315c90b7d	xfs: move logflags into bmalloca Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:06 -05:00
Dave Chinner	e0c3da5d89	xfs: move lastx and nallocs into bmalloca Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:06 -05:00
Dave Chinner	29c8d17a89	xfs: move btree cursor into bmalloca Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:06 -05:00
Dave Chinner	963c30cf45	xfs: do not keep local copies of allocation ranges in xfs_bmapi_allocate Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:06 -05:00
Dave Chinner	3a75667e90	xfs: rename allocation range fields in struct xfs_bmalloca Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:06 -05:00
Dave Chinner	0937e0fd8b	xfs: move firstblock and bmap freelist cursor into bmalloca structure Rather than passing the firstblock and freelist structure around, embed it into the bmalloca structure and remove it from the function parameters. This also enables the minleft parameter to be set only once in xfs_bmapi_write(), and the freelist cursor directly queried in xfs_bmapi_allocate to clear it when the lowspace algorithm is activated. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:05 -05:00
Dave Chinner	baf41a52b9	xfs: move extent records into bmalloca structure Rather that putting extent records on the stack and then pointing to them in the bmalloca structure which is in the same stack frame, put the extent records directly in the bmalloca structure. This reduces the number of args that need to be passed around. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:05 -05:00
Dave Chinner	1b16447ba2	xfs: pass bmalloca structure to xfs_bmap_isaeof All the variables xfs_bmap_isaeof() is passed are contained within the xfs_bmalloca structure. Pass that instead. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:05 -05:00
Christoph Hellwig	a5bd606ba6	xfs: remove xfs_bmap_add_extent There is no real need to the xfs_bmap_add_extent, as the callers know what kind of extents they need to it. Removing it means duplicating the extents to btree conversion logic in three places, but overall it's still much simpler code and quite a bit less code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:05 -05:00
Christoph Hellwig	27a3f8f2de	xfs: introduce xfs_bmap_last_extent Add a common helper for finding the last extent in a file. Largely based on a patch from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:05 -05:00
Dave Chinner	c0dc7828af	xfs: rename xfs_bmapi to xfs_bmapi_write Now that all the read-only users of xfs_bmapi have been converted to use xfs_bmapi_read(), we can remove all the read-only handling cases from xfs_bmapi(). Once this is done, rename xfs_bmapi to xfs_bmapi_write to reflect the fact it is for allocation only. This enables us to kill the XFS_BMAPI_WRITE flag as well. Also clean up xfs_bmapi_write to the style used in the newly added xfs_bmapi_read/delay functions. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:04 -05:00
Dave Chinner	b447fe5a05	xfs: factor unwritten extent map manipulations out of xfs_bmapi To further improve the readability of xfs_bmapi(), factor the unwritten extent conversion out into a separate function. This removes large block of logic from the xfs_bmapi() code loop and makes it easier to see the operational logic flow for xfs_bmapi(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:04 -05:00
Dave Chinner	7e47a4efde	xfs: factor extent allocation out of xfs_bmapi To further improve the readability of xfs_bmapi(), factor the extent allocation out into a separate function. This removes a large block of logic from the xfs_bmapi() code loop and makes it easier to see the operational logic flow for xfs_bmapi(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:04 -05:00
Christoph Hellwig	1fd044d9c6	xfs: do not use xfs_bmap_add_extent for adding delalloc extents We can just call xfs_bmap_add_extent_hole_delay directly to add a delayed allocated regions to the extent tree, instead of going through all the complexities of xfs_bmap_add_extent that aren't needed for this simple case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:04 -05:00
Christoph Hellwig	4403280aa5	xfs: introduce xfs_bmapi_delay() Delalloc reservations are much simpler than allocations, so give them a separate bmapi-level interface. Using the previously added xfs_bmapi_reserve_delalloc we get a function that is only minimally more complicated than xfs_bmapi_read, which is far from the complexity in xfs_bmapi. Also remove the XFS_BMAPI_DELAY code after switching over the only user to xfs_bmapi_delay. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:04 -05:00
Christoph Hellwig	b64dfe4e18	xfs: factor delalloc reservations out of xfs_bmapi Move the reservation of delayed allocations, and addition of delalloc regions to the extent trees into a new helper function. For now this adds some twisted goto logic to xfs_bmapi, but that will be cleaned up in the following patches. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:04 -05:00
Dave Chinner	5b777ad517	xfs: remove xfs_bmapi_single() Now we have xfs_bmapi_read, there is no need for xfs_bmapi_single(). Change the remaining caller over and kill the function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:03 -05:00
Dave Chinner	5c8ed2021f	xfs: introduce xfs_bmapi_read() xfs_bmapi() currently handles both extent map reading and allocation. As a result, the code is littered with "if (wr)" branches to conditionally do allocation operations if required. This makes the code much harder to follow and causes significant indent issues with the code. Given that read mapping is much simpler than allocation, we can split out read mapping from xfs_bmapi() and reuse the logic that we have already factored out do do all the hard work of handling the extent map manipulations. The results in a much simpler function for the common extent read operations, and will allow the allocation code to be simplified in another commit. Once xfs_bmapi_read() is implemented, convert all the callers of xfs_bmapi() that are only reading extents to use the new function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:03 -05:00
Dave Chinner	aef9a89586	xfs: factor extent map manipulations out of xfs_bmapi To further improve the readability of xfs_bmapi(), factor the pure extent map manipulations out into separate functions. This removes large blocks of logic from the xfs_bmapi() code loop and makes it easier to see the operational logic flow for xfs_bmapi(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:03 -05:00
Christoph Hellwig	ecee76ba9d	xfs: remove the nextents variable in xfs_bmapi Instead of using a local variable that needs to updated when we modify the extent map just check ifp->if_bytes directly where we use it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:03 -05:00
Christoph Hellwig	b9b984d784	xfs: remove impossible to read code in xfs_bmap_add_extent_delay_real We already have the worst case blocks reserved, so xfs_icsb_modify_counters won't fail in xfs_bmap_add_extent_delay_real. In fact we've had an assert to catch this case since day and it never triggered. So remove the code to try smaller reservations, and just return the error for that case in addition to keeping the assert. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:03 -05:00
Christoph Hellwig	e7455e02e5	xfs: remove the first extent special case in xfs_bmap_add_extent Both xfs_bmap_add_extent_hole_delay and xfs_bmap_add_extent_hole_real already contain code to handle the case where there is no extent to merge with, which is effectively the same as the code duplicated here. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:03 -05:00
Mitsuo Hayasaka	ed32201e65	xfs: Return -EIO when xfs_vn_getattr() failed An attribute of inode can be fetched via xfs_vn_getattr() in XFS. Currently it returns EIO, not negative value, when it failed. As a result, the system call returns not negative value even though an error occured. The stat(2), ls and mv commands cannot handle this error and do not work correctly. This patch fixes this bug, and returns -EIO, not EIO when an error is detected in xfs_vn_getattr(). Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:02 -05:00
Chandra Seetharaman	eabbaf1182	xfs: Fix the incorrect comment in the header of _xfs_buf_find Fix the incorrect comment in the header of the function _xfs_buf_find(). Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:02 -05:00
Chandra Seetharaman	2a30f36d90	xfs: Check the return value of xfs_trans_get_buf() Check the return value of xfs_trans_get_buf() and fail appropriately. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Chandra Seetharaman	b522950f0a	xfs: Check the return value of xfs_buf_get() Check the return value of xfs_buf_get() and fail appropriately. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Christoph Hellwig	04f658ee22	xfs: improve ioend error handling Return unwritten extent conversion errors to aio_complete. Skip both unwritten extent conversion and size updates if we had an I/O error or the filesystem has been shut down. Return -EIO to the aio/buffer completion handlers in case of a forced shutdown. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Christoph Hellwig	c58cb165bd	xfs: avoid direct I/O write vs buffered I/O race Currently a buffered reader or writer can add pages to the pagecache while we are waiting for the iolock in xfs_file_dio_aio_write. Prevent this by re-checking mapping->nrpages after we got the iolock, and if nessecary upgrade the lock to exclusive mode. To simplify this a bit only take the ilock inside of xfs_file_aio_write_checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Christoph Hellwig	859f57ca00	xfs: avoid synchronous transactions when deleting attr blocks Currently xfs_attr_inactive causes a synchronous transactions if we are removing a file that has any extents allocated to the attribute fork, and thus makes XFS extremely slow at removing files with out of line extended attributes. The code looks a like a relict from the days before the busy extent list, but with the busy extent list we avoid reusing data and attr extents that have been freed but not commited yet, so this code is just as superflous as the synchronous transactions for data blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Christoph Hellwig	4a06fd262d	xfs: remove i_iocount We now have an i_dio_count filed and surrounding infrastructure to wait for direct I/O completion instead of i_icount, and we have never needed to iocount waits for buffered I/O given that we only set the page uptodate after finishing all required work. Thus remove i_iocount, and replace the actually needed waits with calls to inode_dio_wait. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Christoph Hellwig	2b3ffd7eb7	xfs: wait for I/O completion when writing out pages in xfs_setattr_size The current code relies on the xfs_ioend_wait call later on to make sure all I/O actually has completed. The xfs_ioend_wait call will go away soon, so prepare for that by using the waiting filemap function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	fc0063c447	xfs: reduce ioend latency There is no reason to queue up ioends for processing in user context unless we actually need it. Just complete ioends that do not convert unwritten extents or need a size update from the end_io context. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	c859cdd1da	xfs: defer AIO/DIO completions We really shouldn't complete AIO or DIO requests until we have finished the unwritten extent conversion and size update. This means fsync never has to pick up any ioends as all work has been completed when signalling I/O completion. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	398d25ef23	xfs: remove dead ENODEV handling in xfs_destroy_ioend No driver returns ENODEV from it bio completion handler, not has this ever been documented. Remove the dead code dealing with it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	c4e1c098ee	xfs: use the "delwri" terminology consistently And also remove the strange local lock and delwri list pointers in a few functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	c2b006c1da	xfs: let xfs_bwrite callers handle the xfs_buf_relse Remove the xfs_buf_relse from xfs_bwrite and let the caller handle it to mirror the delwri and read paths. Also remove the mount pointer passed to xfs_bwrite, which is superflous now that we have a mount pointer in the buftarg. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	61551f1ee5	xfs: call xfs_buf_delwri_queue directly Unify the ways we add buffers to the delwri queue by always calling xfs_buf_delwri_queue directly. The xfs_bdwrite functions is removed and opencoded in its callers, and the two places setting XBF_DELWRI while a buffer is locked and expecting xfs_buf_unlock to pick it up are converted to call xfs_buf_delwri_queue directly, too. Also replace the XFS_BUF_UNDELAYWRITE macro with direct calls to xfs_buf_delwri_dequeue to make the explicit queuing/dequeuing more obvious. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:14:59 -05:00
Christoph Hellwig	5a8ee6bafd	xfs: move more delwri setup into xfs_buf_delwri_queue Do not transfer a reference held by the caller to the buffer on the list, or decrement it in xfs_buf_delwri_queue, but instead grab a new reference if needed, and let the caller drop its own reference. Also move setting of the XBF_DELWRI and XBF_ASYNC flags into xfs_buf_delwri_queue, and only do it if needed. Note that for now xfs_buf_unlock already has XBF_DELWRI, but that will change in the following patches. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:14:59 -05:00
Christoph Hellwig	527cfdf19d	xfs: remove the unlock argument to xfs_buf_delwri_queue We can just unlock the buffer in the caller, and the decrement of b_hold would also be needed in the !unlock, we just never hit that case currently given that the caller handles that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:14:59 -05:00
Christoph Hellwig	375ec69d2e	xfs: remove delwri buffer handling from xfs_buf_iorequest We cannot ever reach xfs_buf_iorequest for a buffer with XBF_DELWRI set, given that all write handlers make sure that the buffer is remove from the delwri queue before, and we never do reads with the XBF_DELWRI flag set (which the code would not handle correctly anyway). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:14:59 -05:00
Dave Chinner	7271d243f9	xfs: don't serialise adjacent concurrent direct IO appending writes For append write workloads, extending the file requires a certain amount of exclusive locking to be done up front to ensure sanity in things like ensuring that we've zeroed any allocated regions between the old EOF and the start of the new IO. For single threads, this typically isn't a problem, and for large IOs we don't serialise enough for it to be a problem for two threads on really fast block devices. However for smaller IO and larger thread counts we have a problem. Take 4 concurrent sequential, single block sized and aligned IOs. After the first IO is submitted but before it completes, we end up with this state: IO 1 IO 2 IO 3 IO 4 +-------+-------+-------+-------+ ^ ^ \| \| \| \| \| \| \| \- ip->i_new_size \- ip->i_size And the IO is done without exclusive locking because offset <= ip->i_size. When we submit IO 2, we see offset > ip->i_size, and grab the IO lock exclusive, because there is a chance we need to do EOF zeroing. However, there is already an IO in progress that avoids the need for IO zeroing because offset <= ip->i_new_size. hence we could avoid holding the IO lock exlcusive for this. Hence after submission of the second IO, we'd end up this state: IO 1 IO 2 IO 3 IO 4 +-------+-------+-------+-------+ ^ ^ \| \| \| \| \| \| \| \- ip->i_new_size \- ip->i_size There is no need to grab the i_mutex of the IO lock in exclusive mode if we don't need to invalidate the page cache. Taking these locks on every direct IO effective serialises them as taking the IO lock in exclusive mode has to wait for all shared holders to drop the lock. That only happens when IO is complete, so effective it prevents dispatch of concurrent direct IO writes to the same inode. And so you can see that for the third concurrent IO, we'd avoid exclusive locking for the same reason we avoided the exclusive lock for the second IO. Fixing this is a bit more complex than that, because we need to hold a write-submission local value of ip->i_new_size to that clearing the value is only done if no other thread has updated it before our IO completes..... Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:14:59 -05:00
Dave Chinner	0c38a2512d	xfs: don't serialise direct IO reads on page cache checks There is no need to grab the i_mutex of the IO lock in exclusive mode if we don't need to invalidate the page cache. Taking these locks on every direct IO effective serialises them as taking the IO lock in exclusive mode has to wait for all shared holders to drop the lock. That only happens when IO is complete, so effective it prevents dispatch of concurrent direct IO reads to the same inode. Fix this by taking the IO lock shared to check the page cache state, and only then drop it and take the IO lock exclusively if there is work to be done. Hence for the normal direct IO case, no exclusive locking will occur. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Joern Engel <joern@logfs.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:14:59 -05:00
Christoph Hellwig	0030807c66	xfs: revert to using a kthread for AIL pushing Currently we have a few issues with the way the workqueue code is used to implement AIL pushing: - it accidentally uses the same workqueue as the syncer action, and thus can be prevented from running if there are enough sync actions active in the system. - it doesn't use the HIGHPRI flag to queue at the head of the queue of work items At this point I'm not confident enough in getting all the workqueue flags and tweaks right to provide a perfectly reliable execution context for AIL pushing, which is the most important piece in XFS to make forward progress when the log fills. Revert back to use a kthread per filesystem which fixes all the above issues at the cost of having a task struct and stack around for each mounted filesystem. In addition this also gives us much better ways to diagnose any issues involving hung AIL pushing and removes a small amount of code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 11:02:49 -05:00
Christoph Hellwig	17b38471c3	xfs: force the log if we encounter pinned buffers in .iop_pushbuf We need to check for pinned buffers even in .iop_pushbuf given that inode items flush into the same buffers that may be pinned directly due operations on the unlinked inode list operating directly on buffers. To do this add a return value to .iop_pushbuf that tells the AIL push about this and use the existing log force mechanisms to unpin it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 11:02:48 -05:00
Christoph Hellwig	bc6e588a89	xfs: do not update xa_last_pushed_lsn for locked items If an item was locked we should not update xa_last_pushed_lsn and thus skip it when restarting the AIL scan as we need to be able to lock and write it out as soon as possible. Otherwise heavy lock contention might starve AIL pushing too easily, especially given the larger backoff once we moved xa_last_pushed_lsn all the way to the target lsn. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 11:02:48 -05:00
Jiri Kosina	e060c38434	Merge branch 'master' into for-next Fast-forward merge with Linus to be able to merge patches based on more recent version of the tree.	2011-09-15 15:08:18 +02:00
Joe Perches	558feb0818	fs: Convert vmalloc/memset to vzalloc Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Alex Elder <aelder@sgi.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-09-15 13:56:28 +02:00
Christoph Hellwig	2d2422aebc	xfs: fix a use after free in xfs_end_io_direct_write There is a window in which the ioend that we call inode_dio_wake on in xfs_end_io_direct_write is already free. Fix this by storing the inode pointer in a local variable. This is a fix for the regression introduced in 3.1-rc by "fs: move inode_dio_done to the end_io handler". Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-09-14 08:56:35 -05:00
Christoph Hellwig	58d84c4ee0	xfs: fix ->write_inode return values Currently we always redirty an inode that was attempted to be written out synchronously but has been cleaned by an AIL pushed internall, which is rather bogus. Fix that by doing the i_update_core check early on and return 0 for it. Also include async calls for it, as doing any work for those is just as pointless. While we're at it also fix the sign for the EIO return in case of a filesystem shutdown, and fix the completely non-sensical locking around xfs_log_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com> (cherry picked from commit 297db93bb74cf687510313eb235a7aec14d67e97) Signed-off-by: Alex Elder <aelder@sgi.com>	2011-09-01 09:46:11 -05:00
Christoph Hellwig	866e4ed774	xfs: fix xfs_mark_inode_dirty during umount During umount we do not add a dirty inode to the lru and wait for it to become clean first, but force writeback of data and metadata with I_WILL_FREE set. Currently there is no way for XFS to detect that the inode has been redirtied for metadata operations, as we skip the mark_inode_dirty call during teardown. Fix this by setting i_update_core nanually in that case, so that the inode gets flushed during inode reclaim. Alternatively we could enable calling mark_inode_dirty for inodes in I_WILL_FREE state, and let the VFS dirty tracking handle this. I decided against this as we will get better I/O patterns from reclaim compared to the synchronous writeout in write_inode_now, and always marking the inode dirty in some way from xfs_mark_inode_dirty is a better safetly net in either case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com> (cherry picked from commit da6742a5a4cc844a9982fdd936ddb537c0747856) Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-31 17:59:39 -05:00
Christoph Hellwig	242d621964	xfs: deprecate the nodelaylog mount option Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-25 10:30:05 -05:00
Christoph Hellwig	b6bede3b4c	xfs: fix tracing builds inside the source tree The code really requires the current source directory to be in the header search path. We already do this if building with an object tree separate from the source, but it needs to be added manually if building inside the source. The cflags addition for it accidentally got removed when collapsing the xfs directory structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-22 16:37:24 -05:00
Christoph Hellwig	c59d87c460	xfs: remove subdirectories Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the annoying subdirectories in the XFS source code. Besides the large amount of file rename the only changes are to the Makefile, a few files including headers with the subdirectory prefix, and the binary sysctl compat code that includes a header under fs/xfs/ from kernel/. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-12 16:21:35 -05:00
Alex Elder	06f8e2d675	xfs: don't expect xfs headers to be in subdirectories Fix up some #include directives in preparation for moving a few header files out of xfs source subdirectories. Note that "xfs_linux.h" also got its quoting convention for included files switched. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-08-12 13:57:55 -05:00
Chandra Seetharaman	e570280521	xfs: replace xfs_buf_geterror() with bp->b_error Since we just checked bp for NULL, it is ok to replace xfs_buf_geterror() with bp->b_error in these places. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-12 13:39:40 -05:00
Chandra Seetharaman	ac4d6888b2	xfs: Check the return value of xfs_buf_read() for NULL Check the return value of xfs_buf_read() for NULL and return ENOMEM if it is NULL. This is necessary in a few spots to avoid subsequent code blindly dereferencing the null buffer pointer. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-12 13:39:29 -05:00
Ajeet Yadav	9e978d8f7d	"xfs: fix error handling for synchronous writes" revisited xfs: fix for hang during synchronous buffer write error If removed storage while synchronous buffer write underway, "xfslogd" hangs. Detailed log http://oss.sgi.com/archives/xfs/2011-07/msg00740.html Related work `bfc60177f8` "xfs: fix error handling for synchronous writes" Given that xfs_bwrite actually does the shutdown already after waiting for the b_iodone completion and given that we actually found that calling xfs_force_shutdown from inside xfs_buf_iodone_callbacks was a major contributor the problem it better to drop this call. Signed-off-by: Ajeet Yadav <ajeet.yadav.77@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-10 17:00:21 -05:00
Alex Elder	e44f4112a4	xfs: set cursor in xfs_ail_splice() even when AIL was empty In xfs_ail_splice(), if a cursor is provided it is updated to point to the last item on the list being spliced into the AIL. But if the AIL was found to be empty, the cursor (if provided) is just initialized instead. There is no reason the empty AIL case needs to be treated any differently. And treating it the same way allows this code to be rearranged a bit, with a somewhat tidier result. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-08-09 15:30:43 -05:00
James Morris	5a2f3a02ae	Merge branch 'next-evm' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/ima-2.6 into next Conflicts: fs/attr.c Resolve conflict manually. Signed-off-by: James Morris <jmorris@namei.org>	2011-08-09 10:31:03 +10:00
Alex Elder	2ddb4e9406	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux	2011-08-08 07:06:24 -05:00
Linus Torvalds	1b8e94993c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: xfs: Fix build breakage in xfs_iops.c when CONFIG_FS_POSIX_ACL is not set VFS: Reorganise shrink_dcache_for_umount_subtree() after demise of dcache_lock VFS: Remove dentry->d_lock locking from shrink_dcache_for_umount_subtree() VFS: Remove detached-dentry counter from shrink_dcache_for_umount_subtree() switch posix_acl_chmod() to umode_t switch posix_acl_from_mode() to umode_t switch posix_acl_equiv_mode() to umode_t * switch posix_acl_create() to umode_t * block: initialise bd_super in bdget() vfs: avoid call to inode_lru_list_del() if possible vfs: avoid taking inode_hash_lock on pipes and sockets vfs: conditionally call inode_wb_list_del() VFS: Fix automount for negative autofs dentries Btrfs: load the key from the dir item in readdir into a fake dentry devtmpfs: missing initialialization in never-hit case hppfs: missing include	2011-08-01 13:48:31 -10:00
Markus Trippelsdorf	206d440f64	xfs: Fix build breakage in xfs_iops.c when CONFIG_FS_POSIX_ACL is not set commit `4e34e719e4`, that takes the ACL checks to common code, accidentely broke the build when CONFIG_FS_POSIX_ACL is not set: CC fs/xfs/linux-2.6/xfs_iops.o fs/xfs/linux-2.6/xfs_iops.c:1025:14: error: ‘xfs_get_acl’ undeclared here (not in a function) Fix this by declaring xfs_get_acl a static inline function. Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 02:35:04 -04:00
Al Viro	d6952123b5	switch posix_acl_equiv_mode() to umode_t * ... so that &inode->i_mode could be passed to it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 02:10:06 -04:00
Al Viro	d3fb612076	switch posix_acl_create() to umode_t * so we can pass &inode->i_mode to it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 02:09:42 -04:00
Markus Trippelsdorf	a5a7bbcc01	xfs: Fix build breakage in xfs_iops.c when CONFIG_FS_POSIX_ACL is not set commit `4e34e719e4`, that takes the ACL checks to common code, accidentely broke the build when CONFIG_FS_POSIX_ACL is not set: CC fs/xfs/linux-2.6/xfs_iops.o fs/xfs/linux-2.6/xfs_iops.c:1025:14: error: ‘xfs_get_acl’ undeclared here (not in a function) Fix this by declaring xfs_get_acl a static inline function. Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-29 12:26:14 -05:00
Linus Torvalds	597a67e0ba	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: optimize the negative xattr caching xfs: prevent against ioend livelocks in xfs_file_fsync xfs: flag all buffers as metadata xfs: encapsulate a block of debug code	2011-07-27 13:41:51 -07:00
Christoph Hellwig	510792ee29	xfs: optimize the negative xattr caching Since the addition of file capabilities every write needs to read xattrs to check if we have any capabilities to clear. In Linux 3.0 Andi Kleen added a flag to cache the fact that we do not have any attributes on an inode. Make sure to already mark a file as not having any attributes when reading it from disk in case it doesn't even have an attribute fork. Based on an earlier patch from Andi Kleen. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-26 22:06:50 -05:00
Christoph Hellwig	d1166ec792	xfs: prevent against ioend livelocks in xfs_file_fsync We need to take some locks to prevent new ioends from coming in when we wait for all existing ones to go away. Up to Linux 3.0 that was done using the i_mutex held by the VFS fsync code, but now that we are called without it we need to take care of it ourselves. Use the I/O lock instead of i_mutex just like we do in other places. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-26 22:06:39 -05:00
Christoph Hellwig	34951f5cb7	xfs: flag all buffers as metadata Now that REQ_META bios aren't treated specially in the CFQ I/O schedule anymore, we can tag all buffers as metadata to make blktrace traces more meaningful. Note that we use buffers also to zero out partial blocks in the preallocation / hole punching code, and while they operate on data blocks the zeros written certainly aren't data. I think this case is borderline metadata enough to not bother special casing it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-26 22:05:48 -05:00
Alex Elder	1c4f33296e	xfs: encapsulate a block of debug code Pull into a helper function some debug-only code that validates a xfs_da_blkinfo structure that's been read from disk. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@infradead.org>	2011-07-26 22:05:34 -05:00
Al Viro	03209378b4	xfs: fix misspelled S_IS...() mode_t is not a bitmap... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-26 15:05:30 -04:00
Al Viro	abbede1b3a	xfs: get rid of open-coded S_ISREG(), etc. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-26 15:05:16 -04:00
Linus Torvalds	d3ec4844d4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) fs: Merge split strings treewide: fix potentially dangerous trailing ';' in #defined values/expressions uwb: Fix misspelling of neighbourhood in comment net, netfilter: Remove redundant goto in ebt_ulog_packet trivial: don't touch files that are removed in the staging tree lib/vsprintf: replace link to Draft by final RFC number doc: Kconfig: `to be' -> `be' doc: Kconfig: Typo: square -> squared doc: Konfig: Documentation/power/{pm => apm-acpi}.txt drivers/net: static should be at beginning of declaration drivers/media: static should be at beginning of declaration drivers/i2c: static should be at beginning of declaration XTENSA: static should be at beginning of declaration SH: static should be at beginning of declaration MIPS: static should be at beginning of declaration ARM: static should be at beginning of declaration rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check Update my e-mail address PCIe ASPM: forcedly -> forcibly gma500: push through device driver tree ... Fix up trivial conflicts: - arch/arm/mach-ep93xx/dma-m2p.c (deleted) - drivers/gpio/gpio-ep93xx.c (renamed and context nearby) - drivers/net/r8169.c (just context changes)	2011-07-25 13:56:39 -07:00
Chandra Seetharaman	c35a549c8b	xfs: Remove the macro XFS_BUFTARG_NAME Remove the definition and usages of the macro XFS_BUFTARG_NAME. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:37 -05:00
Chandra Seetharaman	49074c069c	xfs: Remove the macro XFS_BUF_TARGET Remove the definition and usages of the macro XFS_BUF_TARGET Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:31 -05:00
Chandra Seetharaman	e38c9b87e5	xfs: Remove the macro XFS_BUF_SET_TARGET Remove the macro XFS_BUF_SET_TARGET. hch: As all the buffer allocator already set ->b_target it should be safe to simply remove these calls. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:26 -05:00
Chandra Seetharaman	811e64c716	Replace the macro XFS_BUF_ISPINNED with helper xfs_buf_ispinned Replace the macro XFS_BUF_ISPINNED with an inline helper function xfs_buf_ispinned, and change all its usages. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:22 -05:00
Chandra Seetharaman	02fe03d909	xfs: Remove the macro XFS_BUF_SET_PTR Remove the definition and usages of the macro XFS_BUF_SET_PTR. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:17 -05:00
Chandra Seetharaman	6292604447	xfs: Remove the macro XFS_BUF_PTR Remove the definition and usages of the macro XFS_BUF_PTR. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:13 -05:00
Chandra Seetharaman	0095a21eb6	xfs: Remove macro XFS_BUF_SET_START Remove the definition and usage of the macro XFS_BUF_SET_START. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:09 -05:00
Chandra Seetharaman	72790aa119	xfs: Remove macro XFS_BUF_HOLD Remove the definition and usage of the macro XFS_BUF_HOLD Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:06 -05:00
Chandra Seetharaman	b75e40a419	xfs: Remove macro XFS_BUF_BUSY and family Remove the definitions and uses of the macros XFS_BUF_BUSY, XFS_BUF_UNBUSY, and XFS_BUF_ISBUSY. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:00 -05:00
Chandra Seetharaman	5a52c2a581	xfs: Remove the macro XFS_BUF_ERROR and family Remove the definitions and usage of the macros XFS_BUF_ERROR, XFS_BUF_GETERROR and XFS_BUF_ISERROR. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 14:57:46 -05:00
Chandra Seetharaman	ed43233be9	xfs: Remove the macro XFS_BUF_BFLAGS Remove the definition of the macro XFS_BUF_BFLAGS and its usage. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 14:57:36 -05:00
Christoph Hellwig	4e34e719e4	fs: take the ACL checks to common code Replace the ->check_acl method with a ->get_acl method that simply reads an ACL from disk after having a cache miss. This means we can replace the ACL checking boilerplate code with a single implementation in namei.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:30:23 -04:00
Al Viro	826cae2f2b	kill boilerplates around posix_acl_create_masq() new helper: posix_acl_create(&acl, gfp, mode_p). Replaces acl with modified clone, on failure releases acl and replaces with NULL. Returns 0 or -ve on error. All callers of posix_acl_create_masq() switched. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:27:32 -04:00
Al Viro	bc26ab5f65	kill boilerplate around posix_acl_chmod_masq() new helper: posix_acl_chmod(&acl, gfp, mode). Replaces acl with modified clone or with NULL if that has failed; returns 0 or -ve on error. All callers of posix_acl_chmod_masq() switched to that - they'd been doing exactly the same thing. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:27:30 -04:00
Christoph Hellwig	6311b10800	xfs: cache negative ACLs if there is no attribute fork Always set up a negative ACL cache entry if the inode doesn't have an attribute fork. That behaves much better than doing this check inside ->check_acl. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:25:38 -04:00
Linus Torvalds	e77819e57f	vfs: move ACL cache lookup into generic code This moves logic for checking the cached ACL values from low-level filesystems into generic code. The end result is a streamlined ACL check that doesn't need to load the inode->i_op->check_acl pointer at all for the common cached case. The filesystems also don't need to check for a non-blocking RCU walk case in their acl_check() functions, because that is all handled at a VFS layer. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:23:39 -04:00
Markus Trippelsdorf	340a0a01b9	xfs: Fix wrong return value of xfs_file_aio_write The fsync prototype change commit `02c24a8218` accidentally overwrote the ssize_t return value of xfs_file_aio_write with 0 for SYNC type writes. Fix this by checking if an error occured when calling xfs_file_fsync and only change the return value in this case. In addition xfs_file_fsync actually returns a normal negative error, so fix this, too. Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:23:21 -04:00
Linus Torvalds	bbd9d6f7fb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits) vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp isofs: Remove global fs lock jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory fix IN_DELETE_SELF on overwriting rename() on ramfs et.al. mm/truncate.c: fix build for CONFIG_BLOCK not enabled fs:update the NOTE of the file_operations structure Remove dead code in dget_parent() AFS: Fix silly characters in a comment switch d_add_ci() to d_splice_alias() in "found negative" case as well simplify gfs2_lookup() jfs_lookup(): don't bother with . or .. get rid of useless dget_parent() in btrfs rename() and link() get rid of useless dget_parent() in fs/btrfs/ioctl.c fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers drivers: fix up various ->llseek() implementations fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek Ext4: handle SEEK_HOLE/SEEK_DATA generically Btrfs: implement our own ->llseek fs: add SEEK_HOLE and SEEK_DATA flags reiserfs: make reiserfs default to barrier=flush ... Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new shrinker callout for the inode cache, that clashed with the xfs code to start the periodic workers later.	2011-07-22 19:02:39 -07:00
Jean Delvare	df2e301fee	fs: Merge split strings No idea why these were split in the first place... Signed-off-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-07-22 16:47:15 +02:00
Josef Bacik	02c24a8218	fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:59 -04:00
Christoph Hellwig	72c5052ddc	fs: move inode_dio_done to the end_io handler For filesystems that delay their end_io processing we should keep our i_dio_count until the the processing is done. Enable this by moving the inode_dio_done call to the end_io handler if one exist. Note that the actual move to the workqueue for ext4 and XFS is not done in this patch yet, but left to the filesystem maintainers. At least for XFS it's not needed yet either as XFS has an internal equivalent to i_dio_count. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:50 -04:00
Dave Chinner	8daaa83145	xfs: make use of new shrinker callout for the inode cache Convert the inode reclaim shrinker to use the new per-sb shrinker operations. This allows much bigger reclaim batches to be used, and allows the XFS inode cache to be shrunk in proportion with the VFS dentry and inode caches. This avoids the problem of the VFS caches being shrunk significantly before the XFS inode cache is shrunk resulting in imbalances in the caches during reclaim. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:42 -04:00
Dave Chinner	55fb25d5b3	xfs: add size update tracepoint to IO completion For improving insight into IO completion behaviour. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-20 18:38:04 -05:00
Dave Chinner	af3e40228f	xfs: convert AIL cursors to use struct list_head The list of active AIL cursors uses a roll-your-own linked list with special casing for the AIL push cursor. Simplify this code by replacing the list with standard struct list_head lists, and use a separate list_head to track the active cursors. This allows us to treat the AIL push cursor as a generic cursor rather than as a special case, further simplifying the code. Further, fix the duplicate push cursor initialisation that the special case handling was hiding, and clean up all the comments around the active cursor list handling. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-20 18:37:46 -05:00
Dave Chinner	16b5902943	xfs: remove confusing ail cursor wrapper xfs_trans_ail_cursor_set() doesn't set the cursor to the current log item, it sets it to the next item. There is already a function for doing this - xfs_trans_ail_cursor_next() - and the _set function is simply a two line wrapper. Remove it and open code the setting of the cursor in the two locations that call it to remove the confusion. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-20 18:37:37 -05:00
Dave Chinner	1d8c95a363	xfs: use a cursor for bulk AIL insertion Delayed logging can insert tens of thousands of log items into the AIL at the same LSN. When the committing of log commit records occur, we can get insertions occurring at an LSN that is not at the end of the AIL. If there are thousands of items in the AIL on the tail LSN, each insertion has to walk the AIL to find the correct place to insert the new item into the AIL. This can consume large amounts of CPU time and block other operations from occurring while the traversals are in progress. To avoid this repeated walk, use a AIL cursor to record where we should be inserting the new items into the AIL without having to repeat the walk. The cursor infrastructure already provides this functionality for push walks, so is a simple extension of existing code. While this will not avoid the initial walk, it will avoid repeating it tens of thousands of times during a single checkpoint commit. This version includes logic improvements from Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-20 18:37:20 -05:00
J. Bruce Fields	ad1a2c878c	xfs: failure mapping nfs fh to inode should return ESTALE On xfs exports, nfsd is incorrectly returning ENOENT instead of ESTALE on attempts to use a filehandle of a deleted file (spotted with pynfs test PUTFH3). The ENOENT was coming from xfs_iget. (It's tempting to wonder whether we should just map all xfs_iget errors to ESTALE, but I don't believe so--xfs_iget can also return ENOMEM at least, which we wouldn't want mapped to ESTALE.) While we're at it, the other return of ENOENT in xfs_nfs_get_inode() also looks wrong. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-20 18:35:21 -05:00
Chandra Seetharaman	adab0f67d1	xfs: Remove the second parameter to xfs_sb_count() Remove the second parameter to xfs_sb_count() since all callers of the function set them. Also, fix the header comment regarding it being called periodically. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-20 18:35:03 -05:00
Al Viro	7e40145eb1	->permission() sanitizing: don't pass flags to ->check_acl() not used in the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:21 -04:00
Al Viro	9c2c703929	->permission() sanitizing: pass MAY_NOT_BLOCK to ->check_acl() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:19 -04:00
Mimi Zohar	9d8f13ba3f	security: new security_inode_init_security API adds function callback This patch changes the security_inode_init_security API by adding a filesystem specific callback to write security extended attributes. This change is in preparation for supporting the initialization of multiple LSM xattrs and the EVM xattr. Initially the callback function walks an array of xattrs, writing each xattr separately, but could be optimized to write multiple xattrs at once. For existing security_inode_init_security() calls, which have not yet been converted to use the new callback function, such as those in reiserfs and ocfs2, this patch defines security_old_inode_init_security(). Signed-off-by: Mimi Zohar <zohar@us.ibm.com>	2011-07-18 12:29:38 -04:00
Christoph Hellwig	d0f9e8fb4c	xfs: remove the dead XFS_DABUF_DEBUG code Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:50 +02:00
Christoph Hellwig	c84470dda7	xfs: remove leftovers of the old btree tracing code Remove various bits left over from the old kdb-only btree tracing code, but leave the actual trace point stubs in place to ease adding new event based btree tracing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:50 +02:00
Christoph Hellwig	ea15ab3cdd	xfs: remove the dead QUOTADEBUG code Remove the dead hash table test rid which has been rotting away under QUOTADEBUG, including some code that was compiled for normal debug builds, but not actually called without QUOTADEBUG, and enable a few cheap debug checks that were hidden under QUOTADEBUG for normal debug builds. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:50 +02:00
Christoph Hellwig	54244fec67	xfs: remove the unused xfs_buf_delwri_sort function Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:49 +02:00
Christoph Hellwig	cb669ca570	xfs: remove wrappers around b_iodone Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:49 +02:00
Christoph Hellwig	adadbeefb3	xfs: remove wrappers around b_fspriv Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:49 +02:00
Christoph Hellwig	bf9d9013a2	xfs: add a proper transaction pointer to struct xfs_buf Replace the typeless b_fspriv2 and the ugly macros around it with a properly typed transaction pointer. As a fallout the log buffer state debug checks are also removed. We could have kept them using casts, but as they do not have a real purpose we can as well just remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:49 +02:00
Christoph Hellwig	77936d0280	xfs: factor out xfs_da_grow_inode_int xfs_da_grow_inode and xfs_dir2_grow_inode are mostly duplicate code. Factor the meat of those two functions into a new common helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:49 +02:00
Christoph Hellwig	a230a1df40	xfs: factor out xfs_dir2_leaf_find_stale Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:48 +02:00
Christoph Hellwig	a00b7745c6	xfs: cleanup struct xfs_dir2_free Change the bests array to be a proper variable sized entry. This is done easily as no one relies on the size of the structure. Also change XFS_DIR2_MAX_FREE_BESTS to an inline function while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:48 +02:00
Christoph Hellwig	5792664070	xfs: reshuffle dir2 headers Replace the current mess of dir2 headers with just three that have a clear purpose: - xfs_dir2_format.h for all format definitions, including the inline helpers to access our variable size structures - xfs_dir2_priv.h for all prototypes that are internal to the dir2 code and not needed by anything outside of the directory code. For this purpose xfs_da_btree.c, and phase6.c in xfs_repair are considered part of the directory code. - xfs_dir2.h for the public interface to the directory code In addition to the reshuffle I have also update the comments to not only match the new file structure, but also to describe the directory format better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:48 +02:00
Christoph Hellwig	2bcf6e970f	xfs: start periodic workers later Start the periodic sync workers only after we have finished xfs_mountfs and thus fully set up the filesystem structures. Without this we can call into xfs_qm_sync before the quotainfo strucute is set up if the mount takes unusually long, and probably hit other incomplete states as well. Also clean up the xfs_fs_fill_super error path by using consistent label names, and removing an impossible to reach case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-07-13 13:43:48 +02:00
Alex Elder	b2ce397400	Revert "xfs: fix filesystsem freeze race in xfs_trans_alloc" This reverts commit `7a249cf83d`. That commit created a situation that could lead to a filesystem hang. As Dave Chinner pointed out, xfs_trans_alloc() could hold a reference to m_active_trans (i.e., keep it non-zero) and then wait for SB_FREEZE_TRANS to complete. Meanwhile a filesystem freeze request could set SB_FREEZE_TRANS and then wait for m_active_trans to drop to zero. Nobody benefits from this sequence of events... Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-11 10:21:03 -05:00
Chandra Seetharaman	81463b1ca8	xfs: remove variables that serve no purpose in xfs_alloc_ag_vextent_exact() Remove two variables that serve no purpose in xfs_alloc_ag_vextent_exact(). Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-08 11:32:51 -05:00
Eric Sandeen	c0e090ced2	xfs: consolidate & clarify mount sanity checks Pavol pointed out that there is one silent error case in the mount path, and that others are rather uninformative. I've taken Pavol's suggested patch and extended it a bit to also: * fix a message which says "turned off" but actually errors out * consolidate the vaguely differentiated "SB sanity check [12]" messages, and hexdump the superblock for analysis Original-patch-by: Pavol Gono <Pavol.Gono@siemens.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-08 11:32:51 -05:00
Christoph Hellwig	e163cbde98	xfs: avoid a few disk cache flushes There is no need for a pre-flush when doing writing the second part of a split log buffer, and if we are using an external log there is no need to do a full cache flush of the log device at all given that all writes to it use the FUA flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:36:36 +02:00
Christoph Hellwig	1d5ae5dfee	xfs: cleanup I/O-related buffer flags Remove the unused and misnamed _XBF_RUN_QUEUES flag, rename XBF_LOG_BUFFER to the more fitting XBF_SYNCIO, and split XBF_ORDERED into XBF_FUA and XBF_FLUSH to allow more fine grained control over the bio flags. Also cleanup processing of the flags in _xfs_buf_ioapply to make more sense, and renumber the sparse flag number space to group flags by purpose. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:36:32 +02:00
Christoph Hellwig	c8da0faf6b	xfs: return the buffer locked from xfs_buf_get_uncached All other xfs_buf_get/read-like helpers return the buffer locked, make sure xfs_buf_get_uncached isn't different for no reason. Half of the callers already lock it directly after, and the others probably should also keep it locked if only for consistency and beeing able to use xfs_buf_rele, but I'll leave that for later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:36:25 +02:00
Christoph Hellwig	0c842ad46a	xfs: clean up buffer locking helpers Rename xfs_buf_cond_lock and reverse it's return value to fit most other trylock operations in the Kernel and XFS (with the exception of down_trylock, after which xfs_buf_cond_lock was modelled), and replace xfs_buf_lock_val with an xfs_buf_islocked for use in asserts, or and opencoded variant in tracing. remove the XFS_BUF_* wrappers for all the locking helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:36:19 +02:00
Christoph Hellwig	bbb4197c73	xfs: remove the unused xfs_bufhash structure Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:36:10 +02:00
Christoph Hellwig	69ef921b55	xfs: byteswap constants instead of variables Micro-optimize various comparisms by always byteswapping the constant instead of the variable, which allows to do the swap at compile instead of runtime. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:36:05 +02:00
Christoph Hellwig	218106a110	xfs: use generic get_unaligned_beXX helpers Switch the shortform directory code over to use the generic get_unaligned_beXX helpers instead of reinventing them. As a result kill off xfs_arch.h and move the setting of XFS_NATIVE_HOST into xfs_linux.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:58 +02:00
Christoph Hellwig	2282396d81	xfs: cleanup struct xfs_dir2_leaf Simplify the confusing xfs_dir2_leaf structure. It is supposed to describe an XFS dir2 leaf format btree block, but due to the variable sized nature of almost all elements in it it can't actuall do anything close to that job. Remove the members that are after the first variable sized array, given that they could only be used for sizeof expressions that can as well just use the underlying types directly, and make the ents array a real C99 variable sized array. Also factor out the xfs_dir2_leaf_size, to make the sizing of a leaf entry which already was convoluted somewhat readable after using the longer type names in the sizeof expressions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:53 +02:00
Christoph Hellwig	3ed8638f88	xfs: cleanup the definition of struct xfs_dir2_data_entry Remove the tag member which is at a variable offset after the actual name, and make name a real variable sized C99 array instead of the incorrect one-sized array which confuses (not only) gcc. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:50 +02:00
Christoph Hellwig	0ba9cd84ef	xfs: kill struct xfs_dir2_data Remove the confusing xfs_dir2_data structure. It is supposed to describe an XFS dir2 data btree block, but due to the variable sized nature of almost all elements in it it can't actuall do anything close to that job. In addition to accessing the fixed offset header structure it was only used to get a pointer to the first dir or unused entry after it, which can be trivially replaced by pointer arithmetics on the header pointer. For most users that is actually more natural anyway, as they don't use a typed pointer but rather a character pointer for further arithmetics. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:42 +02:00
Christoph Hellwig	c2066e2662	xfs: avoid usage of struct xfs_dir2_data In most places we can simply pass around and use the struct xfs_dir2_data_hdr, which is the first and most important member of struct xfs_dir2_data instead of the full structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:38 +02:00
Christoph Hellwig	a64b041797	xfs: kill struct xfs_dir2_block Remove the confusing xfs_dir2_block structure. It is supposed to describe an XFS dir2 block format btree block, but due to the variable sized nature of almost all elements in it it can't actuall do anything close to that job. In addition to accessing the fixed offset header structure it was only used to get a pointer to the first dir or unused entry after it, which can be trivially replaced by pointer arithmetics on the header pointer. For most users that is actually more natural anyway, as they don't use a typed pointer but rather a character pointer for further arithmetics. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:32 +02:00
Christoph Hellwig	4f6ae1a49e	xfs: avoid usage of struct xfs_dir2_block In most places we can simply pass around and use the struct xfs_dir2_data_hdr, which is the first and most important member of struct xfs_dir2_block instead of the full structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:27 +02:00
Christoph Hellwig	78f70cd7b7	xfs: cleanup the definition of struct xfs_dir2_sf_entry Remove the inumber member which is at a variable offset after the actual name, and make name a real variable sized C99 array instead of the incorrect one-sized array which confuses (not only) gcc. Based on this clean up the helpers to calculate the entry size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:19 +02:00
Christoph Hellwig	ac8ba50f6b	xfs: kill struct xfs_dir2_sf The list field of it is never cactually used, so all uses can simply be replaced with the xfs_dir2_sf_hdr_t type that it has as first member. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:13 +02:00
Christoph Hellwig	8bc3878758	xfs: cleanup shortform directory inode number handling Refactor the shortform directory helpers that deal with the 32-bit vs 64-bit wide inode numbers into more sensible helpers, and kill the xfs_intino_t typedef that is now superflous. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:03 +02:00
Christoph Hellwig	4fb44c8272	xfs: factor out xfs_dir2_leaf_find_entry Add a new xfs_dir2_leaf_find_entry helper to factor out some duplicate code from xfs_dir2_leaf_addname xfs_dir2_leafn_add. Found by Eric Sandeen using an automated code duplication checker. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:59 +02:00
Christoph Hellwig	29d104af0a	xfs: kill the unused struct xfs_sync_work Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:51 +02:00
Christoph Hellwig	f3ca87389d	xfs: remove i_transp Remove the transaction pointer in the inode. It's only used to avoid passing down an argument in the bmap code, and for a few asserts in the transaction code right now. Also use the local variable ip in a few more places in xfs_inode_item_unlock, so that it isn't only used for debug builds after the above change. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:47 +02:00
Christoph Hellwig	7a249cf83d	xfs: fix filesystsem freeze race in xfs_trans_alloc As pointed out by Jan xfs_trans_alloc can race with a concurrent filesystem freeze when it sleeps during the memory allocation. Fix this by moving the wait_for_freeze call after the memory allocation. This means moving the freeze into the low-level _xfs_trans_alloc helper, which thus grows a new argument. Also fix up some comments in that area while at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <david@fromorbit.com>	2011-07-08 14:34:42 +02:00
Christoph Hellwig	33b8f7c247	xfs: improve sync behaviour in the face of aggressive dirtying The following script from Wu Fengguang shows very bad behaviour in XFS when aggressively dirtying data during a sync on XFS, with sync times up to almost 10 times as long as ext4. A large part of the issue is that XFS writes data out itself two times in the ->sync_fs method, overriding the livelock protection in the core writeback code, and another issue is the lock-less xfs_ioend_wait call, which doesn't prevent new ioend from being queue up while waiting for the count to reach zero. This patch removes the XFS-internal sync calls and relies on the VFS to do it's work just like all other filesystems do. Note that the i_iocount wait which is rather suboptimal is simply removed here. We already do it in ->write_inode, which keeps the current supoptimal behaviour. We'll eventually need to remove that as well, but that's material for a separate commit. ------------------------------ snip ------------------------------ #!/bin/sh umount /dev/sda7 mkfs.xfs -f /dev/sda7 # mkfs.ext4 /dev/sda7 # mkfs.btrfs /dev/sda7 mount /dev/sda7 /fs echo $((50<<20)) > /proc/sys/vm/dirty_bytes pid= for i in `seq 10` do dd if=/dev/zero of=/fs/zero-$i bs=1M count=1000 & pid="$pid $!" done sleep 1 tic=$(date +'%s') sync tac=$(date +'%s') echo echo sync time: $((tac-tic)) egrep '(Dirty\|Writeback\|NFS_Unstable)' /proc/meminfo pidof dd > /dev/null && { kill -9 $pid; echo sync NOT livelocked; } ------------------------------ snip ------------------------------ Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:39 +02:00
Christoph Hellwig	8f04c47aa9	xfs: split xfs_itruncate_finish Split the guts of xfs_itruncate_finish that loop over the existing extents and calls xfs_bunmapi on them into a new helper, xfs_itruncate_externs. Make xfs_attr_inactive call it directly instead of xfs_itruncate_finish, which allows to simplify the latter a lot, by only letting it deal with the data fork. As a result xfs_itruncate_finish is renamed to xfs_itruncate_data to make its use case more obvious. Also remove the sync parameter from xfs_itruncate_data, which has been unessecary since the introduction of the busy extent list in 2002, and completely dead code since 2003 when the XFS_BMAPI_ASYNC parameter was made a no-op. I can't actually see why the xfs_attr_inactive needs to set the transaction sync, but let's keep this patch simple and without changes in behaviour. Also avoid passing a useless argument to xfs_isize_check, and make it private to xfs_inode.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:34 +02:00
Christoph Hellwig	857b9778d8	xfs: kill xfs_itruncate_start xfs_itruncate_start is a rather length wrapper that evaluates to a call to xfs_ioend_wait and xfs_tosspages, and only has two callers. Instead of using the complicated checks left over from IRIX where we can to truncate the pagecache just call xfs_tosspages (aka truncate_inode_pages) directly as we want to get rid of all data after i_size, and truncate_inode_pages handles incorrect alignments and too large offsets just fine. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:30 +02:00
Christoph Hellwig	681b120018	xfs: always log timestamp updates in xfs_setattr_size Get rid of the special case where we use unlogged timestamp updates for a truncate to the current inode size, and just call xfs_setattr_nonsize for it to treat it like a utimes calls. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:26 +02:00
Christoph Hellwig	c4ed4243c4	xfs: split xfs_setattr Split up xfs_setattr into two functions, one for the complex truncate handling, and one for the trivial attribute updates. Also move both new routines to xfs_iops.c as they are fairly Linux-specific. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:23 +02:00
Christoph Hellwig	dec58f1dfd	xfs: work around bogus gcc warning in xfs_allocbt_init_cursor GCC 4.6 complains about an array subscript is above array bounds when using the btree index to index into the agf_levels array. The only two indices passed in are 0 and 1, and we have an assert insuring that. Replace the trick of using the array index directly with using constants in the already existing branch for assigning the XFS_BTREE_LASTREC_UPDATE flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:18 +02:00
Christoph Hellwig	dbcdde3e76	xfs: re-enable non-blocking behaviour in xfs_map_blocks The non-blockig behaviour in xfs_vm_writepage currently is conditional on having both the WB_SYNC_NONE sync_mode and the nonblocking flag set. The latter used to be used by both pdflush, kswapd and a few other places in older kernels, but has been fading out starting with the introduction of the per-bdi flusher threads. Enable the non-blocking behaviour for all WB_SYNC_NONE calls to get back the behaviour we want. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:14 +02:00
Christoph Hellwig	680a647b49	xfs: PF_FSTRANS should never be set in ->writepage Now that we reject direct reclaim in addition to always using GFP_NOFS allocation there's no chance we'll ever end up in ->writepage with PF_FSTRANS set. Add a WARN_ON if we hit this case, and stop checking if we'd actually need to start a transaction. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:05 +02:00
Dave Chinner	1316d4da3f	xfs: unpin stale inodes directly in IOP_COMMITTED When inodes are marked stale in a transaction, they are treated specially when the inode log item is being inserted into the AIL. It tries to avoid moving the log item forward in the AIL due to a race condition with the writing the underlying buffer back to disk. The was "fixed" in commit `de25c18` ("xfs: avoid moving stale inodes in the AIL"). To avoid moving the item forward, we return a LSN smaller than the commit_lsn of the completing transaction, thereby trying to trick the commit code into not moving the inode forward at all. I'm not sure this ever worked as intended - it assumes the inode is already in the AIL, but I don't think the returned LSN would have been small enough to prevent moving the inode. It appears that the reason it worked is that the lower LSN of the inodes meant they were inserted into the AIL and flushed before the inode buffer (which was moved to the commit_lsn of the transaction). The big problem is that with delayed logging, the returning of the different LSN means insertion takes the slow, non-bulk path. Worse yet is that insertion is to a position -before- the commit_lsn so it is doing a AIL traversal on every insertion, and has to walk over all the items that have already been inserted into the AIL. It's expensive. To compound the matter further, with delayed logging inodes are likely to go from clean to stale in a single checkpoint, which means they aren't even in the AIL at all when we come across them at AIL insertion time. Hence these were all getting inserted into the AIL when they simply do not need to be as inodes marked XFS_ISTALE are never written back. Transactional/recovery integrity is maintained in this case by the other items in the unlink transaction that were modified (e.g. the AGI btree blocks) and committed in the same checkpoint. So to fix this, simply unpin the stale inodes directly in xfs_inode_item_committed() and return -1 to indicate that the AIL insertion code does not need to do any further processing of these inodes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-06 15:44:40 -05:00
Dave Chinner	4a33821236	xfs: prevent bogus assert when trying to remove non-existent attribute If the attribute fork on an inode is in btree format and has multiple levels (i.e node format rather than leaf format), then a lookup failure will trigger an assert failure in xfs_da_path_shift if the flag XFS_DA_OP_OKNOENT is not set. This flag is used to indicate to the directory btree code that not finding an entry is not a fatal error. In the case of doing a lookup for a directory name removal, this is valid as a user cannot insert an arbitrary name to remove from the directory btree. However, in the case of the attribute tree, a user has direct control over the attribute name and can ask for any random name to be removed without any validation. In this case, fsstress is asking for a non-existent user.selinux attribute to be removed, and that is causing xfs_da_path_shift() to fall off the bottom of the tree where it asserts that a lookup failure is allowed. Because the flag is not set, we die a horrible death on a debug enable kernel. Prevent this assert from firing on attribute removes by adding the op_flag XFS_DA_OP_OKNOENT to atribute removal operations. Discovered when testing on a SELinux enabled system by fsstress in test 070 by trying to remove a non-existent user.selinux attribute. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-06-23 22:13:51 -05:00
Dave Chinner	df4368a146	xfs: clear XFS_IDIRTY_RELEASE on truncate down When an inode is truncated down, speculative preallocation is removed from the inode. This should also reset the state bits for controlling whether preallocation is subsequently removed when the file is next closed. The flag is not being cleared, so repeated operations on a file that first involve a truncate (e.g. multiple repeated dd invocations on a file) give different file layouts for the second and subsequent invocations. Fix this by clearing the XFS_IDIRTY_RELEASE state bit when the XFS_ITRUNCATED bit is detected in xfs_release() and hence ensure that speculative delalloc is removed on files that have been truncated down. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-06-23 22:13:46 -05:00
Dave Chinner	778e24bb6d	xfs: reset inode per-lifetime state when recycling it XFS inodes has several per-lifetime state fields that determine the behaviour of the inode. These state fields are not all reset when an inode is reused from the reclaimable state. This can lead to unexpected behaviour of the new inode such as speculative preallocation not being truncated away in the expected manner for local files until the inode is subsequently truncated, freed or cycles out of the cache. It can also lead to an inode being considered to be a filestream inode or having been truncated when that is not the case. Rework the reinitialisation of the inode when it is recycled to ensure that it is pristine before it is reused. While there, also fix the resetting of state flags in the recycling error paths so the inode does not become unreclaimable. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-06-23 22:13:31 -05:00
Christoph Hellwig	a27a263bae	xfs: make log devices with write back caches work There's no reason not to support cache flushing on external log devices. The only thing this really requires is flushing the data device first both in fsync and log commits. A side effect is that we also have to remove the barrier write test during mount, which has been superflous since the new FLUSH+FUA code anyway. Also use the chance to flush the RT subvolume write cache before the fsync commit, which is required for correct semantics. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-06-16 10:52:39 -05:00
Al Viro	c46a131c0c	xfs: fix ->mknod() return value on xfs_get_acl() failure ->mknod() should return negative on errors and PTR_ERR() gives already negative value... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-06-14 11:02:13 -05:00
Christoph Hellwig	aa38572954	fs: pass exact type of data dirties to ->dirty_inode Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or anything else, so that the filesystem can track internally if it needs to push out a transaction for fdatasync or not. This is just the prototype change with no user for it yet. I plan to push large XFS changes for the next merge window, and getting this trivial infrastructure in this window would help a lot to avoid tree interdependencies. Also remove incorrect comments that ->dirty_inode can't block. That has been changed a long time ago, and many implementations rely on it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-27 07:04:40 -04:00
Linus Torvalds	8a0599dd24	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: correctly decrement the extent buffer index in xfs_bmap_del_extent xfs: check for valid indices in xfs_iext_get_ext and xfs_iext_idx_to_irec xfs: fix up asserts in xfs_iflush_fork xfs: do not do pointer arithmetic on extent records xfs: do not use unchecked extent indices in xfs_bunmapi xfs: do not use unchecked extent indices in xfs_bmapi xfs: do not use unchecked extent indices in xfs_bmap_add_extent_* xfs: remove if_lastex xfs: remove the unused XFS_BMAPI_RSVBLOCKS flag xfs: do not discard alloc btree blocks xfs: add online discard support	2011-05-26 10:49:11 -07:00
Christoph Hellwig	233eebb9a9	xfs: correctly decrement the extent buffer index in xfs_bmap_del_extent The code in xfs_bmap_del_extent does not correctly decrement the extent buffer index when deleting a whole extent. Most of the time this gets caught by checks in xfs_bmapi that work around it and decrement it manually and thus wasn't noticed so far. Based on an earlier patch from Lachlan McIlroy. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-25 10:48:38 -05:00
Christoph Hellwig	87bef1812d	xfs: check for valid indices in xfs_iext_get_ext and xfs_iext_idx_to_irec Based on an earlier patch from Lachlan McIlroy. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-25 10:48:37 -05:00
Christoph Hellwig	ab1908a5bb	xfs: fix up asserts in xfs_iflush_fork Remove asserts in xfs_iflush_fork that would call xfs_iext_get_ext with a potentially invalid extent buffer index. Based on an earlier patch from Lachlan McIlroy. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-25 10:48:37 -05:00
Christoph Hellwig	f1c63b73cf	xfs: do not do pointer arithmetic on extent records We need to call xfs_iext_get_ext for the previous extent to get a valid pointer, and can't just do pointer arithmetics as they might be in different pages. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-25 10:48:37 -05:00
Christoph Hellwig	00239acf36	xfs: do not use unchecked extent indices in xfs_bunmapi Make sure to only call xfs_iext_get_ext after we've validate the extent index when moving on to the next index in xfs_bunmapi. Also remove the old workaround for too large indices that has been superceeded by the proper fix in xfs_bmap_del_extent. Based on an earlier patch from Lachlan McIlroy. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-25 10:48:37 -05:00
Christoph Hellwig	5690f92199	xfs: do not use unchecked extent indices in xfs_bmapi Make sure to only call xfs_iext_get_ext after we've validate the extent index when moving on to the next index in xfs_bmapi. Based on an earlier patch from Lachlan McIlroy. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-25 10:48:37 -05:00
Christoph Hellwig	2f2b3220b0	xfs: do not use unchecked extent indices in xfs_bmap_add_extent_* Make sure to only call xfs_iext_get_ext after we've validate the extent index in the various xfs_bmap_add_extent_* helpers. Based on an earlier patch from Lachlan McIlroy. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-25 10:48:37 -05:00
Christoph Hellwig	ec90c55634	xfs: remove if_lastex The if_lastex field in struct xfs_ifork is only used as a temporary index during xfs_bmapi and xfs_bunmapi. Instead of using the inode fork to store it keep it local in the callchain. Fortunately this is very easy as we already pass a stack copy of it down the whole chain which can simplify be changed to be passed by reference. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-25 10:48:37 -05:00
Christoph Hellwig	548932739b	xfs: remove the unused XFS_BMAPI_RSVBLOCKS flag The XFS_BMAPI_RSVBLOCKS is unused, and as far as I can see has always been. Remove it to simplify the bmapi implementation and conserve stack space. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-25 10:48:36 -05:00
Ying Han	1495f230fa	vmscan: change shrinker API by passing shrink_control struct Change each shrinker's API by consolidating the existing parameters into shrink_control struct. This will simplify any further features added w/o touching each file of shrinker. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: fix warning] [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API] [akpm@linux-foundation.org: fix xfs warning] [akpm@linux-foundation.org: update gfs2] Signed-off-by: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-25 08:39:26 -07:00
Christoph Hellwig	55a7bc5a30	xfs: do not discard alloc btree blocks Blocks for the allocation btree are allocated from and released to the AGFL, and thus frequently reused. Even worse we do not have an easy way to avoid using an AGFL block when it is discarded due to the simple FILO list of free blocks, and thus can frequently stall on blocks that are currently undergoing a discard. Add a flag to the busy extent tracking structure to skip the discard for allocation btree blocks. In normal operation these blocks are reused frequently enough that there is no need to discard them anyway, but if they spill over to the allocation btree as part of a balance we "leak" blocks that we would otherwise discard. We could fix this by adding another flag and keeping these block in the rbtree even after they aren't busy any more so that we could discard them when they migrate out of the AGFL. Given that this would cause significant overhead I don't think it's worthwile for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-24 11:17:22 -05:00
Christoph Hellwig	e84661aa84	xfs: add online discard support Now that we have reliably tracking of deleted extents in a transaction we can easily implement "online" discard support which calls blkdev_issue_discard once a transaction commits. The actual discard is a two stage operation as we first have to mark the busy extent as not available for reuse before we can start the actual discard. Note that we don't bother supporting discard for the non-delaylog mode. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-24 11:17:13 -05:00
Linus Torvalds	a77febbef1	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: obey minleft values during extent allocation correctly xfs: reset buffer pointers before freeing them xfs: avoid getting stuck during async inode flushes xfs: fix xfs_itruncate_start tracing xfs: fix duplicate workqueue initialisation xfs: kill off xfs_printk() xfs: fix race condition in AIL push trigger xfs: make AIL target updates and compares 32bit safe. xfs: always push the AIL to the target xfs: exit AIL push work correctly when AIL is empty xfs: ensure reclaim cursor is reset correctly at end of AG xfs: add an x86 compat handler for XFS_IOC_ZERO_RANGE xfs: fix compiler warning in xfs_trace.h xfs: cleanup duplicate initializations xfs: reduce the number of pagb_lock roundtrips in xfs_alloc_clear_busy xfs: exact busy extent tracking xfs: do not immediately reuse busy extent ranges xfs: optimize AGFL refills	2011-05-23 15:19:16 -07:00
Linus Torvalds	57d19e80f4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) b43: fix comment typo reqest -> request Haavard Skinnemoen has left Atmel cris: typo in mach-fs Makefile Kconfig: fix copy/paste-ism for dell-wmi-aio driver doc: timers-howto: fix a typo ("unsgined") perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course'). treewide: fix a few typos in comments regulator: change debug statement be consistent with the style of the rest Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations" audit: acquire creds selectively to reduce atomic op overhead rtlwifi: don't touch with treewide double semicolon removal treewide: cleanup continuations and remove logging message whitespace ath9k_hw: don't touch with treewide double semicolon removal include/linux/leds-regulator.h: fix syntax in example code tty: fix typo in descripton of tty_termios_encode_baud_rate xtensa: remove obsolete BKL kernel option from defconfig m68k: fix comment typo 'occcured' arch:Kconfig.locks Remove unused config option. treewide: remove extra semicolons ...	2011-05-23 09:12:26 -07:00
Dave Chinner	bf59170a66	xfs: obey minleft values during extent allocation correctly When allocating an extent that is long enough to consume the remaining free space in an AG, we need to ensure that the allocation leaves enough space in the AG for any subsequent bmap btree blocks that are needed to track the new extent. These have to be allocated in the same AG as we only reserve enough blocks in an allocation transaction for modification of the freespace trees in a single AG. xfs_alloc_fix_minleft() has been considering blocks on the AGFL as free blocks available for extent and bmbt block allocation, which is not correct - blocks on the AGFL are there exclusively for the use of the free space btrees. As a result, when minleft is less than the number of blocks on the AGFL, xfs_alloc_fix_minleft() does not trim the given extent to leave minleft blocks available for bmbt allocation, and hence we can fail allocation during bmbt record insertion. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-19 12:03:48 -05:00
Dave Chinner	44396476a0	xfs: reset buffer pointers before freeing them When we free a vmapped buffer, we need to ensure the vmap address and length we free is the same as when it was allocated. In various places in the log code we change the memory the buffer is pointing to before issuing IO, but we never reset the buffer to point back to it's original memory (or no memory, if that is the case for the buffer). As a result, when we free the buffer it points to memory that is owned by something else and attempts to unmap and free it. Because the range does not match any known mapped range, it can trigger BUG_ON() traps in the vmap code, and potentially corrupt the vmap area tracking. Fix this by always resetting these buffers to their original state before freeing them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-19 12:03:45 -05:00
Dave Chinner	ee58abdfcc	xfs: avoid getting stuck during async inode flushes When the underlying inode buffer is locked and xfs_sync_inode_attr() is doing a non-blocking flush, xfs_iflush() can return EAGAIN. When this happens, clear the error rather than returning it to xfs_inode_ag_walk(), as returning EAGAIN will result in the AG walk delaying for a short while and trying again. This can result in background walks getting stuck on the one AG until inode buffer is unlocked by some other means. This behaviour was noticed when analysing event traces followed by code inspection and verification of the fix via further traces. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-19 12:03:42 -05:00
Dave Chinner	e57375153d	xfs: fix xfs_itruncate_start tracing Variables are ordered incorrectly in trace call. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-19 12:03:36 -05:00
Dave Chinner	1beb65ad45	xfs: fix duplicate workqueue initialisation The workqueue initialisation function is called twice when initialising the XFS subsystem. Remove the second initialisation call. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-19 12:03:24 -05:00
Joe Perches	e69522a8cc	xfs: kill off xfs_printk() xfs_alert_tag() can be defined using xfs_alert(), and thereby avoid using xfs_printk() altogether. This is the only remaining use of xfs_printk(), so changing it this way means xfs_printk() can simply be eliminated.can simply be eliminated.can simply be eliminated.can simply be eliminated.can simply be eliminated.can simply be eliminated.can simply be eliminated.can simply be eliminated.can simply be eliminated. Also add format checking to the non-debug inline function xfs_debug. Miscellaneous function prototype argument alignment. (Updated to delete the definition of xfs_printk(), which is no longer used or needed.) Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-05-19 11:38:09 -05:00
Justin P. Mattock	70f23fd66b	treewide: fix a few typos in comments - kenrel -> kernel - whetehr -> whether - ttt -> tt - sss -> ss Signed-off-by: Justin P. Mattock <justinmattock@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-05-10 10:16:21 +02:00
Dave Chinner	7ac956576d	xfs: fix race condition in AIL push trigger The recent conversion of the xfsaild functionality to a work queue introduced a hard-to-hit log space grant hang. One is caused by a race condition in determining whether there is a psh in progress or not. The XFS_AIL_PUSHING_BIT is used to determine whether a push is currently in progress. When the AIL push work completes, it checked whether the target changed and cleared the PUSHING bit to allow a new push to be requeued. The race condition is as follows: Thread 1 push work smp_wmb() smp_rmb() check ailp->xa_target unchanged update ailp->xa_target test/set PUSHING bit does not queue clear PUSHING bit does not requeue Now that the push target is updated, new attempts to push the AIL will not trigger as the push target will be the same, and hence despite trying to push the AIL we won't ever wake it again. The fix is to ensure that the AIL push work clears the PUSHING bit before it checks if the target is unchanged. As a result, both push triggers operate on the same test/set bit criteria, so even if we race in the push work and miss the target update, the thread requesting the push will still set the PUSHING bit and queue the push work to occur. For safety sake, the same queue check is done if the push work detects the target change, though only one of the two will will queue new work due to the use of test_and_set_bit() checks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> (cherry picked from commit `e4d3c4a43b`)	2011-05-09 18:35:04 -05:00
Dave Chinner	fe0da76731	xfs: make AIL target updates and compares 32bit safe. The recent conversion of the xfsaild functionality to a work queue introduced a hard-to-hit log space grant hang. One of the problems noticed was that updates of the push target are not 32 bit safe as the target is a 64 bit value. We cannot copy a 64 bit LSN without the possibility of corrupting the result when racing with another updating thread. We have function to do this update safely without needing to care about 32/64 bit issues - xfs_trans_ail_copy_lsn() - so use that when updating the AIL push target. Also move the reading of the target in the push work inside the AIL lock, and use XFS_LSN_CMP() for the unlocked comparison during work termination to close read holes as well. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> (cherry picked from commit `fd5670f22f`)	2011-05-09 18:35:04 -05:00
Dave Chinner	50e86686df	xfs: always push the AIL to the target The recent conversion of the xfsaild functionality to a work queue introduced a hard-to-hit log space grant hang. One of the problems discovered is a target mismatch between the item pushing loop and the target itself. The push trigger checks for the target increasing (i.e. new target > current) while the push loop only pushes items that have a LSN < current. As a result, we can get the situation where the push target is X, the items at the tail of the AIL have LSN X and they don't get pushed. The push work then completes thinking it is done, and cannot be restarted until the push target increases to >= X + 1. If the push target then never increases (because the tail is not moving), then we never run the push work again and we stall. Fix it by making sure log items with a LSN that matches the target exactly are pushed during the loop. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> (cherry picked from commit `cb64026b6e`)	2011-05-09 18:35:03 -05:00
Dave Chinner	9e7004e741	xfs: exit AIL push work correctly when AIL is empty The recent conversion of the xfsaild functionality to a work queue introduced a hard-to-hit log space grant hang. The main cause is a regression where a work exit path fails to clear the PUSHING state and recheck the target correctly. Make both exit paths do the same PUSHING bit clearing and target checking when the "no more work to be done" condition is hit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> (cherry picked from commit `ea35a20021`)	2011-05-09 18:35:03 -05:00
Dave Chinner	228d62dd3f	xfs: ensure reclaim cursor is reset correctly at end of AG On a 32 bit highmem PowerPC machine, the XFS inode cache was growing without bound and exhausting low memory causing the OOM killer to be triggered. After some effort, the problem was reproduced on a 32 bit x86 highmem machine. The problem is that the per-ag inode reclaim index cursor was not getting reset to the start of the AG if the radix tree tag lookup found no more reclaimable inodes. Hence every further reclaim attempt started at the same index beyond where any reclaimable inodes lay, and no further background reclaim ever occurred from the AG. Without background inode reclaim the VM driven cache shrinker simply cannot keep up with cache growth, and OOM is the result. While the change that exposed the problem was the conversion of the inode reclaim to use work queues for background reclaim, it was not the cause of the bug. The bug was introduced when the cursor code was added, just waiting for some weird configuration to strike.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-By: Christian Kujau <lists@nerdbynature.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> (cherry picked from commit `b223221956`)	2011-05-09 18:35:03 -05:00
Dave Chinner	e4d3c4a43b	xfs: fix race condition in AIL push trigger The recent conversion of the xfsaild functionality to a work queue introduced a hard-to-hit log space grant hang. One is caused by a race condition in determining whether there is a psh in progress or not. The XFS_AIL_PUSHING_BIT is used to determine whether a push is currently in progress. When the AIL push work completes, it checked whether the target changed and cleared the PUSHING bit to allow a new push to be requeued. The race condition is as follows: Thread 1 push work smp_wmb() smp_rmb() check ailp->xa_target unchanged update ailp->xa_target test/set PUSHING bit does not queue clear PUSHING bit does not requeue Now that the push target is updated, new attempts to push the AIL will not trigger as the push target will be the same, and hence despite trying to push the AIL we won't ever wake it again. The fix is to ensure that the AIL push work clears the PUSHING bit before it checks if the target is unchanged. As a result, both push triggers operate on the same test/set bit criteria, so even if we race in the push work and miss the target update, the thread requesting the push will still set the PUSHING bit and queue the push work to occur. For safety sake, the same queue check is done if the push work detects the target change, though only one of the two will will queue new work due to the use of test_and_set_bit() checks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-05-09 12:17:04 -05:00
Dave Chinner	fd5670f22f	xfs: make AIL target updates and compares 32bit safe. The recent conversion of the xfsaild functionality to a work queue introduced a hard-to-hit log space grant hang. One of the problems noticed was that updates of the push target are not 32 bit safe as the target is a 64 bit value. We cannot copy a 64 bit LSN without the possibility of corrupting the result when racing with another updating thread. We have function to do this update safely without needing to care about 32/64 bit issues - xfs_trans_ail_copy_lsn() - so use that when updating the AIL push target. Also move the reading of the target in the push work inside the AIL lock, and use XFS_LSN_CMP() for the unlocked comparison during work termination to close read holes as well. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-05-09 12:17:04 -05:00
Dave Chinner	cb64026b6e	xfs: always push the AIL to the target The recent conversion of the xfsaild functionality to a work queue introduced a hard-to-hit log space grant hang. One of the problems discovered is a target mismatch between the item pushing loop and the target itself. The push trigger checks for the target increasing (i.e. new target > current) while the push loop only pushes items that have a LSN < current. As a result, we can get the situation where the push target is X, the items at the tail of the AIL have LSN X and they don't get pushed. The push work then completes thinking it is done, and cannot be restarted until the push target increases to >= X + 1. If the push target then never increases (because the tail is not moving), then we never run the push work again and we stall. Fix it by making sure log items with a LSN that matches the target exactly are pushed during the loop. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-05-09 12:17:04 -05:00
Dave Chinner	ea35a20021	xfs: exit AIL push work correctly when AIL is empty The recent conversion of the xfsaild functionality to a work queue introduced a hard-to-hit log space grant hang. The main cause is a regression where a work exit path fails to clear the PUSHING state and recheck the target correctly. Make both exit paths do the same PUSHING bit clearing and target checking when the "no more work to be done" condition is hit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-05-09 12:17:03 -05:00
Dave Chinner	b223221956	xfs: ensure reclaim cursor is reset correctly at end of AG On a 32 bit highmem PowerPC machine, the XFS inode cache was growing without bound and exhausting low memory causing the OOM killer to be triggered. After some effort, the problem was reproduced on a 32 bit x86 highmem machine. The problem is that the per-ag inode reclaim index cursor was not getting reset to the start of the AG if the radix tree tag lookup found no more reclaimable inodes. Hence every further reclaim attempt started at the same index beyond where any reclaimable inodes lay, and no further background reclaim ever occurred from the AG. Without background inode reclaim the VM driven cache shrinker simply cannot keep up with cache growth, and OOM is the result. While the change that exposed the problem was the conversion of the inode reclaim to use work queues for background reclaim, it was not the cause of the bug. The bug was introduced when the cursor code was added, just waiting for some weird configuration to strike.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-By: Christian Kujau <lists@nerdbynature.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-05-09 12:17:03 -05:00
Christoph Hellwig	8c1fdd0be5	xfs: add an x86 compat handler for XFS_IOC_ZERO_RANGE XFS_IOC_ZERO_RANGE uses struct xfs_flock64, and thus requires argument translation for 32-bit binaries on x86. Add the required XFS_IOC_ZERO_RANGE_32 defined and add it to the list of commands that require xfs_flock64 translation. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-04-28 13:27:46 -05:00
Christoph Hellwig	1a18a29478	xfs: fix compiler warning in xfs_trace.h xfs_fsblock_t may be a 32-bit type on if XFS_BIG_BLKNOS is not set, make sure to cast a value of this type to an unsigned long long before using the ll printk qualifier. Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-04-28 13:27:06 -05:00
David Sterba	45c51b9994	xfs: cleanup duplicate initializations follow these guidelines: - leave initialization in the declaration block if it fits the line - move to the code where it's more suitable ('for' init block) The last chunk was modified from David's original to be a correct fix for what appeared to be a duplicate initialization. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-04-28 13:25:29 -05:00
Christoph Hellwig	8a072a4d4c	xfs: reduce the number of pagb_lock roundtrips in xfs_alloc_clear_busy Instead of finding the per-ag and then taking and releasing the pagb_lock for every single busy extent completed sort the list of busy extents and only switch betweens AGs where nessecary. This becomes especially important with the online discard support which will hit this lock more often. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-04-28 13:18:09 -05:00
Christoph Hellwig	97d3ac75e5	xfs: exact busy extent tracking Update the extent tree in case we have to reuse a busy extent, so that it always is kept uptodate. This is done by replacing the busy list searches with a new xfs_alloc_busy_reuse helper, which updates the busy extent tree in case of a reuse. This allows us to allow reusing metadata extents unconditionally, and thus avoid log forces especially for allocation btree blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-04-28 13:18:04 -05:00
Christoph Hellwig	e26f0501cf	xfs: do not immediately reuse busy extent ranges Every time we reallocate a busy extent, we cause a synchronous log force to occur to ensure the freeing transaction is on disk before we continue and use the newly allocated extent. This is extremely sub-optimal as we have to mark every transaction with blocks that get reused as synchronous. Instead of searching the busy extent list after deciding on the extent to allocate, check each candidate extent during the allocation decisions as to whether they are in the busy list. If they are in the busy list, we trim the busy range out of the extent we have found and determine if that trimmed range is still OK for allocation. In many cases, this check can be incorporated into the allocation extent alignment code which already does trimming of the found extent before determining if it is a valid candidate for allocation. Based on earlier patches from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-04-28 13:18:01 -05:00
Christoph Hellwig	a870acd9b2	xfs: optimize AGFL refills While we need to make sure we do not reuse busy extents, there is no need to force out busy extents when moving them between the AGFL and the freespace btree as we still take care of that when doing the real allocation. To avoid the log force when just moving extents from the different free space tracking structures, move the busy search out of xfs_alloc_get_freelist into the callers that need it, and move the busy list insert from xfs_free_ag_extent which is used both by AGFL refills and real allocation to xfs_free_extent, which is only used by the latter. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-04-28 13:17:56 -05:00
Dave Chinner	3eff126899	xfs: fix duplicate message output Commit `957935dc` ("xfs: fix xfs_debug warnings" broke the logic in __xfs_printk(). Instead of only printing one of two possible output strings based on whether the fs has a name or not, it outputs both. Fix it to only output one message again. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-04-20 11:36:49 -05:00
Linus Torvalds	1e05ff020f	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: use proper interfaces for on-stack plugging xfs: fix xfs_debug warnings xfs: fix variable set but not used warnings xfs: convert log tail checking to a warning xfs: catch bad block numbers freeing extents. xfs: push the AIL from memory reclaim and periodic sync xfs: clean up code layout in xfs_trans_ail.c xfs: convert the xfsaild threads to a workqueue xfs: introduce background inode reclaim work xfs: convert ENOSPC inode flushing to use new syncd workqueue xfs: introduce a xfssyncd workqueue xfs: fix extent format buffer allocation size xfs: fix unreferenced var error in xfs_buf.c Also, applied patch from Tony Luck that fixes ia64: xfs_destroy_workqueues() should not be tagged with__exit in the branch before merging.	2011-04-11 15:48:57 -07:00
Luck, Tony	39411f81ee	xfs_destroy_workqueues() should not be tagged with__exit ia64 throws away .exit sections for the built-in CONFIG case, so routines that are used in other circumstances should not be tagged as __exit. Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-04-11 15:47:20 -07:00
Christoph Hellwig	a1b7ea5d58	xfs: use proper interfaces for on-stack plugging Add proper blk_start_plug/blk_finish_plug pairs for the two places where we issue buffer I/O, and remove the blk_flush_plug in xfs_buf_lock and xfs_buf_iowait, given that context switches already flush the per-process plugging lists. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-04-08 08:09:28 -05:00
Christoph Hellwig	957935dcd8	xfs: fix xfs_debug warnings For a CONFIG_XFS_DEBUG=n build gcc complains about statements with no effect in xfs_debug: fs/xfs/quota/xfs_qm_syscalls.c: In function 'xfs_qm_scall_trunc_qfiles': fs/xfs/quota/xfs_qm_syscalls.c:291:3: warning: statement with no effect The reason for that is that the various new xfs message functions have a return value which is never used, and in case of the non-debug build xfs_debug the macro evaluates to a plain 0 which produces the above warnings. This can be fixed by turning xfs_debug into an inline function instead of a macro, but in addition to that I've also changed all the message helpers to return void as we never use their return values. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-04-08 08:09:24 -05:00
Christoph Hellwig	ecb697c16c	xfs: fix variable set but not used warnings GCC 4.6 now warnings about variables set but not used. Fix the trivially fixable warnings of this sort. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-04-08 08:09:12 -05:00
Dave Chinner	da8a1a4a4d	xfs: convert log tail checking to a warning On the Power platform, the log tail debug checks fire excessively causing the system to panic early in testing. The debug checks are known to be racy, though on x86_64 there is no evidence that they trigger at all. We want to keep the checks active on debug systems to alert us to problems with log space accounting, but we need to reduce the impact of a racy check on testing on the Power platform. As a result, convert the ASSERT conditions to warnings, and allow them to fire only once per filesystem mount. This will prevent false positives from interfering with testing, whilst still providing us with the indication that they may be a problem with log space accounting should that occur. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	be65b18a10	xfs: catch bad block numbers freeing extents. A fuzzed filesystem crashed a kernel when freeing an extent with a block number beyond the end of the filesystem. Convert all the debug asserts in xfs_free_extent() to active checks so that we catch bad extents and return that the filesytsem is corrupted rather than crashing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	fd074841cf	xfs: push the AIL from memory reclaim and periodic sync When we are short on memory, we want to expedite the cleaning of dirty objects. Hence when we run short on memory, we need to kick the AIL flushing into action to clean as many dirty objects as quickly as possible. To implement this, sample the lsn of the log item at the head of the AIL and use that as the push target for the AIL flush. Further, we keep items in the AIL that are dirty that are not tracked any other way, so we can get objects sitting in the AIL that don't get written back until the AIL is pushed. Hence to get the filesystem to the idle state, we might need to push the AIL to flush out any remaining dirty objects sitting in the AIL. This requires the same push mechanism as the reclaim push. This patch also renames xfs_trans_ail_tail() to xfs_ail_min_lsn() to match the new xfs_ail_max_lsn() function introduced in this patch. Similarly for xfs_trans_ail_push -> xfs_ail_push. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	cd4a3c503c	xfs: clean up code layout in xfs_trans_ail.c This patch rearranges the location of functions in xfs_trans_ail.c to remove the need for forward declarations of those functions in preparation for adding new functions without the need for forward declarations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	0bf6a5bd4b	xfs: convert the xfsaild threads to a workqueue Similar to the xfssyncd, the per-filesystem xfsaild threads can be converted to a global workqueue and run periodically by delayed works. This makes sense for the AIL pushing because it uses variable timeouts depending on the work that needs to be done. By removing the xfsaild, we simplify the AIL pushing code and remove the need to spread the code to implement the threading and pushing across multiple files. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	a7b339f1b8	xfs: introduce background inode reclaim work Background inode reclaim needs to run more frequently that the XFS syncd work is run as 30s is too long between optimal reclaim runs. Add a new periodic work item to the xfs syncd workqueue to run a fast, non-blocking inode reclaim scan. Background inode reclaim is kicked by the act of marking inodes for reclaim. When an AG is first marked as having reclaimable inodes, the background reclaim work is kicked. It will continue to run periodically untill it detects that there are no more reclaimable inodes. It will be kicked again when the first inode is queued for reclaim. To ensure shrinker based inode reclaim throttles to the inode cleaning and reclaim rate but still reclaim inodes efficiently, make it kick the background inode reclaim so that when we are low on memory we are trying to reclaim inodes as efficiently as possible. This kick shoul d not be necessary, but it will protect against failures to kick the background reclaim when inodes are first dirtied. To provide the rate throttling, make the shrinker pass do synchronous inode reclaim so that it blocks on inodes under IO. This means that the shrinker will reclaim inodes rather than just skipping over them, but it does not adversely affect the rate of reclaim because most dirty inodes are already under IO due to the background reclaim work the shrinker kicked. These two modifications solve one of the two OOM killer invocations Chris Mason reported recently when running a stress testing script. The particular workload trigger for the OOM killer invocation is where there are more threads than CPUs all unlinking files in an extremely memory constrained environment. Unlike other solutions, this one does not have a performance impact on performance when memory is not constrained or the number of concurrent threads operating is <= to the number of CPUs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	89e4cb550a	xfs: convert ENOSPC inode flushing to use new syncd workqueue On of the problems with the current inode flush at ENOSPC is that we queue a flush per ENOSPC event, regardless of how many are already queued. Thi can result in hundreds of queued flushes, most of which simply burn CPU scanned and do no real work. This simply slows down allocation at ENOSPC. We really only need one active flush at a time, and we can easily implement that via the new xfs_syncd_wq. All we need to do is queue a flush if one is not already active, then block waiting for the currently active flush to complete. The result is that we only ever have a single ENOSPC inode flush active at a time and this greatly reduces the overhead of ENOSPC processing. On my 2p test machine, this results in tests exercising ENOSPC conditions running significantly faster - 042 halves execution time, 083 drops from 60s to 5s, etc - while not introducing test regressions. This allows us to remove the old xfssyncd threads and infrastructure as they are no longer used. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	c6d09b666d	xfs: introduce a xfssyncd workqueue All of the work xfssyncd does is background functionality. There is no need for a thread per filesystem to do this work - it can al be managed by a global workqueue now they manage concurrency effectively. Introduce a new gglobal xfssyncd workqueue, and convert the periodic work to use this new functionality. To do this, use a delayed work construct to schedule the next running of the periodic sync work for the filesystem. When the sync work is complete, queue a new delayed work for the next running of the sync work. For laptop mode, we wait on completion for the sync works, so ensure that the sync work queuing interface can flush and wait for work to complete to enable the work queue infrastructure to replace the current sequence number and wakeup that is used. Because the sync work does non-trivial amounts of work, mark the new work queue as CPU intensive. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	e828776a8a	xfs: fix extent format buffer allocation size When formatting an inode item, we have to allocate a separate buffer to hold extents when there are delayed allocation extents on the inode and it is in extent format. The allocation size is derived from the in-core data fork representation, which accounts for delayed allocation extents, while the on-disk representation does not contain any delalloc extents. As a result of this mismatch, the allocated buffer can be far larger than needed to hold the real extent list which, due to the fact the inode is in extent format, is limited to the size of the literal area of the inode. However, we can have thousands of delalloc extents, resulting in an allocation size orders of magnitude larger than is needed to hold all the real extents. Fix this by limiting the size of the buffer being allocated to the size of the literal area of the inodes in the filesystem (i.e. the maximum size an inode fork can grow to). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
Dave Chinner	89b3600ccf	xfs: fix unreferenced var error in xfs_buf.c Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-03-30 23:34:20 -05:00
Linus Torvalds	c5850150d0	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: stop using the page cache to back the buffer cache xfs: register the inode cache shrinker before quotachecks xfs: xfs_trans_read_buf() should return an error on failure xfs: introduce inode cluster buffer trylocks for xfs_iflush vmap: flush vmap aliases when mapping fails xfs: preallocation transactions do not need to be synchronous Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_buf.c due to plug removal.	2011-03-28 15:51:02 -07:00
Dave Chinner	0e6e847ffe	xfs: stop using the page cache to back the buffer cache Now that the buffer cache has it's own LRU, we do not need to use the page cache to provide persistent caching and reclaim infrastructure. Convert the buffer cache to use alloc_pages() instead of the page cache. This will remove all the overhead of page cache management from setup and teardown of the buffers, as well as needing to mark pages accessed as we find buffers in the buffer cache. By avoiding the page cache, we also remove the need to keep state in the page_private(page) field for persistant storage across buffer free/buffer rebuild and so all that code can be removed. This also fixes the long-standing problem of not having enough bits in the page_private field to track all the state needed for a 512 sector/64k page setup. It also removes the need for page locking during reads as the pages are unique to the buffer and nobody else will be attempting to access them. Finally, it removes the buftarg address space lock as a point of global contention on workloads that allocate and free buffers quickly such as when creating or removing large numbers of inodes in parallel. This remove the 16TB limit on filesystem size on 32 bit machines as the page index (32 bit) is no longer used for lookups of metadata buffers - the buffer cache is now solely indexed by disk address which is stored in a 64 bit field in the buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-03-26 09:16:45 +11:00
Dave Chinner	704b2907c2	xfs: register the inode cache shrinker before quotachecks During mount, we can do a quotacheck that involves a bulkstat pass on all inodes. If there are more inodes in the filesystem than can be held in memory, we require the inode cache shrinker to run to ensure that we don't run out of memory. Unfortunately, the inode cache shrinker is not registered until we get to the end of the superblock setup process, which is after a quotacheck is run if it is needed. Hence we need to register the inode cache shrinker earlier in the mount process so that we don't OOM during mount. This requires that we also initialise the syncd work before we register the shrinker, so we nee dto juggle that around as well. While there, make sure that we have set up the block sizes in the VFS superblock correctly before the quotacheck is run so that any inodes that are cached as a result of the quotacheck have their block size fields set up correctly. Cc: stable@kernel.org Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-03-26 09:14:57 +11:00
Dave Chinner	7401aafd50	xfs: xfs_trans_read_buf() should return an error on failure When inside a transaction and we fail to read a buffer, xfs_trans_read_buf returns a null buffer pointer and no error. xfs_do_da_buf() checks the error return, but not the buffer, and as a result this read failure condition causes a panic when it attempts to dereference the non-existant buffer. Make xfs_trans_read_buf() return the same error for this situation regardless of whether it is in a transaction or not. This means every caller does not need to check both the error return and the buffer before proceeding to use the buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-03-26 09:14:44 +11:00
Dave Chinner	1bfd8d0419	xfs: introduce inode cluster buffer trylocks for xfs_iflush There is an ABBA deadlock between synchronous inode flushing in xfs_reclaim_inode and xfs_icluster_free. xfs_icluster_free locks the buffer, then takes inode ilocks, whilst synchronous reclaim takes the ilock followed by the buffer lock in xfs_iflush(). To avoid this deadlock, separate the inode cluster buffer locking semantics from the synchronous inode flush semantics, allowing callers to attempt to lock the buffer but still issue synchronous IO if it can get the buffer. This requires xfs_iflush() calls that currently use non-blocking semantics to pass SYNC_TRYLOCK rather than 0 as the flags parameter. This allows xfs_reclaim_inode to avoid the deadlock on the buffer lock and detect the failure so that it can drop the inode ilock and restart the reclaim attempt on the inode. This allows xfs_ifree_cluster to obtain the inode lock, mark the inode stale and release it and hence defuse the deadlock situation. It also has the pleasant side effect of avoiding IO in xfs_reclaim_inode when it tries to next reclaim the inode as it is now marked stale. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-03-26 09:13:55 +11:00
Dave Chinner	a19fb38096	vmap: flush vmap aliases when mapping fails On 32 bit systems, vmalloc space is limited and XFS can chew through it quickly as the vmalloc space is lazily freed. This can result in failure to map buffers, even when there is apparently large amounts of vmalloc space available. Hence, if we fail to map a buffer, purge the aliases that have not yet been freed to hopefuly free up enough vmalloc space to allow a retry to succeed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-03-26 09:13:42 +11:00
Dave Chinner	8287889742	xfs: preallocation transactions do not need to be synchronous Preallocation and hole punch transactions are currently synchronous and this is causing performance problems in some cases. The transactions don't need to be synchronous as we don't need to guarantee the preallocation is persistent on disk until a fdatasync, fsync, sync operation occurs. If the file is opened O_SYNC or O_DATASYNC, only then should the transaction be issued synchronously. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-03-26 09:13:08 +11:00
Linus Torvalds	6c51038900	Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}	2011-03-24 10:16:26 -07:00
Linus Torvalds	3155fe6df5	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (23 commits) xfs: don't name variables "panic" xfs: factor agf counter updates into a helper xfs: clean up the xfs_alloc_compute_aligned calling convention xfs: kill support/debug.[ch] xfs: Convert remaining cmn_err() callers to new API xfs: convert the quota debug prints to new API xfs: rename xfs_cmn_err_fsblock_zero() xfs: convert xfs_fs_cmn_err to new error logging API xfs: kill xfs_fs_mount_cmn_err() macro xfs: kill xfs_fs_repair_cmn_err() macro xfs: convert xfs_cmn_err to xfs_alert_tag xfs: Convert xlog_warn to new logging interface xfs: Convert linux-2.6/ files to new logging interface xfs: introduce new logging API. xfs: zero proper structure size for geometry calls xfs: enable delaylog by default xfs: more sensible inode refcounting for ialloc xfs: stop using xfs_trans_iget in the RT allocator xfs: check if device support discard in xfs_ioc_trim() xfs: prevent leaking uninitialized stack memory in FSGEOMETRY_V1 ...	2011-03-21 14:24:56 -07:00
Linus Torvalds	a44f99c7ef	Merge branch 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6 * 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6: (25 commits) video: change to new flag variable scsi: change to new flag variable rtc: change to new flag variable rapidio: change to new flag variable pps: change to new flag variable net: change to new flag variable misc: change to new flag variable message: change to new flag variable memstick: change to new flag variable isdn: change to new flag variable ieee802154: change to new flag variable ide: change to new flag variable hwmon: change to new flag variable dma: change to new flag variable char: change to new flag variable fs: change to new flag variable xtensa: change to new flag variable um: change to new flag variables s390: change to new flag variable mips: change to new flag variable ... Fix up trivial conflict in drivers/hwmon/Makefile	2011-03-20 18:14:55 -07:00
matt mooney	0ccd234ca0	fs: change to new flag variable Replace EXTRA_CFLAGS with ccflags-y. And change ntfs-objs to ntfs-y for cleaner conditional inclusion. Signed-off-by: matt mooney <mfm@muteddisk.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Michal Marek <mmarek@suse.cz>	2011-03-17 14:02:57 +01:00
Linus Torvalds	0f6e0e8448	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits) AppArmor: kill unused macros in lsm.c AppArmor: cleanup generated files correctly KEYS: Add an iovec version of KEYCTL_INSTANTIATE KEYS: Add a new keyctl op to reject a key with a specified error code KEYS: Add a key type op to permit the key description to be vetted KEYS: Add an RCU payload dereference macro AppArmor: Cleanup make file to remove cruft and make it easier to read SELinux: implement the new sb_remount LSM hook LSM: Pass -o remount options to the LSM SELinux: Compute SID for the newly created socket SELinux: Socket retains creator role and MLS attribute SELinux: Auto-generate security_is_socket_class TOMOYO: Fix memory leak upon file open. Revert "selinux: simplify ioctl checking" selinux: drop unused packet flow permissions selinux: Fix packet forwarding checks on postrouting selinux: Fix wrong checks for selinux_policycap_netpeer selinux: Fix check for xfrm selinux context algorithm ima: remove unnecessary call to ima_must_measure IMA: remove IMA imbalance checking ...	2011-03-16 09:15:43 -07:00
Linus Torvalds	bd2895eead	Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix build failure introduced by s/freezeable/freezable/ workqueue: add system_freezeable_wq rds/ib: use system_wq instead of rds_ib_fmr_wq net/9p: replace p9_poll_task with a work net/9p: use system_wq instead of p9_mux_wq xfs: convert to alloc_workqueue() reiserfs: make commit_wq use the default concurrency level ocfs2: use system_wq instead of ocfs2_quota_wq ext4: convert to alloc_workqueue() scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path scsi/be2iscsi,qla2xxx: convert to alloc_workqueue() misc/iwmc3200top: use system_wq instead of dedicated workqueues i2o: use alloc_workqueue() instead of create_workqueue() acpi: kacpi*_wq don't need WQ_MEM_RECLAIM fs/aio: aio_wq isn't used in memory reclaim path input/tps6507x-ts: use system_wq instead of dedicated workqueue cpufreq: use system_wq instead of dedicated workqueues wireless/ipw2x00: use system_wq instead of dedicated workqueues arm/omap: use system_wq in mailbox workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER	2011-03-16 08:20:19 -07:00
Aneesh Kumar K.V	5fe0c23788	exportfs: Return the minimum required handle size The exportfs encode handle function should return the minimum required handle size. This helps user to find out the handle size by passing 0 handle size in the first step and then redoing to the call again with the returned handle size value. Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:28 -04:00
Alex Elder	0c9ba97318	xfs: don't name variables "panic" The new xfs_alert_tag() used a variable named "panic", and that is to be avoided. Rename it. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-03-11 16:34:51 -06:00
Jens Axboe	4c63f5646e	Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/core Conflicts: block/blk-core.c block/blk-flush.c drivers/md/raid1.c drivers/md/raid10.c drivers/md/raid5.c fs/nilfs2/btnode.c fs/nilfs2/mdt.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:58:35 +01:00
Jens Axboe	721a9602e6	block: kill off REQ_UNPLUG With the plugging now being explicitly controlled by the submitter, callers need not pass down unplugging hints to the block layer. If they want to unplug, it's because they manually plugged on their own - in which case, they should just unplug at will. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:52:27 +01:00
Jens Axboe	7eaceaccab	block: remove per-queue plugging Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:52:07 +01:00
Christoph Hellwig	ecb6928fcf	xfs: factor agf counter updates into a helper Updating the AGF and transactions counters is duplicated between allocating and freeing extents. Factor the code into a common helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-03-09 08:23:47 -06:00
Christoph Hellwig	86fa8af69d	xfs: clean up the xfs_alloc_compute_aligned calling convention Pass a xfs_alloc_arg structure to xfs_alloc_compute_aligned and derive the alignment and minlen paramters from it. This cleans up the existing callers, and we'll need even more information from the xfs_alloc_arg in subsequent patches. Based on a patch from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-03-09 08:23:33 -06:00
James Morris	fe3fa43039	Merge branch 'master' of git://git.infradead.org/users/eparis/selinux into next	2011-03-08 11:38:10 +11:00
Dave Chinner	9130090b5f	xfs: kill support/debug.[ch] The remaining functionality in debug.[ch] is effectively just assert handling, conditional debug definitions and hex dumping. The hex dumping and assert function can be moved into the new printk module, while the rest can be moved into top-level header files. This allows fs/xfs/support/debug.[ch] to be completely removed from the codebase. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-03-07 10:09:35 +11:00
Dave Chinner	0b932cccbd	xfs: Convert remaining cmn_err() callers to new API Once converted, kill the remainder of the cmn_err() interface. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-03-07 10:08:35 +11:00
Dave Chinner	8221112b43	xfs: convert the quota debug prints to new API Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-03-07 10:07:35 +11:00
Dave Chinner	6d4a8ecb34	xfs: rename xfs_cmn_err_fsblock_zero() The "cmn_err" part of the function name is no longer relevant. Rename the function to xfs_alert_fsblock_zero() to match the new logging API. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-03-07 10:06:35 +11:00
Dave Chinner	5348778699	xfs: convert xfs_fs_cmn_err to new error logging API Continue to clean up the error logging code by converting all the callers of xfs_fs_cmn_err() to the new API. Once done, remove the unused old API function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-03-07 10:05:35 +11:00
Dave Chinner	af34e09da4	xfs: kill xfs_fs_mount_cmn_err() macro The xfs_fs_mount_cmn_err() hides a simple check as to whether the mount path should output an error or not. Remove the macro and open code the check. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-03-07 10:04:35 +11:00
Dave Chinner	65333b4c3d	xfs: kill xfs_fs_repair_cmn_err() macro In certain cases of inode corruption, the xfs_fs_repair_cmn_err() macro is used to output an extra message in the corruption report. That extra message is "unmount and run xfs_repair", which really applies to any corruption report. Each case that this macro is called (except one) a following call to xfs_corruption_error() is made to optionally dump more information about the error. Hence, move the output of "run xfs_repair" to xfs_corruption_error() so that it is output on all corruption reports. Also, convert the callers of the repair macro that don't call xfs_corruption_error() to call it, hence provide consiѕtent error reporting for all cases where xfs_fs_repair_cmn_err() used to be called. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-03-07 10:03:35 +11:00
Dave Chinner	6a19d9393a	xfs: convert xfs_cmn_err to xfs_alert_tag Continue the conversion of the old cmn_err interface be converting all the conditional panic tag errors to xfs_alert_tag() and then removing xfs_cmn_err(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-03-07 10:02:35 +11:00
Dave Chinner	a0fa2b679e	xfs: Convert xlog_warn to new logging interface Convert the xfs log operations to use the new error logging interfaces. This removes the xlog_{warn,panic} wrappers and makes almost all errors emit the device they belong to instead of just refering to "XFS". Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-03-07 10:01:35 +11:00
Dave Chinner	4f10700a2e	xfs: Convert linux-2.6/ files to new logging interface Convert the files in fs/xfs/linux-2.6/ to use the new xfs_<level> logging format that replaces the old Irix inherited cmn_err() interfaces. While there, also convert naked printk calls to use the relevant xfs logging function to standardise output format. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-03-07 10:00:35 +11:00
Alex Elder	af24ee9ea8	xfs: zero proper structure size for geometry calls Commit `493f3358cb` added this call to xfs_fs_geometry() in order to avoid passing kernel stack data back to user space: + memset(geo, 0, sizeof(*geo)); Unfortunately, one of the callers of that function passes the address of a smaller data type, cast to fit the type that xfs_fs_geometry() requires. As a result, this can happen: Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: f87aca93 Pid: 262, comm: xfs_fsr Not tainted 2.6.38-rc6-493f3358cb2+ #1 Call Trace: [<c12991ac>] ? panic+0x50/0x150 [<c102ed71>] ? __stack_chk_fail+0x10/0x18 [<f87aca93>] ? xfs_ioc_fsgeometry_v1+0x56/0x5d [xfs] Fix this by fixing that one caller to pass the right type and then copy out the subset it is interested in. Note: This patch is an alternative to one originally proposed by Eric Sandeen. Reported-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu> Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Tested-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>	2011-03-01 21:21:13 -06:00
Dave Chinner	10e38391c0	xfs: introduce new logging API. Most of the logging infrastructure in XFS is unneccessary and designed around the infrastructure supplied by Irix rather than Linux. To rationalise the logging interfaces, start by introducing simple printk wrappers similar to the dev_printk() infrastructure. Later patches will convert code to use this new interface. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-03-02 14:20:59 +11:00
Alex Elder	eeb2036b8a	xfs: zero proper structure size for geometry calls Commit `493f3358cb` added this call to xfs_fs_geometry() in order to avoid passing kernel stack data back to user space: + memset(geo, 0, sizeof(*geo)); Unfortunately, one of the callers of that function passes the address of a smaller data type, cast to fit the type that xfs_fs_geometry() requires. As a result, this can happen: Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: f87aca93 Pid: 262, comm: xfs_fsr Not tainted 2.6.38-rc6-493f3358cb2+ #1 Call Trace: [<c12991ac>] ? panic+0x50/0x150 [<c102ed71>] ? __stack_chk_fail+0x10/0x18 [<f87aca93>] ? xfs_ioc_fsgeometry_v1+0x56/0x5d [xfs] Fix this by fixing that one caller to pass the right type and then copy out the subset it is interested in. Note: This patch is an alternative to one originally proposed by Eric Sandeen. Reported-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu> Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Tested-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>	2011-03-01 21:19:59 -06:00
Christoph Hellwig	20ad9ea9be	xfs: enable delaylog by default Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-02-22 20:33:25 -06:00
Christoph Hellwig	ec3ba85f40	xfs: more sensible inode refcounting for ialloc Currently we return iodes from xfs_ialloc with just a single reference held. But we need two references, as one is dropped during transaction commit and the second needs to be transfered to the VFS. Change xfs_ialloc to use xfs_iget plus xfs_trans_ijoin_ref to grab two references to the inode, and remove the now superflous IHOLD calls from all callers. This also greatly simplifies the error handling in xfs_create and also allow to remove xfs_trans_iget as no other callers are left. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-02-22 20:32:28 -06:00
Christoph Hellwig	1050c71e29	xfs: stop using xfs_trans_iget in the RT allocator During mount we establish references to the RT inodes, which we keep for the lifetime of the filesystem. Instead of using xfs_trans_iget to grab additional references when adding RT inodes to transactions use the combination of xfs_ilock and xfs_trans_ijoin_ref, which archives the same end result with less overhead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-02-22 20:30:21 -06:00
Lukas Czerner	be715140b5	xfs: check if device support discard in xfs_ioc_trim() Right now we, are relying on the fact that when we attempt to actually do the discard, blkdev_issue_discar() returns -EOPNOTSUPP and the user is informed that the device does not support discard. However, in the case where the we do not hit any suitable free extent to trim in FITRIM code, it will finish without any error. This is very confusing, because it seems that FITRIM was successful even though the device does not actually supports discard. Solution: Check for the discard support before attempt to search for free extents. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-02-22 15:08:44 -06:00
Dan Rosenberg	3a3675b7f2	xfs: prevent leaking uninitialized stack memory in FSGEOMETRY_V1 The FSGEOMETRY_V1 ioctl (and its compat equivalent) calls out to xfs_fs_geometry() with a version number of 3. This code path does not fill in the logsunit member of the passed xfs_fsop_geom_t, leading to the leaking of four bytes of uninitialized stack data to potentially unprivileged callers. v2 switches to memset() to avoid future issues if structure members change, on suggestion of Dave Chinner. Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com> Reviewed-by: Eugene Teo <eugeneteo@kernel.org> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-02-22 15:06:47 -06:00
Lukas Czerner	5d15765594	xfs: check if device support discard in xfs_ioc_trim() Right now we, are relying on the fact that when we attempt to actually do the discard, blkdev_issue_discar() returns -EOPNOTSUPP and the user is informed that the device does not support discard. However, in the case where the we do not hit any suitable free extent to trim in FITRIM code, it will finish without any error. This is very confusing, because it seems that FITRIM was successful even though the device does not actually supports discard. Solution: Check for the discard support before attempt to search for free extents. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-02-21 20:39:00 -06:00
Dan Rosenberg	c4d0c3b097	xfs: prevent leaking uninitialized stack memory in FSGEOMETRY_V1 The FSGEOMETRY_V1 ioctl (and its compat equivalent) calls out to xfs_fs_geometry() with a version number of 3. This code path does not fill in the logsunit member of the passed xfs_fsop_geom_t, leading to the leaking of four bytes of uninitialized stack data to potentially unprivileged callers. v2 switches to memset() to avoid future issues if structure members change, on suggestion of Dave Chinner. Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com> Reviewed-by: Eugene Teo <eugeneteo@kernel.org> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-02-21 19:55:47 -06:00
Tejun Heo	43d133c18b	Merge branch 'master' into for-2.6.39	2011-02-21 09:43:56 +01:00
Christoph Hellwig	9681153b46	xfs: add lockdep annotations for the rt inodes The rt bitmap and summary inodes do not participate in the normal inode locking protocol. Instead the rt bitmap inode can be locked in any transaction involving rt allocations, and the both of the rt inodes can be locked at the same time. Add specific lockdep subclasses for the rt inodes to prevent lockdep from blowing up. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-02-07 13:29:18 -06:00
Christoph Hellwig	0d8b30ad19	xfs: fix xfs_get_extsz_hint for a zero extent size hint We can easily set the extsize flag without setting an extent size hint, or one that evaluates to zero. Historically the di_extsize field was only used when it was non-zero, but the commit "Cleanup inode extent size hint extraction" broke this. Restore the old behaviour, thus fixing xfsqa 090 with a debug kernel. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-02-07 13:29:14 -06:00
Christoph Hellwig	04e99455ea	xfs: only lock the rt bitmap inode once per allocation Currently both xfs_rtpick_extent and xfs_rtallocate_extent call xfs_trans_iget to grab and lock the rt bitmap inode, which results in a deadlock since the removal of the lock recursion counters in commit "xfs: simplify inode to transaction joining" Fix this by acquiring and locking the inode in xfs_bmap_rtalloc before calling into xfs_rtpick_extent and xfs_rtallocate_extent. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-02-07 13:29:06 -06:00
Eric Paris	2a7dba391e	fs/vfs/security: pass last path component to LSM on inode creation SELinux would like to implement a new labeling behavior of newly created inodes. We currently label new inodes based on the parent and the creating process. This new behavior would also take into account the name of the new object when deciding the new label. This is not the (supposed) full path, just the last component of the path. This is very useful because creating /etc/shadow is different than creating /etc/passwd but the kernel hooks are unable to differentiate these operations. We currently require that userspace realize it is doing some difficult operation like that and than userspace jumps through SELinux hoops to get things set up correctly. This patch does not implement new behavior, that is obviously contained in a seperate SELinux patch, but it does pass the needed name down to the correct LSM hook. If no such name exists it is fine to pass NULL. Signed-off-by: Eric Paris <eparis@redhat.com>	2011-02-01 11:12:29 -05:00
Tejun Heo	83e759043a	xfs: convert to alloc_workqueue() Convert from create[_singlethread]_workqueue() to alloc_workqueue(). * xfsdatad_workqueue and xfsconvertd_workqueue are identity converted. Using higher concurrency limit might be useful but given the complexity of workqueue usage in xfs, proceeding cautiously seems better. * xfs_mru_reap_wq is converted to non-ordered workqueue with max concurrency of 1 as the work items don't require any specific ordering and already have proper synchronization. It seems it was singlethreaded to save worker threads, which is no longer a concern. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Alex Elder <aelder@sgi.com> Cc: xfs-masters@oss.sgi.com Cc: Christoph Hellwig <hch@infradead.org>	2011-02-01 11:42:43 +01:00
bpm@sgi.com	24446fc66f	xfs: xfs_bmap_add_extent_delay_real should init br_startblock When filling in the middle of a previous delayed allocation in xfs_bmap_add_extent_delay_real, set br_startblock of the new delay extent to the right to nullstartblock instead of 0 before inserting the extent into the ifork (xfs_iext_insert), rather than setting br_startblock afterward. Adding the extent into the ifork with br_startblock=0 can lead to the extent being copied into the btree by xfs_bmap_extent_to_btree if we happen to convert from extents format to btree format before updating br_startblock with the correct value. The unexpected addition of this delay extent to the btree can cause subsequent XFS_WANT_CORRUPTED_GOTO filesystem shutdown in several xfs_bmap_add_extent_delay_real cases where we are converting a delay extent to real and unexpectedly find an extent already inserted. For example: 911 case BMAP_LEFT_FILLING: 912 /* 913 * Filling in the first part of a previous delayed allocation. 914 * The left neighbor is not contiguous. 915 */ 916 trace_xfs_bmap_pre_update(ip, idx, state, _THIS_IP_); 917 xfs_bmbt_set_startoff(ep, new_endoff); 918 temp = PREV.br_blockcount - new->br_blockcount; 919 xfs_bmbt_set_blockcount(ep, temp); 920 xfs_iext_insert(ip, idx, 1, new, state); 921 ip->i_df.if_lastex = idx; 922 ip->i_d.di_nextents++; 923 if (cur == NULL) 924 rval = XFS_ILOG_CORE \| XFS_ILOG_DEXT; 925 else { 926 rval = XFS_ILOG_CORE; 927 if ((error = xfs_bmbt_lookup_eq(cur, new->br_startoff, 928 new->br_startblock, new->br_blockcount, 929 &i))) 930 goto done; 931 XFS_WANT_CORRUPTED_GOTO(i == 0, done); With the bogus extent in the btree we shutdown the filesystem at 931. The conversion from extents to btree format happens when the number of extents in the inode increases above ip->i_df.if_ext_max. xfs_bmap_extent_to_btree copies extents from the ifork into the btree, ignoring all delalloc extents which are denoted by br_startblock having some value of nullstartblock. SGI-PV: 1013221 Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-01-28 09:13:29 -06:00
Dave Chinner	0fbca4d1c3	xfs: fix dquot shaker deadlock Commit `368e136` ("xfs: remove duplicate code from dquot reclaim") fails to unlock the dquot freelist when the number of loop restarts is exceeded in xfs_qm_dqreclaim_one(). This causes hangs in memory reclaim. Rework the loop control logic into an unwind stack that all the different cases jump into. This means there is only one set of code that processes the loop exit criteria, and simplifies the unlocking of all the items from different points in the loop. It also fixes a double increment of the restart counter from the qi_dqlist_lock case. Reported-by: Malcolm Scott <lkml@malc.org.uk> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-01-28 09:05:36 -06:00
Dave Chinner	c6f990d1ff	xfs: handle CIl transaction commit failures correctly Failure to commit a transaction into the CIL is not handled correctly. This currently can only happen when racing with a shutdown and requires an explicit shutdown check, so it rare and can be avoided. Remove the shutdown check and make the CIL commit a void function to indicate it will always succeed, thereby removing the incorrectly handled failure case. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-01-28 09:05:36 -06:00
Dave Chinner	5315837dae	xfs: limit extsize to size of AGs and/or MAXEXTLEN The extent size hint can be set to larger than an AG. This means that the alignment process can push the range to be allocated outside the bounds of the AG, resulting in assert failures or corrupted bmbt records. Similarly, if the extsize is larger than the maximum extent size supported, the alignment process will produce extents that are too large to fit into the bmbt records, resulting in a different type of assert/corruption failure. Fix this by limiting extsize at the time іt is set firstly to be less than MAXEXTLEN, then to be a maximum of half the size of the AGs in the filesystem for non-realtime inodes. Realtime inodes do not allocate out of AGs, so don't have to be restricted by the size of AGs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-01-28 09:05:36 -06:00
Dave Chinner	4ce159890c	xfs: prevent extsize alignment from exceeding maximum extent size When doing delayed allocation, if the allocation size is for a maximally sized extent, extent size alignment can push it over this limit. This results in an assert failure in xfs_bmbt_set_allf() as the extent length is too large to find in the extent record. Fix this by ensuring that we allow for space that extent size alignment requires (up to 2 * (extsize -1) blocks as we have to handle both head and tail alignment) when limiting the maximum size of the extent. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-01-28 09:05:36 -06:00
Dave Chinner	14b064ceaa	xfs: limit extent length for allocation to AG size Delayed allocation extents can be larger than AGs, so when trying to convert a large range we may scan every AG inside xfs_bmap_alloc_nullfb() trying to find an AG with a size larger than an AG. We should stop when we find the first AG with a maximum possible allocation size. This causes excessive CPU usage when there are lots of AGs. The same problem occurs when doing preallocation of a range larger than an AG. Fix the problem by limiting real allocation lengths to the maximum that an AG can support. This means if we have empty AGs, we'll stop the search at the first of them. If there are no empty AGs, we'll still scan them all, but that is a different problem.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-01-28 09:05:35 -06:00
Dave Chinner	b8fc82630a	xfs: speculative delayed allocation uses rounddown_power_of_2 badly rounddown_power_of_2() returns an undefined result when passed a value of zero. The specualtive delayed allocation code is doing this when the inode is zero length. Hence occasionally the preallocation is much, much larger than is necessary (e.g. 8GB for a 270 _byte_ file). Ensure we don't even pass a zero value to this function so the result of preallocation is always the desired size. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-01-28 09:05:35 -06:00
Dave Chinner	e34a314c5e	xfs: fix efi item leak on forced shutdown After test 139, kmemleak shows: unreferenced object 0xffff880078b405d8 (size 400): comm "xfs_io", pid 4904, jiffies 4294909383 (age 1186.728s) hex dump (first 32 bytes): 60 c1 17 79 00 88 ff ff 60 c1 17 79 00 88 ff ff `..y....`..y.... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff81afb04d>] kmemleak_alloc+0x2d/0x60 [<ffffffff8115c6cf>] kmem_cache_alloc+0x13f/0x2b0 [<ffffffff814aaa97>] kmem_zone_alloc+0x77/0xf0 [<ffffffff814aab2e>] kmem_zone_zalloc+0x1e/0x50 [<ffffffff8147cd6b>] xfs_efi_init+0x4b/0xb0 [<ffffffff814a4ee8>] xfs_trans_get_efi+0x58/0x90 [<ffffffff81455fab>] xfs_bmap_finish+0x8b/0x1d0 [<ffffffff814851b4>] xfs_itruncate_finish+0x2c4/0x5d0 [<ffffffff814a970f>] xfs_setattr+0x8df/0xa70 [<ffffffff814b5c7b>] xfs_vn_setattr+0x1b/0x20 [<ffffffff8117dc00>] notify_change+0x170/0x2e0 [<ffffffff81163bf6>] do_truncate+0x66/0xa0 [<ffffffff81163d0b>] sys_ftruncate+0xdb/0xe0 [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff The cause of the leak is that the "remove" parameter of IOP_UNPIN() is never set when a CIL push is aborted. This means that the EFI item is never freed if it was in the push being cancelled. The problem is specific to delayed logging, but has uncovered a couple of problems with the handling of IOP_UNPIN(remove). Firstly, we cannot safely call xfs_trans_del_item() from IOP_UNPIN() in the CIL commit failure path or the iclog write failure path because for delayed loging we have no transaction context. Hence we must only call xfs_trans_del_item() if the log item being unpinned has an active log item descriptor. Secondly, xfs_trans_uncommit() does not handle log item descriptor freeing during the traversal of log items on a transaction. It can reference a freed log item descriptor when unpinning an EFI item. Hence it needs to use a safe list traversal method to allow items to be removed from the transaction during IOP_UNPIN(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-01-28 09:01:33 -06:00
Dave Chinner	7db37c5e65	xfs: fix log ticket leak on forced shutdown. The kmemleak detector shows this after test 139: unreferenced object 0xffff880079b88bb0 (size 264): comm "xfs_io", pid 4904, jiffies 4294909382 (age 276.824s) hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... ff ff ff ff ff ff ff ff 48 7b c9 82 ff ff ff ff ........H{...... backtrace: [<ffffffff81afb04d>] kmemleak_alloc+0x2d/0x60 [<ffffffff8115c6cf>] kmem_cache_alloc+0x13f/0x2b0 [<ffffffff814aaa97>] kmem_zone_alloc+0x77/0xf0 [<ffffffff814aab2e>] kmem_zone_zalloc+0x1e/0x50 [<ffffffff8148f394>] xlog_ticket_alloc+0x34/0x170 [<ffffffff81494444>] xlog_cil_push+0xa4/0x3f0 [<ffffffff81494eca>] xlog_cil_force_lsn+0x15a/0x160 [<ffffffff814933a5>] _xfs_log_force_lsn+0x75/0x2d0 [<ffffffff814a264d>] _xfs_trans_commit+0x2bd/0x2f0 [<ffffffff8148bfdd>] xfs_iomap_write_allocate+0x1ad/0x350 [<ffffffff814ac17f>] xfs_map_blocks+0x21f/0x370 [<ffffffff814ad1b7>] xfs_vm_writepage+0x1c7/0x550 [<ffffffff8112200a>] __writepage+0x1a/0x50 [<ffffffff81122df2>] write_cache_pages+0x1c2/0x4c0 [<ffffffff81123117>] generic_writepages+0x27/0x30 [<ffffffff814aba5d>] xfs_vm_writepages+0x5d/0x80 By inspection, the leak occurs when xlog_write() returns and error and we jump to the abort path without dropping the reference on the active ticket. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-01-27 12:02:00 +11:00
Geert Uytterhoeven	cf78859f52	xfs: Do not name variables "panic" On platforms that call panic() inside their BUG() macro (m68k/sun3, and all platforms that don't set HAVE_ARCH_BUG), compilation fails with: \| fs/xfs/support/debug.c: In function ‘xfs_cmn_err’: \| fs/xfs/support/debug.c:92: error: called object ‘panic’ is not a function as the local variable "panic" conflicts with the "panic()" function. Rename the local variable to resolve this. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-17 12:39:07 -08:00
Christoph Hellwig	2fe17c1075	fallocate should be a file operation Currently all filesystems except XFS implement fallocate asynchronously, while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC I/O we really want our allocation on disk, especially for the !KEEP_SIZE case where we actually grow the file with user-visible zeroes. On the other hand always commiting the transaction is a bad idea for fast-path uses of fallocate like for example in recent Samba versions. Given that block allocation is a data plane operation anyway change it from an inode operation to a file operation so that we have the file structure available that lets us check for O_SYNC. This also includes moving the code around for a few of the filesystems, and remove the already unnedded S_ISDIR checks given that we only wire up fallocate for regular files. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-01-17 02:25:31 -05:00
Christoph Hellwig	64c23e8687	make the feature checks in ->fallocate future proof Instead of various home grown checks that might need updates for new flags just check for any bit outside the mask of the features supported by the filesystem. This makes the check future proof for any newly added flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-01-17 02:25:30 -05:00
Linus Torvalds	7cb3920a65	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: prevent NMI timeouts in cmn_err xfs: Add log level to assertion printk xfs: fix an assignment within an ASSERT() xfs: fix error handling for synchronous writes xfs: add FITRIM support xfs: ensure log covering transactions are synchronous xfs: serialise unaligned direct IOs xfs: factor common write setup code xfs: split buffered IO write path from xfs_file_aio_write xfs: split direct IO write path from xfs_file_aio_write xfs: introduce xfs_rw_lock() helpers for locking the inode xfs: factor post-write newsize updates xfs: factor common post-write isize handling code xfs: ensure sync write errors are returned	2011-01-14 15:24:17 -08:00
Linus Torvalds	275220f0fc	Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits) block: ensure that completion error gets properly traced blktrace: add missing probe argument to block_bio_complete block cfq: don't use atomic_t for cfq_group block cfq: don't use atomic_t for cfq_queue block: trace event block fix unassigned field block: add internal hd part table references block: fix accounting bug on cross partition merges kref: add kref_test_and_get bio-integrity: mark kintegrityd_wq highpri and CPU intensive block: make kblockd_workqueue smarter Revert "sd: implement sd_check_events()" block: Clean up exit_io_context() source code. Fix compile warnings due to missing removal of a 'ret' variable fs/block: type signature of major_to_index(int) to major_to_index(unsigned) block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p) cfq-iosched: don't check cfqg in choose_service_tree() fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors cdrom: export cdrom_check_events() sd: implement sd_check_events() sr: implement sr_check_events() ...	2011-01-13 10:45:01 -08:00
Linus Torvalds	b2034d474b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits) fs: add documentation on fallocate hole punching Gfs2: fail if we try to use hole punch Btrfs: fail if we try to use hole punch Ext4: fail if we try to use hole punch Ocfs2: handle hole punching via fallocate properly XFS: handle hole punching via fallocate properly fs: add hole punching to fallocate vfs: pass struct file to do_truncate on O_TRUNC opens (try #2) fix signedness mess in rw_verify_area() on 64bit architectures fs: fix kernel-doc for dcache::prepend_path fs: fix kernel-doc for dcache::d_validate sanitize ecryptfs ->mount() switch afs move internal-only parts of ncpfs headers to fs/ncpfs switch ncpfs switch 9p pass default dentry_operations to mount_pseudo() switch hostfs switch affs switch configfs ...	2011-01-13 10:27:28 -08:00
Linus Torvalds	008d23e485	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) Documentation/trace/events.txt: Remove obsolete sched_signal_send. writeback: fix global_dirty_limits comment runtime -> real-time ppc: fix comment typo singal -> signal drivers: fix comment typo diable -> disable. m68k: fix comment typo diable -> disable. wireless: comment typo fix diable -> disable. media: comment typo fix diable -> disable. remove doc for obsolete dynamic-printk kernel-parameter remove extraneous 'is' from Documentation/iostats.txt Fix spelling milisec -> ms in snd_ps3 module parameter description Fix spelling mistakes in comments Revert conflicting V4L changes i7core_edac: fix typos in comments mm/rmap.c: fix comment sound, ca0106: Fix assignment to 'channel'. hrtimer: fix a typo in comment init/Kconfig: fix typo anon_inodes: fix wrong function name in comment fix comment typos concerning "consistent" poll: fix a typo in comment ... Fix up trivial conflicts in: - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c) - fs/ext4/ext4.h Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.	2011-01-13 10:05:56 -08:00
Josef Bacik	c25d246715	XFS: handle hole punching via fallocate properly This patch simply allows XFS to handle the hole punching flag in fallocate properly. I've tested this with a little program that does a bunch of random hole punching with FL_KEEP_SIZE and without it to make sure it does the right thing. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-01-12 20:16:43 -05:00
Dave Chinner	73efe4a4dd	xfs: prevent NMI timeouts in cmn_err We currently have a global error message buffer in cmn_err that is protected by a spin lock that disables interrupts. Recently there have been reports of NMI timeouts occurring when the console is being flooded by SCSI error reports due to cmn_err() getting stuck trying to print to the console while holding this lock (i.e. with interrupts disabled). The NMI watchdog is seeing this CPU as non-responding and so is triggering a panic. While the trigger for the reported case is SCSI errors, pretty much anything that spams the kernel log could cause this to occur. Realistically the only reason that we have the intemediate message buffer is to prepend the correct kernel log level prefix to the log message. The only reason we have the lock is to protect the global message buffer and the only reason the message buffer is global is to keep it off the stack. Hence if we can avoid needing a global message buffer we avoid needing the lock, and we can do this with a small amount of cleanup and some preprocessor tricks: 1. clean up xfs_cmn_err() panic mask functionality to avoid needing debug code in xfs_cmn_err() 2. remove the couple of "!" message prefixes that still exist that the existing cmn_err() code steps over. 3. redefine CE_* levels directly to KERN_* 4. redefine cmn_err() and friends to use printk() directly via variable argument length macros. By doing this, we can completely remove the cmn_err() code and the lock that is causing the problems, and rely solely on printk() serialisation to ensure that we don't get garbled messages. A series of followup patches is really needed to clean up all the cmn_err() calls and related messages properly, but that results in a series that is not easily back portable to enterprise kernels. Hence this initial fix is only to address the direct problem in the lowest impact way possible. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-01-12 08:46:41 -06:00
Anton Blanchard	65a84a0f75	xfs: Add log level to assertion printk I received a ppc64 bug report involving xfs but the assertion was filtered out by the console log level. Use KERN_CRIT to ensure it makes it out. Signed-off-by: Anton Blanchard <anton@samba.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-01-11 22:29:46 -06:00
Jesper Juhl	1884bd8354	xfs: fix an assignment within an ASSERT() In fs/xfs/xfs_trans.c::xfs_trans_unreserve_and_mod_sb() at the out: label we have this: ASSERT(error = 0); I believe a comparison was intended, not an assignment. If I'm right, the patch below fixes that up. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-01-11 22:29:13 -06:00
Christoph Hellwig	bfc60177f8	xfs: fix error handling for synchronous writes If we get an IO error on a synchronous superblock write, we attach an error release function to it so that when the last reference goes away the release function is called and the buffer is invalidated and unlocked. The buffer is left locked until the release function is called so that other concurrent users of the buffer will be locked out until the buffer error is fully processed. Unfortunately, for the superblock buffer the filesyetm itself holds a reference to the buffer which prevents the reference count from dropping to zero and the release function being called. As a result, once an IO error occurs on a sync write, the buffer will never be unlocked and all future attempts to lock the buffer will hang. To make matters worse, this problems is not unique to such buffers; if there is a concurrent _xfs_buf_find() running, the lookup will grab a reference to the buffer and then wait on the buffer lock, preventing the reference count from ever falling to zero and hence unlocking the buffer. As such, the whole b_relse function implementation is broken because it cannot rely on the buffer reference count falling to zero to unlock the errored buffer. The synchronous write error path is the only path that uses this callback - it is used to ensure that the synchronous waiter gets the buffer error before the error state is cleared from the buffer by the release function. Given that the only sychronous buffer writes now go through xfs_bwrite and the error path in question can only occur for a write of a dirty, logged buffer, we can move most of the b_relse processing to happen inline in xfs_buf_iodone_callbacks, just like a normal I/O completion. In addition to that we make sure the error is not cleared in xfs_buf_iodone_callbacks, so that xfs_bwrite can reliably check it. Given that xfs_bwrite keeps the buffer locked until it has waited for it and checked the error this allows to reliably propagate the error to the caller, and make sure that the buffer is reliably unlocked. Given that xfs_buf_iodone_callbacks was the only instance of the b_relse callback we can remove it entirely. Based on earlier patches by Dave Chinner and Ajeet Yadav. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Ajeet Yadav <ajeet.yadav.77@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-01-11 20:28:42 -06:00
Christoph Hellwig	a46db60834	xfs: add FITRIM support Allow manual discards from userspace using the FITRIM ioctl. This is not intended to be run during normal workloads, as the freepsace btree walks can cause large performance degradation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-01-11 20:28:29 -06:00
Dave Chinner	c58efdb442	xfs: ensure log covering transactions are synchronous To ensure the log is covered and the filesystem idles correctly, we need to ensure that dummy transactions hit the disk and do not stay pinned in memory. If the superblock is pinned in memory, it can't be flushed so the log covering cannot make progress. The result is dependent on timing - more oftent han not we continue to issues a log covering transaction every 36s rather than idling after ~90s. Fix this by making the log covering transaction synchronous. To avoid additional log force from xfssyncd, make the log covering transaction take the place of the existing log force in the xfssyncd background sync process. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-01-11 20:28:17 -06:00
Alex Elder	92f1c008ae	Merge branch 'master' into for-linus-merged This merge pulls the XFS master branch into the latest Linus master. This results in a merge conflict whose best fix is not obvious. I manually fixed the conflict, in "fs/xfs/xfs_iget.c". Dave Chinner had done work that resulted in RCU freeing of inodes separate from what Nick Piggin had done, and their results differed slightly in xfs_inode_free(). The fix updates Nick's call_rcu() with the use of VFS_I(), while incorporating needed updates to some XFS inode fields implemented in Dave's series. Dave's RCU callback function has also been removed. Signed-off-by: Alex Elder <aelder@sgi.com>	2011-01-10 21:35:55 -06:00
Dave Chinner	eda7798272	xfs: serialise unaligned direct IOs When two concurrent unaligned, non-overlapping direct IOs are issued to the same block, the direct Io layer will race to zero the block. The result is that one of the concurrent IOs will overwrite data written by the other IO with zeros. This is demonstrated by the xfsqa test 240. To avoid this problem, serialise all unaligned direct IOs to an inode with a big hammer. We need a big hammer approach as we need to serialise AIO as well, so we can't just block writes on locks. Hence, the big hammer is calling xfs_ioend_wait() while holding out other unaligned direct IOs from starting. We don't bother trying to serialised aligned vs unaligned IOs as they are overlapping IO and the result of concurrent overlapping IOs is undefined - the result of either IO is a valid result so we let them race. Hence we only penalise unaligned IO, which already has a major overhead compared to aligned IO so this isn't a major problem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-01-11 10:22:40 +11:00
Dave Chinner	4d8d15812f	xfs: factor common write setup code The buffered IO and direct IO write paths share a common set of checks and limiting code prior to issuing the write. Factor that into a common helper function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-01-11 10:23:42 +11:00
Dave Chinner	637bbc75d9	xfs: split buffered IO write path from xfs_file_aio_write Complete the split of the different write IO paths by splitting the buffered IO write path out of xfs_file_aio_write(). This makes the different mechanisms of the write patchs easier to follow. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-01-11 10:17:30 +11:00
Dave Chinner	f0d26e860b	xfs: split direct IO write path from xfs_file_aio_write The current xfs_file_aio_write code is a mess of locking shenanigans to handle the different locking requirements of buffered and direct IO. Start to clean this up by disentangling the direct IO path from the mess. This also removes the failed direct IO fallback path to buffered IO. XFS handles all direct IO cases without needing to fall back to buffered IO, so we can safely remove this unused path. This greatly simplifies the logic and locking needed in the write path. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-01-11 10:15:36 +11:00
Dave Chinner	487f84f3f8	xfs: introduce xfs_rw_lock() helpers for locking the inode We need to obtain the i_mutex, i_iolock and i_ilock during the read and write paths. Add a set of wrapper functions to neatly encapsulate the lock ordering and shared/exclusive semantics to make the locking easier to follow and get right. Note that this changes some of the exclusive locking serialisation in that serialisation will occur against the i_mutex instead of the XFS_IOLOCK_EXCL. This does not change any behaviour, and it is arguably more efficient to use the mutex for such serialisation than the rw_sem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-01-12 11:37:10 +11:00
Dave Chinner	4c5cfd1b41	xfs: factor post-write newsize updates Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-01-11 10:14:16 +11:00
Dave Chinner	edafb6da9a	xfs: factor common post-write isize handling code Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-01-11 10:14:06 +11:00
Dave Chinner	a363f0c203	xfs: ensure sync write errors are returned xfs_file_aio_write() only returns the error from synchronous flushing of the data and inode if error == 0. At the point where error is being checked, it is guaranteed to be > 0. Therefore any errors returned by the data or fsync flush will never be returned. Fix the checks so we overwrite the current error once and only if an error really occurred. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-01-11 10:13:53 +11:00
Linus Torvalds	23d69b09b7	Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits) usb: don't use flush_scheduled_work() speedtch: don't abuse struct delayed_work media/video: don't use flush_scheduled_work() media/video: explicitly flush request_module work ioc4: use static work_struct for ioc4_load_modules() init: don't call flush_scheduled_work() from do_initcalls() s390: don't use flush_scheduled_work() rtc: don't use flush_scheduled_work() mmc: update workqueue usages mfd: update workqueue usages dvb: don't use flush_scheduled_work() leds-wm8350: don't use flush_scheduled_work() mISDN: don't use flush_scheduled_work() macintosh/ams: don't use flush_scheduled_work() vmwgfx: don't use flush_scheduled_work() tpm: don't use flush_scheduled_work() sonypi: don't use flush_scheduled_work() hvsi: don't use flush_scheduled_work() xen: don't use flush_scheduled_work() gdrom: don't use flush_scheduled_work() ... Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c as per Tejun.	2011-01-07 16:58:04 -08:00
Nick Piggin	880566e17c	xfs: provide simple rcu-walk ACL implementation This simple implementation just checks for no ACLs on the inode, and if so, then the rcu-walk may proceed, otherwise fail it. Signed-off-by: Nick Piggin <npiggin@kernel.dk>	2011-01-07 17:50:30 +11:00
Nick Piggin	b74c79e993	fs: provide rcu-walk aware permission i_ops Signed-off-by: Nick Piggin <npiggin@kernel.dk>	2011-01-07 17:50:29 +11:00
Nick Piggin	fa0d7e3de6	fs: icache RCU free inodes RCU free the struct inode. This will allow: - Subsequent store-free path walking patch. The inode must be consulted for permissions when walking, so an RCU inode reference is a must. - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want to take i_lock no longer need to take sb_inode_list_lock to walk the list in the first place. This will simplify and optimize locking. - Could remove some nested trylock loops in dcache code - Could potentially simplify things a bit in VM land. Do not need to take the page lock to follow page->mapping. The downsides of this is the performance cost of using RCU. In a simple creat/unlink microbenchmark, performance drops by about 10% due to inability to reuse cache-hot slab objects. As iterations increase and RCU freeing starts kicking over, this increases to about 20%. In cases where inode lifetimes are longer (ie. many inodes may be allocated during the average life span of a single inode), a lot of this cache reuse is not applicable, so the regression caused by this patch is smaller. The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU, however this adds some complexity to list walking and store-free path walking, so I prefer to implement this at a later date, if it is shown to be a win in real situations. I haven't found a regression in any non-micro benchmark so I doubt it will be a problem. Signed-off-by: Nick Piggin <npiggin@kernel.dk>	2011-01-07 17:50:26 +11:00
Jiri Kosina	4b7bd36470	Merge branch 'master' into for-next Conflicts: MAINTAINERS arch/arm/mach-omap2/pm24xx.c drivers/scsi/bfa/bfa_fcpim.c Needed to update to apply fixes for which the old branch was too outdated.	2010-12-22 18:57:02 +01:00
Dave Chinner	d0eb2f38b2	xfs: convert grant head manipulations to lockless algorithm The only thing that the grant lock remains to protect is the grant head manipulations when adding or removing space from the log. These calculations are already based on atomic variables, so we can already update them safely without locks. However, the grant head manpulations require atomic multi-step calculations to be executed, which the algorithms currently don't allow. To make these multi-step calculations atomic, convert the algorithms to compare-and-exchange loops on the atomic variables. That is, we sample the old value, perform the calculation and use atomic64_cmpxchg() to attempt to update the head with the new value. If the head has not changed since we sampled it, it will succeed and we are done. Otherwise, we rerun the calculation again from a new sample of the head. This allows us to remove the grant lock from around all the grant head space manipulations, and that effectively removes the grant lock from the log completely. Hence we can remove the grant lock completely from the log at this point. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-21 12:29:14 +11:00
Dave Chinner	3f16b98507	xfs: introduce new locks for the log grant ticket wait queues The log grant ticket wait queues are currently protected by the log grant lock. However, the queues are functionally independent from each other, and operations on them only require serialisation against other queue operations now that all of the other log variables they use are atomic values. Hence, we can make them independent of the grant lock by introducing new locks just to protect the lists operations. because the lists are independent, we can use a lock per list and ensure that reserve and write head queuing do not contend. To ensure forced shutdowns work correctly in conjunction with the new fast paths, ensure that we check whether the log has been shut down in the grant functions once we hold the relevant spin locks but before we go to sleep. This is needed to co-ordinate correctly with the wakeups that are issued on the ticket queues so we don't leave any processes sleeping on the queues during a shutdown. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-21 12:29:01 +11:00
Tejun Heo	afe2c511fb	workqueue: convert cancel_rearming_delayed_work[queue]() users to cancel_delayed_work_sync() cancel_rearming_delayed_work[queue]() has been superceded by cancel_delayed_work_sync() quite some time ago. Convert all the in-kernel users. The conversions are completely equivalent and trivial. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Cc: Jeff Garzik <jgarzik@pobox.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: netdev@vger.kernel.org Cc: Anton Vorontsov <cbou@mail.ru> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: xfs-masters@oss.sgi.com Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: netfilter-devel@vger.kernel.org Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: linux-nfs@vger.kernel.org	2010-12-15 10:56:11 +01:00
Christoph Hellwig	05340d4ab2	xfs: log timestamp changes to the source inode in rename Now that we don't mark VFS inodes dirty anymore for internal timestamp changes, but rely on the transaction subsystem to push them out, we need to explicitly log the source inode in rename after updating it's timestamps to make sure the changes actually get forced out by sync/fsync or an AIL push. We already account for the fourth inode in the log reservation, as a rename of directories needs to update the nlink field, so just adding the xfs_trans_log_inode call is enough. This fixes the xfsqa 065 regression introduced by: "xfs: don't use vfs writeback for pure metadata modifications" Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-09 17:07:02 -06:00
Dave Chinner	c8a09ff8ca	xfs: convert log grant heads to atomic variables Convert the log grant heads to atomic64_t types in preparation for converting the accounting algorithms to atomic operations. his patch just converts the variables; the algorithmic changes are in a separate patch for clarity. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-04 00:02:40 +11:00
Dave Chinner	1c3cb9ec07	xfs: convert l_tail_lsn to an atomic variable. log->l_tail_lsn is currently protected by the log grant lock. The lock is only needed for serialising readers against writers, so we don't really need the lock if we make the l_tail_lsn variable an atomic. Converting the l_tail_lsn variable to an atomic64_t means we can start to peel back the grant lock from various operations. Also, provide functions to safely crack an atomic LSN variable into it's component pieces and to recombined the components into an atomic variable. Use them where appropriate. This also removes the need for explicitly holding a spinlock to read the l_tail_lsn on 32 bit platforms. Signed-off-by: Dave Chinner <dchinner@redhat.com>	2010-12-21 12:28:39 +11:00
Dave Chinner	84f3c683c4	xfs: convert l_last_sync_lsn to an atomic variable log->l_last_sync_lsn is updated in only one critical spot - log buffer Io completion - and is protected by the grant lock here. This requires the grant lock to be taken for every log buffer IO completion. Converting the l_last_sync_lsn variable to an atomic64_t means that we do not need to take the grant lock in log buffer IO completion to update it. This also removes the need for explicitly holding a spinlock to read the l_last_sync_lsn on 32 bit platforms. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-03 22:11:29 +11:00
Dave Chinner	2ced19cbae	xfs: make AIL tail pushing independent of the grant lock The xlog_grant_push_ail() currently takes the grant lock internally to sample the tail lsn, last sync lsn and the reserve grant head. Most of the callers already hold the grant lock but have to drop it before calling xlog_grant_push_ail(). This is a left over from when the AIL tail pushing was done in line and hence xlog_grant_push_ail had to drop the grant lock. AIL push is now done in another thread and hence we can safely hold the grant lock over the entire xlog_grant_push_ail call. Push the grant lock outside of xlog_grant_push_ail() to simplify the locking and synchronisation needed for tail pushing. This will reduce traffic on the grant lock by itself, but this is only one step in preparing for the complete removal of the grant lock. While there, clean up the formatting of xlog_grant_push_ail() to match the rest of the XFS code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-21 12:09:20 +11:00
Dave Chinner	eb40a87500	xfs: use wait queues directly for the log wait queues The log grant queues are one of the few places left using sv_t constructs for waiting. Given we are touching this code, we should convert them to plain wait queues. While there, convert all the other sv_t users in the log code as well. Seeing as this removes the last users of the sv_t type, remove the header file defining the wrapper and the fragments that still reference it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-21 12:09:01 +11:00
Dave Chinner	a69ed03c24	xfs: combine grant heads into a single 64 bit integer Prepare for switching the grant heads to atomic variables by combining the two 32 bit values that make up the grant head into a single 64 bit variable. Provide wrapper functions to combine and split the grant heads appropriately for calculations and use them as necessary. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-21 12:08:20 +11:00
Dave Chinner	663e496a72	xfs: rework log grant space calculations The log grant space calculations are repeated for both write and reserve grant heads. To make it simpler to convert the calculations toa different algorithm, factor them so both the gratn heads use the same calculation functions. Once this is done we can drop the wrappers that are used in only a couple of place to update both grant heads at once as they don't provide any particular value. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-21 12:06:05 +11:00
Dave Chinner	3f336c6fa1	xfs: fact out common grant head/log tail verification code Factor repeated debug code out of grant head manipulation functions into a separate function. This removes ifdef DEBUG spagetti from the code and makes the code easier to follow. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-21 12:02:52 +11:00
Dave Chinner	1054794198	xfs: convert log grant ticket queues to list heads The grant write and reserve queues use a roll-your-own double linked list, so convert it to a standard list_head structure and convert all the list traversals to use list_for_each_entry(). We can also get rid of the XLOG_TIC_IN_Q flag as we can use the list_empty() check to tell if the ticket is in a list or not. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-21 12:02:25 +11:00
Dave Chinner	9552e7f2f3	xfs: use AIL bulk delete function to implement single delete We now have two copies of AIL delete operations that are mostly duplicate functionality. The single log item deletes can be implemented via the bulk updates by turning xfs_trans_ail_delete() into a simple wrapper. This removes all the duplicate delete functionality and associated helpers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-20 12:36:15 +11:00
Dave Chinner	e605994929	xfs: use AIL bulk update function to implement single updates We now have two copies of AIL insert operations that are mostly duplicate functionality. The single log item updates can be implemented via the bulk updates by turning xfs_trans_ail_update() into a simple wrapper. This removes all the duplicate insert functionality and associated helpers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-20 12:34:26 +11:00
Dave Chinner	3013683253	xfs: remove all the inodes on a buffer from the AIL in bulk When inode buffer IO completes, usually all of the inodes are removed from the AIL. This involves processing them one at a time and taking the AIL lock once for every inode. When all CPUs are processing inode IO completions, this causes excessive amount sof contention on the AIL lock. Instead, change the way we process inode IO completion in the buffer IO done callback. Allow the inode IO done callback to walk the list of IO done callbacks and pull all the inodes off the buffer in one go and then process them as a batch. Once all the inodes for removal are collected, take the AIL lock once and do a bulk removal operation to minimise traffic on the AIL lock. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-20 12:03:17 +11:00
Dave Chinner	c90821a26a	xfs: consume iodone callback items on buffers as they are processed To allow buffer iodone callbacks to consume multiple items off the callback list, first we need to convert the xfs_buf_do_callbacks() to consume items and always pull the next item from the head of the list. The means the item list walk is never dependent on knowing the next item on the list and hence allows callbacks to remove items from the list as well. This allows callbacks to do bulk operations by scanning the list for identical callbacks, consuming them all and then processing them in bulk, negating the need for multiple callbacks of that type. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-03 17:00:52 +11:00
Dave Chinner	e677d0f954	xfs: reduce the number of AIL push wakeups The xfaild often tries to rest to wait for congestion to pass of for IO to complete, but is regularly woken in tail-pushing situations. In severe cases, the xfsaild is getting woken tens of thousands of times a second. Reduce the number needless wakeups by only waking the xfsaild if the new target is larger than the old one. Further make short sleeps uninterruptible as they occur when the xfsaild has decided it needs to back off to allow some IO to complete and being woken early is counter-productive. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-17 20:08:04 +11:00
Dave Chinner	0e57f6a36f	xfs: bulk AIL insertion during transaction commit When inserting items into the AIL from the transaction committed callbacks, we take the AIL lock for every single item that is to be inserted. For a CIL checkpoint commit, this can be tens of thousands of individual inserts, yet almost all of the items will be inserted at the same point in the AIL because they have the same index. To reduce the overhead and contention on the AIL lock for such operations, introduce a "bulk insert" operation which allows a list of log items with the same LSN to be inserted in a single operation via a list splice. To do this, we need to pre-sort the log items being committed into a temporary list for insertion. The complexity is that not every log item will end up with the same LSN, and not every item is actually inserted into the AIL. Items that don't match the commit LSN will be inserted and unpinned as per the current one-at-a-time method (relatively rare), while items that are not to be inserted will be unpinned and freed immediately. Items that are to be inserted at the given commit lsn are placed in a temporary array and inserted into the AIL in bulk each time the array fills up. As a result of this, we trade off AIL hold time for a significant reduction in traffic. lock_stat output shows that the worst case hold time is unchanged, but contention from AIL inserts drops by an order of magnitude and the number of lock traversal decreases significantly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-20 12:02:19 +11:00
Dave Chinner	eb3efa1249	xfs: clean up xfs_ail_delete() xfs_ail_delete() has a needlessly complex interface. It returns the log item that was passed in for deletion (which the callers then assert is identical to the one passed in), and callers of xfs_ail_delete() still need to invalidate current traversal cursors. Make xfs_ail_delete() return void, move the cursor invalidation inside it, and clean up the callers just to use the log item pointer they passed in. While cleaning up, remove the messy and unnecessary "/* ARGUSED */" comments around all these functions. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-03 16:42:57 +11:00
Dave Chinner	b199c8a4ba	xfs: Pull EFI/EFD handling out from under the AIL lock EFI/EFD interactions are protected from races by the AIL lock. They are the only type of log items that require the the AIL lock to serialise internal state, so they need to be separated from the AIL lock before we can do bulk insert operations on the AIL. To acheive this, convert the counter of the number of extents in the EFI to an atomic so it can be safely manipulated by EFD processing without locks. Also, convert the EFI state flag manipulations to use atomic bit operations so no locks are needed to record state changes. Finally, use the state bits to determine when it is safe to free the EFI and clean up the code to do this neatly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-20 11:59:49 +11:00
Dave Chinner	9c5f8414ef	xfs: fix EFI transaction cancellation. XFS_EFI_CANCELED has not been set in the code base since xfs_efi_cancel() was removed back in 2006 by commit `065d312e15` ("[XFS] Remove unused iop_abort log item operation), and even then xfs_efi_cancel() was never called. I haven't tracked it back further than that (beyond git history), but it indicates that the handling of EFIs in cancelled transactions has been broken for a long time. Basically, when we get an IOP_UNPIN(lip, 1); call from xfs_trans_uncommit() (i.e. remove == 1), if we don't free the log item descriptor we leak it. Fix the behviour to be correct and kill the XFS_EFI_CANCELED flag. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-20 11:57:24 +11:00
Dave Chinner	821eb21d97	xfs: connect up buffer reclaim priority hooks Now that the buffer reclaim infrastructure can handle different reclaim priorities for different types of buffers, reconnect the hooks in the XFS code that has been sitting dormant since it was ported to Linux. This should finally give use reclaim prioritisation that is on a par with the functionality that Irix provided XFS 15 years ago. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-02 16:31:13 +11:00
Dave Chinner	430cbeb86f	xfs: add a lru to the XFS buffer cache Introduce a per-buftarg LRU for memory reclaim to operate on. This is the last piece we need to put in place so that we can fully control the buffer lifecycle. This allows XFS to be responsibile for maintaining the working set of buffers under memory pressure instead of relying on the VM reclaim not to take pages we need out from underneath us. The implementation introduces a b_lru_ref counter into the buffer. This is currently set to 1 whenever the buffer is referenced and so is used to determine if the buffer should be added to the LRU or not when freed. Effectively it allows lazy LRU initialisation of the buffer so we do not need to touch the LRU list and locks in xfs_buf_find(). Instead, when the buffer is being released and we drop the last reference to it, we check the b_lru_ref count and if it is none zero we re-add the buffer reference and add the inode to the LRU. The b_lru_ref counter is decremented by the shrinker, and whenever the shrinker comes across a buffer with a zero b_lru_ref counter, if released the LRU reference on the buffer. In the absence of a lookup race, this will result in the buffer being freed. This counting mechanism is used instead of a reference flag so that it is simple to re-introduce buffer-type specific reclaim reference counts to prioritise reclaim more effectively. We still have all those hooks in the XFS code, so this will provide the infrastructure to re-implement that functionality. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-02 16:30:55 +11:00
Dave Chinner	c76febef57	xfs: only run xfs_error_test if error injection is active Recent tests writing lots of small files showed the flusher thread being CPU bound and taking a long time to do allocations on a debug kernel. perf showed this as the prime reason: samples pcnt function DSO _______ _____ ___________________________ _________________ 224648.00 36.8% xfs_error_test [kernel.kallsyms] 86045.00 14.1% xfs_btree_check_sblock [kernel.kallsyms] 39778.00 6.5% prandom32 [kernel.kallsyms] 37436.00 6.1% xfs_btree_increment [kernel.kallsyms] 29278.00 4.8% xfs_btree_get_rec [kernel.kallsyms] 27717.00 4.5% random32 [kernel.kallsyms] Walking btree blocks during allocation checking them requires each block (a cache hit, so no I/O) call xfs_error_test(), which then does a random32() call as the first operation. IOWs, ~50% of the CPU is being consumed just testing whether we need to inject an error, even though error injection is not active. Kill this overhead when error injection is not active by adding a global counter of active error traps and only calling into xfs_error_test when fault injection is active. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-01 07:40:20 -06:00
Dave Chinner	de25c1818c	xfs: avoid moving stale inodes in the AIL When an inode has been marked stale because the cluster is being freed, we don't want to (re-)insert this inode into the AIL. There is a race condition where the cluster buffer may be unpinned before the inode is inserted into the AIL during transaction committed processing. If the buffer is unpinned before the inode item has been committed and inserted, then it is possible for the buffer to be released and hence processthe stale inode callbacks before the inode is inserted into the AIL. In this case, we then insert a clean, stale inode into the AIL which will never get removed by an IO completion. It will, however, get reclaimed and that triggers an assert in xfs_inode_free() complaining about freeing an inode still in the AIL. This race can be avoided by not moving stale inodes forward in the AIL during transaction commit completion processing. This closes the race condition by ensuring we never insert clean stale inodes into the AIL. It is safe to do this because a dirty stale inode, by definition, must already be in the AIL. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-01 07:40:20 -06:00
Dave Chinner	309c848002	xfs: delayed alloc blocks beyond EOF are valid after writeback There is an assumption in the parts of XFS that flushing a dirty file will make all the delayed allocation blocks disappear from an inode. That is, that after calling xfs_flush_pages() then ip->i_delayed_blks will be zero. This is an invalid assumption as we may have specualtive preallocation beyond EOF and they are recorded in ip->i_delayed_blks. A flush of the dirty pages of an inode will not change the state of these blocks beyond EOF, so a non-zero deeelalloc block count after a flush is valid. The bmap code has an invalid ASSERT() that needs to be removed, and the swapext code has a bug in that while it swaps the data forks around, it fails to swap the i_delayed_blks counter associated with the fork and hence can get the block accounting wrong. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-01 07:40:20 -06:00
Dave Chinner	90810b9e82	xfs: push stale, pinned buffers on trylock failures As reported by Nick Piggin, XFS is suffering from long pauses under highly concurrent workloads when hosted on ramdisks. The problem is that an inode buffer is stuck in the pinned state in memory and as a result either the inode buffer or one of the inodes within the buffer is stopping the tail of the log from being moved forward. The system remains in this state until a periodic log force issued by xfssyncd causes the buffer to be unpinned. The main problem is that these are stale buffers, and are hence held locked until the transaction/checkpoint that marked them state has been committed to disk. When the filesystem gets into this state, only the xfssyncd can cause the async transactions to be committed to disk and hence unpin the inode buffer. This problem was encountered when scaling the busy extent list, but only the blocking lock interface was fixed to solve the problem. Extend the same fix to the buffer trylock operations - if we fail to lock a pinned, stale buffer, then force the log immediately so that when the next attempt to lock it comes around, it will have been unpinned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-01 07:40:20 -06:00
Dave Chinner	c726de4409	xfs: fix failed write truncation handling. Since the move to the new truncate sequence we call xfs_setattr to truncate down excessively instanciated blocks. As shown by the testcase in kernel.org BZ #22452 that doesn't work too well. Due to the confusion of the internal inode size, and the VFS inode i_size it zeroes data that it shouldn't. But full blown truncate seems like overkill here. We only instanciate delayed allocations in the write path, and given that we never released the iolock we can't have converted them to real allocations yet either. The only nasty case is pre-existing preallocation which we need to skip. We already do this for page discard during writeback, so make the delayed allocation block punching a generic function and call it from the failed write path as well as xfs_aops_discard_page. The callers are responsible for ensuring that partial blocks are not truncated away, and that they hold the ilock. Based on a fix originally from Christoph Hellwig. This version used filesystem blocks as the range unit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-01 07:40:19 -06:00
Dave Chinner	ff57ab2199	xfs: convert xfsbud shrinker to a per-buftarg shrinker. Before we introduce per-buftarg LRU lists, split the shrinker implementation into per-buftarg shrinker callbacks. At the moment we wake all the xfsbufds to run the delayed write queues to free the dirty buffers and make their pages available for reclaim. However, with an LRU, we want to be able to free clean, unused buffers as well, so we need to separate the xfsbufd from the shrinker callbacks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-11-30 17:27:57 +11:00
Dave Chinner	1a427ab0c1	xfs: convert pag_ici_lock to a spin lock now that we are using RCU protection for the inode cache lookups, the lock is only needed on the modification side. Hence it is not necessary for the lock to be a rwlock as there are no read side holders anymore. Convert it to a spin lock to reflect it's exclusive nature. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-16 17:08:41 +11:00
Dave Chinner	1a3e8f3da0	xfs: convert inode cache lookups to use RCU locking With delayed logging greatly increasing the sustained parallelism of inode operations, the inode cache locking is showing significant read vs write contention when inode reclaim runs at the same time as lookups. There is also a lot more write lock acquistions than there are read locks (4:1 ratio) so the read locking is not really buying us much in the way of parallelism. To avoid the read vs write contention, change the cache to use RCU locking on the read side. To avoid needing to RCU free every single inode, use the built in slab RCU freeing mechanism. This requires us to be able to detect lookups of freed inodes, so enѕure that ever freed inode has an inode number of zero and the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit lookup path, but also add a check for a zero inode number as well. We canthen convert all the read locking lockups to use RCU read side locking and hence remove all read side locking. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-12-17 17:29:43 +11:00
Dave Chinner	d95b7aaf9a	xfs: rcu free inodes Introduce RCU freeing of XFS inodes so that we can convert lookup traversals to use rcu_read_lock() protection. This patch only introduces the RCU freeing to minimise the potential conflicts with mainline if this is merged into mainline via a VFS patchset. It abuses the i_dentry list for the RCU callback structure because the VFS patches make this a union so it is safe to use like this and simplifies and merge issues. This patch uses basic RCU freeing rather than SLAB_DESTROY_BY_RCU. The later lookup patches need the same "found free inode" protection regardless of the RCU freeing method used, so once again the RCU freeing method can be dealt with apprpriately at merge time without affecting any other code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-12-16 16:41:39 +11:00
Dave Chinner	6e857567db	xfs: don't truncate prealloc from frequently accessed inodes A long standing problem for streaming writeѕ through the NFS server has been that the NFS server opens and closes file descriptors on an inode for every write. The result of this behaviour is that the ->release() function is called on every close and that results in XFS truncating speculative preallocation beyond the EOF. This has an adverse effect on file layout when multiple files are being written at the same time - they interleave their extents and can result in severe fragmentation. To avoid this problem, keep track of ->release calls made on a dirty inode. For most cases, an inode is only going to be opened once for writing and then closed again during it's lifetime in cache. Hence if there are multiple ->release calls when the inode is dirty, there is a good chance that the inode is being accessed by the NFS server. Hence set a flag the first time ->release is called while there are delalloc blocks still outstanding on the inode. If this flag is set when ->release is next called, then do no truncate away the speculative preallocation - leave it there so that subsequent writes do not need to reallocate the delalloc space. This will prevent interleaving of extents of different inodes written concurrently to the same AG. If we get this wrong, it is not a big deal as we truncate speculative allocation beyond EOF anyway in xfs_inactive() when the inode is thrown out of the cache. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-23 12:02:31 +11:00
Dave Chinner	055388a318	xfs: dynamic speculative EOF preallocation Currently the size of the speculative preallocation during delayed allocation is fixed by either the allocsize mount option of a default size. We are seeing a lot of cases where we need to recommend using the allocsize mount option to prevent fragmentation when buffered writes land in the same AG. Rather than using a fixed preallocation size by default (up to 64k), make it dynamic by basing it on the current inode size. That way the EOF preallocation will increase as the file size increases. Hence for streaming writes we are much more likely to get large preallocations exactly when we need it to reduce fragementation. For default settings, the size of the initial extents is determined by the number of parallel writers and the amount of memory in the machine. For 4GB RAM and 4 concurrent 32GB file writes: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576 1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576 2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152 3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304 4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608 5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208 6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088 7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088 and for 16 concurrent 16GB file writes: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144 1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144 2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288 3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576 4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152 5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304 6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608 7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208 Because it is hard to take back specualtive preallocation, cases where there are large slow growing log files on a nearly full filesystem may cause premature ENOSPC. Hence as the filesystem nears full, the maximum dynamic prealloc size іs reduced according to this table (based on 4k block size): freespace max prealloc size >5% full extent (8GB) 4-5% 2GB (8GB >> 2) 3-4% 1GB (8GB >> 3) 2-3% 512MB (8GB >> 4) 1-2% 256MB (8GB >> 5) <1% 128MB (8GB >> 6) This should reduce the amount of space held in speculative preallocation for such cases. The allocsize mount option turns off the dynamic behaviour and fixes the prealloc size to whatever the mount option specifies. i.e. the behaviour is unchanged. Signed-off-by: Dave Chinner <dchinner@redhat.com>	2011-01-04 11:35:03 +11:00
Dave Chinner	622d81494f	xfs: use KM_NOFS for allocations during attribute list operations When listing attributes, we are doiing memory allocations under the inode ilock using only KM_SLEEP. This allows memory allocation to recurse back into the filesystem and do writeback, which may the ilock we already hold on the current inode. THis will deadlock. Hence use KM_NOFS for such allocations outside of transaction context to ensure that reclaim recursion does not occur. Reported-by: Nick Piggin <npiggin@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-23 11:57:37 +11:00
Dave Chinner	dcfcf20512	xfs: provide a inode iolock lockdep class The XFS iolock needs to be re-initialised to a new lock class before it enters reclaim to prevent lockdep false positives. Unfortunately, this is not sufficient protection as inodes in the XFS_IRECLAIMABLE state can be recycled and not re-initialised before being reused. We need to re-initialise the lock state when transfering out of XFS_IRECLAIMABLE state to XFS_INEW, but we need to keep the same class as if the inode was just allocated. Hence we need a specific lockdep class variable for the iolock so that both initialisations use the same class. While there, add a specific class for inodes in the reclaim state so that it is easy to tell from lockdep reports what state the inode was in that generated the report. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-12-23 11:57:13 +11:00
Christoph Hellwig	489a150f64	xfs: factor duplicate code in xfs_alloc_ag_vextent_near into a helper Add a new xfs_alloc_find_best_extent that does a forward/backward search in the allocation btree. That code previously was existed two times in xfs_alloc_ag_vextent_near, once for each search direction. Based on an earlier patch from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:06:15 -06:00
Christoph Hellwig	9f9baab38d	xfs: clean up xfs_alloc_ag_vextent_exact Use a goto label to consolidate all block not found cases, and add a tracepoint for them. Also clean up a few whitespace issues. Based on an earlier patch from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:06:11 -06:00
Christoph Hellwig	ecff71e677	xfs: simplify xfs_map_at_offset Move the buffer locking into the callers as they need to do it wether they call xfs_map_at_offset or not. Remove the b_bdev assignment, which is already done by get_blocks. Remove the duplicate extent type asserts in xfs_convert_page just before calling xfs_map_at_offset. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:06:07 -06:00
Christoph Hellwig	aeea1b1f81	xfs: refactor xfs_vm_writepage After the last patches the code for overwrites is the same as for delayed and unwritten extents except that it doesn't need to call xfs_map_at_offset. Take care of that fact to simplify xfs_vm_writepage. The buffer loop now first checks the type of buffer and checks/sets the ioend type, or continues to the next buffer if it's not interesting to us. Only after that we validate the iomap and perform the block mapping if needed, all in common code for the cases where we have to do work. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:06:03 -06:00
Christoph Hellwig	2fa24f9253	xfs: remove the all_bh flag from xfs_convert_page The all_bh flag is always set when entering the page clustering machinery with a regular written extent, which means the check for it is superflous. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:06:00 -06:00
Christoph Hellwig	ed1e7b7e48	xfs: remove xfs_probe_cluster xfs_map_blocks always calls xfs_bmapi with the XFS_BMAPI_ENTIRE entire flag, which tells it to not cap the extent at the passed in size, but just treat the size as an minimum to map. This means xfs_probe_cluster is entirely useless as we'll always get the whole extent back anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:05:57 -06:00
Christoph Hellwig	8ff2957d58	xfs: simplify xfs_map_blocks No need to lock the extent map exclusive when performing an overwrite, we know the extent map must already have been loaded by get_blocks. Apply the non-blocking inode semantics to all mapping types instead of just delayed allocations. Remove the handling of not yet allocated blocks for the IO_UNWRITTEN case - if an extent is marked as unwritten allocated in the buffer it must already have an extent on disk. Add asserts to verify all the assumptions above in debug builds. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:05:53 -06:00
Christoph Hellwig	a206c817c8	xfs: kill xfs_iomap Opencode the xfs_iomap code in it's two callers. The overlap of passed flags already was minimal and will be further reduced in the next patch. As a side effect the BMAPI_* flags for xfs_bmapi and the IO_* flags for I/O end processing are merged into a single set of flags, which should be a bit more descriptive of the operation we perform. Also improve the tracing by giving each caller it's own type set of tracepoints. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:05:51 -06:00
Christoph Hellwig	405f804294	xfs: cleanup the xfs_iomap_write_* helpers Remove passing the BMAPI_* flags to these helpers, in xfs_iomap_write_direct the check BMAPI_DIRECT was always true, and in the xfs_iomap_write_delay path is was never checked at all. Remove the nmap return value as we never make use of it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:05:47 -06:00
Christoph Hellwig	6ac7248ec5	xfs: a few small tweaks for overwrites in xfs_vm_writepage Don't trylock the buffer. We are the only one ever locking it for a regular file address space, and trylock was only copied from the generic code which did it due to the old buffer based writeout in jbd. Also make sure to only write out the buffer if the iomap actually is valid, because we wouldn't have a proper mapping otherwise. In practice we will never get an invalid mapping here as the page lock guarantees truncate doesn't race with us, but better be safe than sorry. Also make sure we allocate a new ioend when crossing boundaries between mappings, just like we do for delalloc and unwritten extents. Again this currently doesn't matter as the I/O end handler only cares for the boundaries for unwritten extents, but this makes the code fully correct and the same as for delalloc/unwritten extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:05:44 -06:00
Christoph Hellwig	221cb2517e	xfs: remove some dead bio handling code We'll never have BIO_EOPNOTSUPP set after calling submit_bio as this can only happen for discards, and used to happen for barriers, none of which is every submitted by xfs_submit_ioend_bio. Also remove the loop around bio_alloc as it will never fail due to it's mempool backing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:05:40 -06:00
Christoph Hellwig	85da94c6b4	xfs: improve mapping type check in xfs_vm_writepage Currently we only refuse a "read-only" mapping for writing out unwritten and delayed buffers, and refuse any other for overwrites. Improve the checks to require delalloc mappings for delayed buffers, and unwritten extent mappings for unwritten extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:05:34 -06:00
Christoph Hellwig	c9f71f5fc4	xfs: untangle phase1 vs phase2 recovery helpers Dispatch to a different helper for phase1 vs phase2 in xlog_recover_commit_trans instead of doing it in all the low-level functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:05:30 -06:00
Christoph Hellwig	d045094864	xfs: refactor xlog_recover_commit_trans Merge the call to xlog_recover_reorder_trans and the loop over the recovery items from xlog_recover_do_trans into xlog_recover_commit_trans, and keep the switch statement over the log item types as a separate helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:05:26 -06:00
Christoph Hellwig	d5689eaa0a	xfs: use struct list_head for the buf cancel table Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:05:22 -06:00
Christoph Hellwig	e2714bf8d5	xfs: remove leftovers of old buffer log items in recovery code XFS used to support different types of buffer log items long time ago. Remove the switch statements checking the log item type in various buffer recovery helpers that were left over from those days and the rather useless xlog_recover_do_buffer_pass2 wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:05:16 -06:00
Samuel Kvasnica	576ecb8e2b	xfs: fix exporting with left over 64-bit inodes We now support mounting and using filesystems with 64-bit inodes even when not mounted with the inode64 option (which now only controls if we allocate new inodes in that space or not). Make sure we always use large NFS file handles when exporting a filesystem that may contain 64-bit inodes. Note that this only affects newly generated file handles, any outstanding 32-bit file handle is still accepted. [hch: the comment and commit log are mine, the rest is from a patch snipplet from Samuel] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-12-16 16:04:55 -06:00
Jens Axboe	f30195c502	Merge branch 'cleanup-bd_claim' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into for-2.6.38/core	2010-11-27 19:49:18 +01:00
Tejun Heo	d4d7762995	block: clean up blkdev_get() wrappers and their users After recent blkdev_get() modifications, open_by_devnum() and open_bdev_exclusive() are simple wrappers around blkdev_get(). Replace them with blkdev_get_by_dev() and blkdev_get_by_path(). blkdev_get_by_dev() is identical to open_by_devnum(). blkdev_get_by_path() is slightly different in that it doesn't automatically add %FMODE_EXCL to @mode. All users are converted. Most conversions are mechanical and don't introduce any behavior difference. There are several exceptions. * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no reason to OR it explicitly on blkdev_put(). * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in sb->s_mode. * With the above changes, sb->s_mode now always should contain FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect errors. The new blkdev_get_*() functions are with proper docbook comments. While at it, add function description to blkdev_get() too. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Neil Brown <neilb@suse.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Joern Engel <joern@lazybastard.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jan Kara <jack@suse.cz> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: reiserfs-devel@vger.kernel.org Cc: xfs-masters@oss.sgi.com Cc: Alexander Viro <viro@zeniv.linux.org.uk>	2010-11-13 11:55:18 +01:00
Tejun Heo	e525fd89d3	block: make blkdev_get/put() handle exclusive access Over time, block layer has accumulated a set of APIs dealing with bdev open, close, claim and release. * blkdev_get/put() are the primary open and close functions. * bd_claim/release() deal with exclusive open. * open/close_bdev_exclusive() are combination of open and claim and the other way around, respectively. * bd_link/unlink_disk_holder() to create and remove holder/slave symlinks. * open_by_devnum() wraps bdget() + blkdev_get(). The interface is a bit confusing and the decoupling of open and claim makes it impossible to properly guarantee exclusive access as in-kernel open + claim sequence can disturb the existing exclusive open even before the block layer knows the current open if for another exclusive access. Reorganize the interface such that, * blkdev_get() is extended to include exclusive access management. @holder argument is added and, if is @FMODE_EXCL specified, it will gain exclusive access atomically w.r.t. other exclusive accesses. * blkdev_put() is similarly extended. It now takes @mode argument and if @FMODE_EXCL is set, it releases an exclusive access. Also, when the last exclusive claim is released, the holder/slave symlinks are removed automatically. * bd_claim/release() and close_bdev_exclusive() are no longer necessary and either made static or removed. * bd_link_disk_holder() remains the same but bd_unlink_disk_holder() is no longer necessary and removed. * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev() and blkdev_get(). It also has an unexpected extra bdev_read_only() test which probably should be moved into blkdev_get(). * open_by_devnum() is modified to take @holder argument and pass it to blkdev_get(). Most of bdev open/close operations are unified into blkdev_get/put() and most exclusive accesses are tested atomically at the open time (as it should). This cleans up code and removes some, both valid and invalid, but unnecessary all the same, corner cases. open_bdev_exclusive() and open_by_devnum() can use further cleanup - rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop special features. Well, let's leave them for another day. Most conversions are straight-forward. drbd conversion is a bit more involved as there was some reordering, but the logic should stay the same. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Brown <neilb@suse.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Acked-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Philipp Reisner <philipp.reisner@linbit.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Alex Elder <aelder@sgi.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: dm-devel@redhat.com Cc: drbd-dev@lists.linbit.com Cc: Leo Chen <leochen@broadcom.com> Cc: Scott Branden <sbranden@broadcom.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Joern Engel <joern@logfs.org> Cc: reiserfs-devel@vger.kernel.org Cc: Alexander Viro <viro@zeniv.linux.org.uk>	2010-11-13 11:55:17 +01:00
Christoph Hellwig	ece413f59f	xfs: remove incorrect assert in xfs_vm_writepage In commit `20cb52ebd1`, titled "xfs: simplify xfs_vm_writepage" I added an assert that any !mapped and uptodate buffers are not dirty. That asserts turns out to trigger a lot when running fsx on filesystems with small block sizes. The reason for that is that the assert is simply incorrect. !mapped and uptodate just mean this buffer covers a hole, and whenever we do a set_page_dirty we mark all blocks in the page dirty, no matter if they have data or not. So remove the assert, and update the comment above the condition to match reality. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-11-10 15:51:10 -06:00
Christoph Hellwig	c6f6cd0608	xfs: use hlist_add_fake XFS does not need it's inodes to actuall be hashed in the VFS inode cache, but we require the inode to be marked hashed for the writeback code to work. Insted of using insert_inode_hash, which requires a second inode_lock roundtrip after the partial merge of the inode scalability patches in 2.6.37-rc simply use the new hlist_add_fake helper to mark it hashed without requiring a lock or touching a global cache line. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-11-10 12:00:48 -06:00
Christoph Hellwig	5d2bf8a55e	xfs: fix a few compiler warnings with CONFIG_XFS_QUOTA=n Andi Kleen reported that gcc-4.5 gives lots of warnings for him inside the XFS code. It turned out most of them are due to the quota stubs beeing macros, and gcc now complaining about macros evaluating to 0 that are not assigned to variables. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-11-10 12:00:48 -06:00
Christoph Hellwig	785ce41805	xfs: tell lockdep about parent iolock usage in filestreams The filestreams code may take the iolock on the parent inode while holding it on a child. This is the only place in XFS where we take both the child and parent iolock, so just telling lockdep about it is enough. The lock flag required for that was already added as part of the ilock lockdep annotations and unused so far. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-11-10 12:00:48 -06:00
Dave Chinner	bfe2741967	xfs: move delayed write buffer trace The delayed write buffer split trace currently issues a trace for every buffer it scans. These buffers are not necessarily queued for delayed write. Indeed, when buffers are pinned, there can be thousands of traces of buffers that aren't actually queued for delayed write and the ones that are are lost in the noise. Move the trace point to record only buffers that are split out for IO to be issued on. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-11-10 12:00:48 -06:00
Dave Chinner	f83282a8ef	xfs: fix per-ag reference counting in inode reclaim tree walking The walk fails to decrement the per-ag reference count when the non-blocking walk fails to obtain the per-ag reclaim lock, leading to an assert failure on debug kernels when unmounting a filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-11-10 12:00:48 -06:00
Kulikov Vasiliy	6762b938ea	xfs: xfs_ioctl: fix information leak to userland al_hreq is copied from userland. If al_hreq.buflen is not properly aligned then xfs_attr_list will ignore the last bytes of kbuf. These bytes are unitialized. It leads to leaking of contents of kernel stack memory. Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-11-10 12:00:47 -06:00
Christoph Hellwig	5d0af85cd0	xfs: remove experimental tag from the delaylog option We promised to do this for 2.6.37, and the code looks stable enough to keep that promise. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-11-10 12:00:47 -06:00
Uwe Kleine-König	b595076a18	tree-wide: fix comment/printk typos "gadget", "through", "command", "maintain", "maintain", "controller", "address", "between", "initiali[zs]e", "instead", "function", "select", "already", "equal", "access", "management", "hierarchy", "registration", "interest", "relative", "memory", "offset", "already", Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-11-01 15:38:34 -04:00
Al Viro	152a083666	new helper: mount_bdev() ... and switch of the obvious get_sb_bdev() users to ->mount() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-10-29 04:16:13 -04:00
Linus Torvalds	7d2f280e75	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (24 commits) quota: Fix possible oops in __dquot_initialize() ext3: Update kernel-doc comments jbd/2: fixed typos ext2: fixed typo. ext3: Fix debug messages in ext3_group_extend() jbd: Convert atomic_inc() to get_bh() ext3: Remove misplaced BUFFER_TRACE() in ext3_truncate() jbd: Fix debug message in do_get_write_access() jbd: Check return value of __getblk() ext3: Use DIV_ROUND_UP() on group desc block counting ext3: Return proper error code on ext3_fill_super() ext3: Remove unnecessary casts on bh->b_data ext3: Cleanup ext3_setup_super() quota: Fix issuing of warnings from dquot_transfer quota: fix dquot_disable vs dquot_transfer race v2 jbd: Convert bitops to buffer fns ext3/jbd: Avoid WARN() messages when failing to write the superblock jbd: Use offset_in_page() instead of manual calculation jbd: Remove unnecessary goto statement jbd: Use printk_ratelimited() in journal_alloc_journal_head() ...	2010-10-27 20:13:18 -07:00
Linus Torvalds	426e1f5cec	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits) split invalidate_inodes() fs: skip I_FREEING inodes in writeback_sb_inodes fs: fold invalidate_list into invalidate_inodes fs: do not drop inode_lock in dispose_list fs: inode split IO and LRU lists fs: switch bdev inode bdi's correctly fs: fix buffer invalidation in invalidate_list fsnotify: use dget_parent smbfs: use dget_parent exportfs: use dget_parent fs: use RCU read side protection in d_validate fs: clean up dentry lru modification fs: split __shrink_dcache_sb fs: improve DCACHE_REFERENCED usage fs: use percpu counter for nr_dentry and nr_dentry_unused fs: simplify __d_free fs: take dcache_lock inside __d_path fs: do not assign default i_ino in new_inode fs: introduce a per-cpu last_ino allocator new helper: ihold() ...	2010-10-26 17:58:44 -07:00
Wu Fengguang	1b430beee5	writeback: remove nonblocking/encountered_congestion references This removes more dead code that was somehow missed by commit `0d99519efe` (writeback: remove unused nonblocking and congestion checks). There are no behavior change except for the removal of two entries from one of the ext4 tracing interface. The nonblocking checks in ->writepages are no longer used because the flusher now prefer to block on get_request_wait() than to skip inodes on IO congestion. The latter will lead to more seeky IO. The nonblocking checks in ->writepage are no longer used because it's redundant with the WB_SYNC_NONE check. We no long set ->nonblocking in VM page out and page migration, because a) it's effectively redundant with WB_SYNC_NONE in current code b) it's old semantic of "Don't get stuck on request queues" is mis-behavior: that would skip some dirty inodes on congestion and page out others, which is unfair in terms of LRU age. Inspired by Christoph Hellwig. Thanks! Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: David Howells <dhowells@redhat.com> Cc: Sage Weil <sage@newdream.net> Cc: Steve French <sfrench@samba.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-10-26 16:52:05 -07:00
Christoph Hellwig	85fe4025c6	fs: do not assign default i_ino in new_inode Instead of always assigning an increasing inode number in new_inode move the call to assign it into those callers that actually need it. For now callers that need it is estimated conservatively, that is the call is added to all filesystems that do not assign an i_ino by themselves. For a few more filesystems we can avoid assigning any inode number given that they aren't user visible, and for others it could be done lazily when an inode number is actually needed, but that's left for later patches. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-10-25 21:26:11 -04:00
Al Viro	7de9c6ee3e	new helper: ihold() Clones an existing reference to inode; caller must already hold one. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-10-25 21:26:11 -04:00
Christoph Hellwig	646ec4615c	fs: remove inode_add_to_list/__inode_add_to_list Split up inode_add_to_list/__inode_add_to_list. Locking for the two lists will be split soon so these helpers really don't buy us much anymore. The __ prefixes for the sb list helpers will go away soon, but until inode_lock is gone we'll need them to distinguish between the locked and unlocked variants. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-10-25 21:26:10 -04:00
Christoph Hellwig	ebdec241d5	fs: kill block_prepare_write __block_write_begin and block_prepare_write are identical except for slightly different calling conventions. Convert all callers to the __block_write_begin calling conventions and drop block_prepare_write. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-10-25 21:18:20 -04:00
Linus Torvalds	5fe3a5ae5c	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (36 commits) xfs: semaphore cleanup xfs: Extend project quotas to support 32bit project ids xfs: remove xfs_buf wrappers xfs: remove xfs_cred.h xfs: remove xfs_globals.h xfs: remove xfs_version.h xfs: remove xfs_refcache.h xfs: fix the xfs_trans_committed xfs: remove unused t_callback field in struct xfs_trans xfs: fix bogus m_maxagi check in xfs_iget xfs: do not use xfs_mod_incore_sb_batch for per-cpu counters xfs: do not use xfs_mod_incore_sb for per-cpu counters xfs: remove XFS_MOUNT_NO_PERCPU_SB xfs: pack xfs_buf structure more tightly xfs: convert buffer cache hash to rbtree xfs: serialise inode reclaim within an AG xfs: batch inode reclaim lookup xfs: implement batched inode lookups for AG walking xfs: split out inode walk inode grabbing xfs: split inode AG walking into separate code for reclaim ...	2010-10-22 17:32:27 -07:00
Linus Torvalds	91b745016c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: remove in_workqueue_context() workqueue: Clarify that schedule_on_each_cpu is synchronous memory_hotplug: drop spurious calls to flush_scheduled_work() shpchp: update workqueue usage pciehp: update workqueue usage isdn/eicon: don't call flush_scheduled_work() from diva_os_remove_soft_isr() workqueue: add and use WQ_MEM_RECLAIM flag workqueue: fix HIGHPRI handling in keep_working() workqueue: add queue_work and activate_work trace points workqueue: prepare for more tracepoints workqueue: implement flush[_delayed]_work_sync() workqueue: factor out start_flush_work() workqueue: cleanup flush/cancel functions workqueue: implement alloc_ordered_workqueue() Fix up trivial conflict in fs/gfs2/main.c as per Tejun	2010-10-22 17:13:10 -07:00
Jens Axboe	fa251f8990	Merge branch 'v2.6.36-rc8' into for-2.6.37/barrier Conflicts: block/blk-core.c drivers/block/loop.c mm/swapfile.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-10-19 09:13:04 +02:00
Thomas Gleixner	a731cd116c	xfs: semaphore cleanup Get rid of init_MUTEX[_LOCKED]() and use sema_init() instead. (Ported to current XFS code by <aelder@sgi.com>.) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:09:09 -05:00
Arkadiusz Mi?kiewicz	6743099ce5	xfs: Extend project quotas to support 32bit project ids This patch adds support for 32bit project quota identifiers. On disk format is backward compatible with 16bit projid numbers. projid on disk is now kept in two 16bit values - di_projid_lo (which holds the same position as old 16bit projid value) and new di_projid_hi (takes existing padding) and converts from/to 32bit value on the fly. xfs_admin (for existing fs), mkfs.xfs (for new fs) needs to be used to enable PROJID32BIT support. Signed-off-by: Arkadiusz Miśkiewicz <arekm@maven.pl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:08:08 -05:00
Christoph Hellwig	1a1a3e97ba	xfs: remove xfs_buf wrappers Stop having two different names for many buffer functions and use the more descriptive xfs_buf_* names directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:08:07 -05:00
Christoph Hellwig	6c77b0ea1b	xfs: remove xfs_cred.h We're not actually passing around credentials inside XFS for a while now, so remove all xfs_cred.h with it's cred_t typedef and all instances of it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:08:06 -05:00
Christoph Hellwig	78a4b0961f	xfs: remove xfs_globals.h This header only provides one extern that isn't actually declared anywhere, and shadowed by a macro. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:08:05 -05:00
Christoph Hellwig	668332e5fe	xfs: remove xfs_version.h It used to have a place when it contained an automatically generated CVS version, but these days it's entirely superflous. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:08:04 -05:00
Christoph Hellwig	1ae4fe6dba	xfs: remove xfs_refcache.h This header has been completely unused for a couple of years. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:08:03 -05:00
Christoph Hellwig	4957a449a1	xfs: fix the xfs_trans_committed Use the correct prototype for xfs_trans_committed instead of casting it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:08:02 -05:00
Christoph Hellwig	dfe188d428	xfs: remove unused t_callback field in struct xfs_trans Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:08:01 -05:00
Christoph Hellwig	d276734d93	xfs: fix bogus m_maxagi check in xfs_iget These days inode64 should only control which AGs we allocate new inodes from, while we still try to support reading all existing inodes. To make this actually work the check ontop of xfs_iget needs to be relaxed to allow inodes in all allocation groups instead of just those that we allow allocating inodes from. Note that we can't simply remove the check - it prevents us from accessing invalid data when fed invalid inode numbers from NFS or bulkstat. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:08:01 -05:00
Christoph Hellwig	1b0407125f	xfs: do not use xfs_mod_incore_sb_batch for per-cpu counters Update the per-cpu counters manually in xfs_trans_unreserve_and_mod_sb and remove support for per-cpu counters from xfs_mod_incore_sb_batch to simplify it. And added benefit is that we don't have to take m_sb_lock for transactions that only modify per-cpu counters. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:08:00 -05:00
Christoph Hellwig	96540c7858	xfs: do not use xfs_mod_incore_sb for per-cpu counters Export xfs_icsb_modify_counters and always use it for modifying the per-cpu counters. Remove support for per-cpu counters from xfs_mod_incore_sb to simplify it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:59 -05:00
Christoph Hellwig	61ba35dea0	xfs: remove XFS_MOUNT_NO_PERCPU_SB Fail the mount if we can't allocate memory for the per-CPU counters. This is consistent with how we handle everything else in the mount path and makes the superblock counter modification a lot simpler. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:58 -05:00
Dave Chinner	50f59e8eed	xfs: pack xfs_buf structure more tightly pahole reports the struct xfs_buf has quite a few holes in it, so packing the structure better will reduce the size of it by 16 bytes. Also, move all the fields used in cache lookups into the first cacheline. Before on x86_64: /* size: 320, cachelines: 5 / / sum members: 298, holes: 6, sum holes: 22 / After on x86_64: / size: 304, cachelines: 5 / / padding: 6 / / last cacheline: 48 bytes */ Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:57 -05:00
Dave Chinner	74f75a0cb7	xfs: convert buffer cache hash to rbtree The buffer cache hash is showing typical hash scalability problems. In large scale testing the number of cached items growing far larger than the hash can efficiently handle. Hence we need to move to a self-scaling cache indexing mechanism. I have selected rbtrees for indexing becuse they can have O(log n) search scalability, and insert and remove cost is not excessive, even on large trees. Hence we should be able to cache large numbers of buffers without incurring the excessive cache miss search penalties that the hash is imposing on us. To ensure we still have parallel access to the cache, we need multiple trees. Rather than hashing the buffers by disk address to select a tree, it seems more sensible to separate trees by typical access patterns. Most operations use buffers from within a single AG at a time, so rather than searching lots of different lists, separate the buffer indexes out into per-AG rbtrees. This means that searches during metadata operation have a much higher chance of hitting cache resident nodes, and that updates of the tree are less likely to disturb trees being accessed on other CPUs doing independent operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:56 -05:00
Dave Chinner	69b491c214	xfs: serialise inode reclaim within an AG Memory reclaim via shrinkers has a terrible habit of having N+M concurrent shrinker executions (N = num CPUs, M = num kswapds) all trying to shrink the same cache. When the cache they are all working on is protected by a single spinlock, massive contention an slowdowns occur. Wrap the per-ag inode caches with a reclaim mutex to serialise reclaim access to the AG. This will block concurrent reclaim in each AG but still allow reclaim to scan multiple AGs concurrently. Allow shrinkers to move on to the next AG if it can't get the lock, and if we can't get any AG, then start blocking on locks. To prevent reclaimers from continually scanning the same inodes in each AG, add a cursor that tracks where the last reclaim got up to and start from that point on the next reclaim. This should avoid only ever scanning a small number of inodes at the satart of each AG and not making progress. If we have a non-shrinker based reclaim pass, ignore the cursor and reset it to zero once we are done. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:55 -05:00
Dave Chinner	e3a20c0b02	xfs: batch inode reclaim lookup Batch and optimise the per-ag inode lookup for reclaim to minimise scanning overhead. This involves gang lookups on the radix trees to get multiple inodes during each tree walk, and tighter validation of what inodes can be reclaimed without blocking befor we take any locks. This is based on ideas suggested in a proof-of-concept patch posted by Nick Piggin. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:54 -05:00
Dave Chinner	78ae525676	xfs: implement batched inode lookups for AG walking With the reclaim code separated from the generic walking code, it is simple to implement batched lookups for the generic walk code. Separate out the inode validation from the execute operations and modify the tree lookups to get a batch of inodes at a time. Reclaim operations will be optimised separately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:53 -05:00
Dave Chinner	e13de955ca	xfs: split out inode walk inode grabbing When doing read side inode cache walks, the code to validate and grab an inode is common to all callers. Split it out of the execute callbacks in preparation for batching lookups. Similarly, split out the inode reference dropping from the execute callbacks into the main lookup look to be symmetric with the grab. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:52 -05:00
Dave Chinner	65d0f20533	xfs: split inode AG walking into separate code for reclaim The reclaim walk requires different locking and has a slightly different walk algorithm, so separate it out so that it can be optimised separately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:52 -05:00
Dave Chinner	69d6cc76cf	xfs: remove buftarg hash for external devices For RT and external log devices, we never use hashed buffers on them now. Remove the buftarg hash tables that are set up for them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:51 -05:00
Dave Chinner	1922c949c5	xfs: use unhashed buffers for size checks When we are checking we can access the last block of each device, we do not need to use cached buffers as they will be tossed away immediately. Use uncached buffers for size checks so that all IO prior to full in-memory structure initialisation does not use the buffer cache. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:50 -05:00
Dave Chinner	26af655233	xfs: kill XBF_FS_MANAGED buffers Filesystem level managed buffers are buffers that have their lifecycle controlled by the filesystem layer, not the buffer cache. We currently cache these buffers, which makes cleanup and cache walking somewhat troublesome. Convert the fs managed buffers to uncached buffers obtained by via xfs_buf_get_uncached(), and remove the XBF_FS_MANAGED special cases from the buffer cache. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:49 -05:00
Dave Chinner	ebad861b57	xfs: store xfs_mount in the buftarg instead of in the xfs_buf Each buffer contains both a buftarg pointer and a mount pointer. If we add a mount pointer into the buftarg, we can avoid needing the b_mount field in every buffer and grab it from the buftarg when needed instead. This shrinks the xfs_buf by 8 bytes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:48 -05:00
Dave Chinner	5adc94c247	xfs: introduced uncached buffer read primitve To avoid the need to use cached buffers for single-shot or buffers cached at the filesystem level, introduce a new buffer read primitive that bypasses the cache an reads directly from disk. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:47 -05:00
Dave Chinner	686865f76e	xfs: rename xfs_buf_get_nodaddr to be more appropriate xfs_buf_get_nodaddr() is really used to allocate a buffer that is uncached. While it is not directly assigned a disk address, the fact that they are not cached is a more important distinction. With the upcoming uncached buffer read primitive, we should be consistent with this disctinction. While there, make page allocation in xfs_buf_get_nodaddr() safe against memory reclaim re-entrancy into the filesystem by allowing a flags parameter to be passed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:46 -05:00
Dave Chinner	dcd79a1423	xfs: don't use vfs writeback for pure metadata modifications Under heavy multi-way parallel create workloads, the VFS struggles to write back all the inodes that have been changed in age order. The bdi flusher thread becomes CPU bound, spending 85% of it's time in the VFS code, mostly traversing the superblock dirty inode list to separate dirty inodes old enough to flush. We already keep an index of all metadata changes in age order - in the AIL - and continued log pressure will do age ordered writeback without any extra overhead at all. If there is no pressure on the log, the xfssyncd will periodically write back metadata in ascending disk address offset order so will be very efficient. Hence we can stop marking VFS inodes dirty during transaction commit or when changing timestamps during transactions. This will keep the inodes in the superblock dirty list to those containing data or unlogged metadata changes. However, the timstamp changes are slightly more complex than this - there are a couple of places that do unlogged updates of the timestamps, and the VFS need to be informed of these. Hence add a new function xfs_trans_ichgtime() for transactional changes, and leave xfs_ichgtime() for the non-transactional changes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-10-18 15:07:45 -05:00
Dave Chinner	e176579e70	xfs: lockless per-ag lookups When we start taking a reference to the per-ag for every cached buffer in the system, kernel lockstat profiling on an 8-way create workload shows the mp->m_perag_lock has higher acquisition rates than the inode lock and has significantly more contention. That is, it becomes the highest contended lock in the system. The perag lookup is trivial to convert to lock-less RCU lookups because perag structures never go away. Hence the only thing we need to protect against is tree structure changes during a grow. This can be done simply by replacing the locking in xfs_perag_get() with RCU read locking. This removes the mp->m_perag_lock completely from this path. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:44 -05:00
Dave Chinner	bd32d25a7c	xfs: remove debug assert for per-ag reference counting When we start taking references per cached buffer to the the perag it is cached on, it will blow the current debug maximum reference count assert out of the water. The assert has never caught a bug, and we have tracing to track changes if there ever is a problem, so just remove it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:43 -05:00
Dave Chinner	d1583a3833	xfs: reduce the number of CIL lock round trips during commit When commiting a transaction, we do a lock CIL state lock round trip on every single log vector we insert into the CIL. This is resulting in the lock being as hot as the inode and dcache locks on 8-way create workloads. Rework the insertion loops to bring the number of lock round trips to one per transaction for log vectors, and one more do the busy extents. Also change the allocation of the log vector buffer not to zero it as we copy over the entire allocated buffer anyway. This patch also includes a structural cleanup to the CIL item insertion provided by Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:42 -05:00
Poyo VL	9c169915ad	xfs: eliminate some newly-reported gcc warnings Ionut Gabriel Popescu <poyo_vl@yahoo.com> submitted a simple change to eliminate some "may be used uninitialized" warnings when building XFS. The reported condition seems to be something that GCC did not used to recognize or report. The warnings were produced by: gcc version 4.5.0 20100604 [gcc-4_5-branch revision 160292] (SUSE Linux) Signed-off-by: Ionut Gabriel Popescu <poyo_vl@yahoo.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:39 -05:00
Christoph Hellwig	c0e59e1ac0	xfs: remove the ->kill_root btree operation The implementation os ->kill_root only differ by either simply zeroing out the now unused buffer in the btree cursor in the inode allocation btree or using xfs_btree_setbuf in the allocation btree. Initially both of them used xfs_btree_setbuf, but the use in the ialloc btree was removed early on because it interacted badly with xfs_trans_binval. In addition to zeroing out the buffer in the cursor xfs_btree_setbuf updates the bc_ra array in the btree cursor, and calls xfs_trans_brelse on the buffer previous occupying the slot. The bc_ra update should be done for the alloc btree updated too, although the lack of it does not cause serious problems. The xfs_trans_brelse call on the other hand is effectively a no-op in the end - it keeps decrementing the bli_recur refcount until it hits zero, and then just skips out because the buffer will always be dirty at this point. So removing it for the allocation btree is just fine. So unify the code and move it to xfs_btree.c. While we're at it also replace the call to xfs_btree_setbuf with a NULL bp argument in xfs_btree_del_cursor with a direct call to xfs_trans_brelse given that the cursor is beeing freed just after this and the state updates are superflous. After this xfs_btree_setbuf is only used with a non-NULL bp argument and can thus be simplified. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:38 -05:00
Christoph Hellwig	acecf1b5d8	xfs: stop using xfs_qm_dqtobp in xfs_qm_dqflush In xfs_qm_dqflush we know that q_blkno must be initialized already from a previous xfs_qm_dqread. So instead of calling xfs_qm_dqtobp we can simply read the quota buffer directly. This also saves us from a duplicate xfs_qm_dqcheck call check and allows xfs_qm_dqtobp to be simplified now that it is always called for a newly initialized inode. In addition to that properly unwind all locks in xfs_qm_dqflush when xfs_qm_dqcheck fails. This mirrors a similar cleanup in the inode lookup done earlier. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:37 -05:00
Christoph Hellwig	52fda11424	xfs: simplify xfs_qm_dqusage_adjust There is no need to have the users and group/project quota locked at the same time. Get rid of xfs_qm_dqget_noattach and just do a xfs_qm_dqget inside xfs_qm_quotacheck_dqadjust for the quota we are operating on right now. The new version of xfs_qm_quotacheck_dqadjust holds the inode lock over it's operations, which is not a problem as it simply increments counters and there is no concern about log contention during mount time. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-18 15:07:36 -05:00
Dave Chinner	4472235205	xfs: Introduce XFS_IOC_ZERO_RANGE XFS_IOC_ZERO_RANGE is the equivalent of an atomic XFS_IOC_UNRESVSP/ XFS_IOC_RESVSP call pair. It enabled ranges of written data to be turned into zeroes without requiring IO or having to free and reallocate the extents in the range given as would occur if we had to punch and then preallocate them separately. This enables applications to zero parts of files very quickly without changing the layout of the files in any way. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-10-18 15:07:25 -05:00
Dave Chinner	3ae4c9deb3	xfs: use range primitives for xfs page cache operations While XFS passes ranges to operate on from the core code, the functions being called ignore the either the entire range or the end of the range. This is historical because when the function were written linux didn't have the necessary range operations. Update the functions to use the correct operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-10-18 15:07:24 -05:00
Tejun Heo	6370a6ad3b	workqueue: add and use WQ_MEM_RECLAIM flag Add WQ_MEM_RECLAIM flag which currently maps to WQ_RESCUER, mark WQ_RESCUER as internal and replace all external WQ_RESCUER usages to WQ_MEM_RECLAIM. This makes the API users express the intent of the workqueue instead of indicating the internal mechanism used to guarantee forward progress. This is also to make it cleaner to add more semantics to WQ_MEM_RECLAIM. For example, if deemed necessary, memory reclaim workqueues can be made highpri. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jeff Garzik <jgarzik@pobox.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Steven Whitehouse <swhiteho@redhat.com>	2010-10-11 15:20:26 +02:00
Johannes Weiner	081003fff4	xfs: properly account for reclaimed inodes When marking an inode reclaimable, a per-AG counter is increased, the inode is tagged reclaimable in its per-AG tree, and, when this is the first reclaimable inode in the AG, the AG entry in the per-mount tree is also tagged. When an inode is finally reclaimed, however, it is only deleted from the per-AG tree. Neither the counter is decreased, nor is the parent tree's AG entry untagged properly. Since the tags in the per-mount tree are not cleared, the inode shrinker iterates over all AGs that have had reclaimable inodes at one point in time. The counters on the other hand signal an increasing amount of slab objects to reclaim. Since "70e60ce xfs: convert inode shrinker to per-filesystem context" this is not a real issue anymore because the shrinker bails out after one iteration. But the problem was observable on a machine running v2.6.34, where the reclaimable work increased and each process going into direct reclaim eventually got stuck on the xfs inode shrinking path, trying to scan several million objects. Fix this by properly unwinding the reclaimable-state tracking of an inode when it is reclaimed. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: stable@kernel.org Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-10-06 22:35:48 -05:00
Jan Kara	80f44b152c	quota: Make QUOTACTL config be selected by its users Remove "depends on" line from QUOTACTL config option and rather select the option explicitely from config options which need it. It makes more sense this way and also fixes Kconfig warning due to GFS2 selecting QUOTACTL but QUOTACTL not depending on it. Signed-off-by: Jan Kara <jack@suse.cz>	2010-10-05 12:16:37 +02:00
Dave Chinner	80168676eb	xfs: force background CIL push under sustained load I have been seeing occasional pauses in transaction throughput up to 30s long under heavy parallel workloads. The only notable thing was that the xfsaild was trying to be active during the pauses, but making no progress. It was running exactly 20 times a second (on the 50ms no-progress backoff), and the number of pushbuf events was constant across this time as well. IOWs, the xfsaild appeared to be stuck on buffers that it could not push out. Further investigation indicated that it was trying to push out inode buffers that were pinned and/or locked. The xfsbufd was also getting woken at the same frequency (by the xfsaild, no doubt) to push out delayed write buffers. The xfsbufd was not making any progress because all the buffers in the delwri queue were pinned. This scan- and-make-no-progress dance went one in the trace for some seconds, before the xfssyncd came along an issued a log force, and then things started going again. However, I noticed something strange about the log force - there were way too many IO's issued. 516 log buffers were written, to be exact. That added up to 129MB of log IO, which got me very interested because it's almost exactly 25% of the size of the log. He delayed logging code is suppose to aggregate the minimum of 25% of the log or 8MB worth of changes before flushing. That's what really puzzled me - why did a log force write 129MB instead of only 8MB? Essentially what has happened is that no CIL pushes had occurred since the previous tail push which cleared out 25% of the log space. That caused all the new transactions to block because there wasn't log space for them, but they kick the xfsaild to push the tail. However, the xfsaild was not making progress because there were buffers it could not lock and flush, and the xfsbufd could not flush them because they were pinned. As a result, both the xfsaild and the xfsbufd could not move the tail of the log forward without the CIL first committing. The cause of the problem was that the background CIL push, which should happen when 8MB of aggregated changes have been committed, is being held off by the concurrent transaction commit load. The background push does a down_write_trylock() which will fail if there is a concurrent transaction commit holding the push lock in read mode. With 8 CPUs all doing transactions as fast as they can, there was enough concurrent transaction commits to hold off the background push until tail-pushing could no longer free log space, and the halt would occur. It should be noted that there is no reason why it would halt at 25% of log space used by a single CIL checkpoint. This bug could definitely violate the "no transaction should be larger than half the log" requirement and hence result in corruption if the system crashed under heavy load. This sort of bug is exactly the reason why delayed logging was tagged as experimental.... The fix is to start blocking background pushes once the threshold has been exceeded. Rework the threshold calculations to keep the amount of log space a CIL checkpoint can use to below that of the AIL push threshold to avoid the problem completely. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-09-29 07:51:03 -05:00
Christoph Hellwig	dd3932eddf	block: remove BLKDEV_IFL_WAIT All the blkdev_issue_* helpers can only sanely be used for synchronous caller. To issue cache flushes or barriers asynchronously the caller needs to set up a bio by itself with a completion callback to move the asynchronous state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always specified when calling blkdev_issue_* and also remove the now unused flags argument to blkdev_issue_flush and blkdev_issue_zeroout. For blkdev_issue_discard we need to keep it for the secure discard flag, which gains a more descriptive name and loses the bitops vs flag confusion. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-16 20:52:58 +02:00
Dave Chinner	51749e47e1	xfs: log IO completion workqueue is a high priority queue The workqueue implementation in 2.6.36-rcX has changed, resulting in the workqueues no longer having dedicated threads for work processing. This has caused severe livelocks under heavy parallel create workloads because the log IO completions have been getting held up behind metadata IO completions. Hence log commits would stall, memory allocation would stall because pages could not be cleaned, and lock contention on the AIL during inode IO completion processing was being seen to slow everything down even further. By making the log Io completion workqueue a high priority workqueue, they are queued ahead of all data/metadata IO completions and processed before the data/metadata completions. Hence the log never gets stalled, and operations needed to clean memory can continue as quickly as possible. This avoids the livelock conditions and allos the system to keep running under heavy load as per normal. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-09-10 10:16:54 -05:00
Dan Rosenberg	a122eb2fdf	xfs: prevent reading uninitialized stack memory The XFS_IOC_FSGETXATTR ioctl allows unprivileged users to read 12 bytes of uninitialized stack memory, because the fsxattr struct declared on the stack in xfs_ioc_fsgetxattr() does not alter (or zero) the 12-byte fsx_pad member before copying it back to the user. This patch takes care of it. Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-09-10 07:39:28 -05:00
Christoph Hellwig	80f6c29d8a	xfs: replace barriers with explicit flush / FUA usage Switch to the WRITE_FLUSH_FUA flag for log writes and remove the EOPNOTSUPP detection for barriers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-10 12:35:38 +02:00
Alex Elder	cb7a93412a	Merge branch '2.6.36-xfs-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev	2010-09-03 09:02:32 -05:00
Tao Ma	9af2546508	xfs: Make fiemap work with sparse files In xfs_vn_fiemap, we set bvm_count to fi_extent_max + 1 and want to return fi_extent_max extents, but actually it won't work for a sparse file. The reason is that in xfs_getbmap we will calculate holes and set it in 'out', while out is malloced by bmv_count(fi_extent_max+1) which didn't consider holes. So in the worst case, if 'out' vector looks like [hole, extent, hole, extent, hole, ... hole, extent, hole], we will only return half of fi_extent_max extents. This patch add a new parameter BMV_IF_NO_HOLES for bvm_iflags. So with this flags, we don't use our 'out' in xfs_getbmap for a hole. The solution is a bit ugly by just don't increasing index of 'out' vector. I felt that it is not easy to skip it at the very beginning since we have the complicated check and some function like xfs_getbmapx_fix_eof_hole to adjust 'out'. Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-09-03 09:02:11 -05:00
Dave Chinner	72656c46f5	xfs: prevent 32bit overflow in space reservation If we attempt to preallocate more than 2^32 blocks of space in a single syscall, the transaction block reservation will overflow leading to a hangs in the superblock block accounting code. This is trivially reproduced with xfs_io. Fix the problem by capping the allocation reservation to the maximum number of blocks a single xfs_bmapi() call can allocate (2^21 blocks). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-09-03 12:19:33 +10:00
Arkadiusz Mi?kiewicz	23963e54ce	xfs: Disallow 32bit project quota id Currently on-disk structure is able to keep only 16bit project quota id, so disallow 32bit ones. This fixes a problem where parts of kernel structures holding project quota id are 32bit while parts (on-disk) are 16bit variables which causes project quota member files to be inaccessible for some operations (like mv/rm). Signed-off-by: Arkadiusz Mi?kiewicz <arekm@maven.pl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-09-02 10:29:08 -05:00
Dave Chinner	9bc08a45fb	xfs: improve buffer cache hash scalability When doing large parallel file creates on a 16p machines, large amounts of time is being spent in _xfs_buf_find(). A system wide profile with perf top shows this: 1134740.00 19.3% _xfs_buf_find 733142.00 12.5% __ticket_spin_lock The problem is that the hash contains 45,000 buffers, and the hash table width is only 256 buffers. That means we've got around 200 buffers per chain, and searching it is quite expensive. The hash table size needs to increase. Secondly, every time we do a lookup, we promote the buffer we find to the head of the hash chain. This is causing cachelines to be dirtied and causes invalidation of cachelines across all CPUs that may have walked the hash chain recently. hence every walk of the hash chain is effectively a cold cache walk. Remove the promotion to avoid this invalidation. The results are: 1045043.00 21.2% __ticket_spin_lock 326184.00 6.6% _xfs_buf_find A 70% drop in the CPU usage when looking up buffers. Unfortunately that does not result in an increase in performance underthis workload as contention on the inode_lock soaks up most of the reduction in CPU usage. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-09-02 15:14:38 +10:00
Christoph Hellwig	b5420f2359	xfs: do not discard page cache data on EAGAIN If xfs_map_blocks returns EAGAIN because of lock contention we must redirty the page and not disard the pagecache content and return an error from writepage. We used to do this correctly, but the logic got lost during the recent reshuffle of the writepage code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Mike Gao <ygao.linux@gmail.com> Tested-by: Mike Gao <ygao.linux@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com>	2010-08-24 11:47:51 +10:00
Dave Chinner	3b93c7aaef	xfs: don't do memory allocation under the CIL context lock Formatting items requires memory allocation when using delayed logging. Currently that memory allocation is done while holding the CIL context lock in read mode. This means that if memory allocation takes some time (e.g. enters reclaim), we cannot push on the CIL until the allocation(s) required by formatting complete. This can stall CIL pushes for some time, and once a push is stalled so are all new transaction commits. Fix this splitting the item formatting into two steps. The first step which does the allocation and memcpy() into the allocated buffer is now done outside the CIL context lock, and only the CIL insert is done inside the CIL context lock. This avoids the stall issue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-08-24 11:45:53 +10:00
Dave Chinner	a44f13edf0	xfs: Reduce log force overhead for delayed logging Delayed logging adds some serialisation to the log force process to ensure that it does not deference a bad commit context structure when determining if a CIL push is necessary or not. It does this by grabing the CIL context lock exclusively, then dropping it before pushing the CIL if necessary. This causes serialisation of all log forces and pushes regardless of whether a force is necessary or not. As a result fsync heavy workloads (like dbench) can be significantly slower with delayed logging than without. To avoid this penalty, copy the current sequence from the context to the CIL structure when they are swapped. This allows us to do unlocked checks on the current sequence without having to worry about dereferencing context structures that may have already been freed. Hence we can remove the CIL context locking in the forcing code and only call into the push code if the current context matches the sequence we need to force. By passing the sequence into the push code, we can check the sequence again once we have the CIL lock held exclusive and abort if the sequence has already been pushed. This avoids a lock round-trip and unnecessary CIL pushes when we have racing push calls. The result is that the regression in dbench performance goes away - this change improves dbench performance on a ramdisk from ~2100MB/s to ~2500MB/s. This compares favourably to not using delayed logging which retuns ~2500MB/s for the same workload. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-08-24 11:40:03 +10:00
Dave Chinner	1a387d3be2	xfs: dummy transactions should not dirty VFS state When we need to cover the log, we issue dummy transactions to ensure the current log tail is on disk. Unfortunately we currently use the root inode in the dummy transaction, and the act of committing the transaction dirties the inode at the VFS level. As a result, the VFS writeback of the dirty inode will prevent the filesystem from idling long enough for the log covering state machine to complete. The state machine gets stuck in a loop issuing new dummy transactions to cover the log and never makes progress. To avoid this problem, the dummy transactions should not cause externally visible state changes. To ensure this occurs, make sure that dummy transactions log an unchanging field in the superblock as it's state is never propagated outside the filesystem. This allows the log covering state machine to complete successfully and the filesystem now correctly enters a fully idle state about 90s after the last modification was made. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-08-24 11:46:31 +10:00
Stuart Brodsky	2fe33661fc	xfs: ensure f_ffree returned by statfs() is non-negative Because of delayed updates to sb_icount field in the super block, it is possible to allocate over maxicount number of inodes. This causes the arithmetic to calculate a negative number of free inodes in user commands like df or stat -f. Since maxicount is a somewhat arbitrary number, a slight over allocation is not critical but user commands should be displayed as 0 or greater and never go negative. To do this the value in the stats buffer f_ffree is capped to never go negative. [ Modified to use max_t as per Christoph's comment. ] Signed-off-by: Stu Brodsky <sbrodsky@sgi.com> Signed-off-by: Dave Chinner <dchinner@redhat.com>	2010-08-24 11:46:05 +10:00
Dave Chinner	efceab1d56	xfs: handle negative wbc->nr_to_write during sync writeback During data integrity (WB_SYNC_ALL) writeback, wbc->nr_to_write will go negative on inodes with more than 1024 dirty pages due to implementation details of write_cache_pages(). Currently XFS will abort page clustering in writeback once nr_to_write drops below zero, and so for data integrity writeback we will do very inefficient page at a time allocation and IO submission for inodes with large numbers of dirty pages. Fix this by only aborting the page clustering code when wbc->nr_to_write is negative and the sync mode is WB_SYNC_NONE. Cc: <stable@kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-08-24 11:44:56 +10:00
Dave Chinner	4536f2ad8b	xfs: fix untrusted inode number lookup Commit `7124fe0a5b` ("xfs: validate untrusted inode numbers during lookup") changes the inode lookup code to do btree lookups for untrusted inode numbers. This change made an invalid assumption about the alignment of inodes and hence incorrectly calculated the first inode in the cluster. As a result, some inode numbers were being incorrectly considered invalid when they were actually valid. The issue was not picked up by the xfstests suite because it always runs fsr and dump (the two utilities that utilise the bulkstat interface) on cache hot inodes and hence the lookup code in the cold cache path was not sufficiently exercised to uncover this intermittent problem. Fix the issue by relaxing the btree lookup criteria and then checking if the record returned contains the inode number we are lookup for. If it we get an incorrect record, then the inode number is invalid. Cc: <stable@kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-08-24 11:42:30 +10:00
Dave Chinner	5b3eed756c	xfs: ensure we mark all inodes in a freed cluster XFS_ISTALE Under heavy load parallel metadata loads (e.g. dbench), we can fail to mark all the inodes in a cluster being freed as XFS_ISTALE as we skip inodes we cannot get the XFS_ILOCK_EXCL or the flush lock on. When this happens and the inode cluster buffer has already been marked stale and freed, inode reclaim can try to write the inode out as it is dirty and not marked stale. This can result in writing th metadata to an freed extent, or in the case it has already been overwritten trigger a magic number check failure and return an EUCLEAN error such as: Filesystem "ram0": inode 0x442ba1 background reclaim flush failed with 117 Fix this by ensuring that we hoover up all in memory inodes in the cluster and mark them XFS_ISTALE when freeing the cluster. Cc: <stable@kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-08-24 11:42:41 +10:00
Dave Chinner	d17c701ce6	xfs: unlock items before allowing the CIL to commit When we commit a transaction using delayed logging, we need to unlock the items in the transaciton before we unlock the CIL context and allow it to be checkpointed. If we unlock them after we release the CIl context lock, the CIL can checkpoint and complete before we free the log items. This breaks stale buffer item unlock and unpin processing as there is an implicit assumption that the unlock will occur before the unpin. Also, some log items need to store the LSN of the transaction commit in the item (inodes and EFIs) and so can race with other transaction completions if we don't prevent the CIL from checkpointing before the unlock occurs. Cc: <stable@kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-08-24 11:42:52 +10:00
Linus Torvalds	5f248c9c25	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits) no need for list_for_each_entry_safe()/resetting with superblock list Fix sget() race with failing mount vfs: don't hold s_umount over close_bdev_exclusive() call sysv: do not mark superblock dirty on remount sysv: do not mark superblock dirty on mount btrfs: remove junk sb_dirt change BFS: clean up the superblock usage AFFS: wait for sb synchronization when needed AFFS: clean up dirty flag usage cifs: truncate fallout mbcache: fix shrinker function return value mbcache: Remove unused features add f_flags to struct statfs(64) pass a struct path to vfs_statfs update VFS documentation for method changes. All filesystems that need invalidate_inode_buffers() are doing that explicitly convert remaining ->clear_inode() to ->evict_inode() Make ->drop_inode() just return whether inode needs to be dropped fs/inode.c:clear_inode() is gone fs/inode.c:evict() doesn't care about delete vs. non-delete paths now ... Fix up trivial conflicts in fs/nilfs2/super.c	2010-08-10 11:26:52 -07:00
Al Viro	b57922d97f	convert remaining ->clear_inode() to ->evict_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-08-09 16:48:37 -04:00
Al Viro	a4ffdde6e5	simplify checks for I_CLEAR/I_FREEING add I_CLEAR instead of replacing I_FREEING with it. I_CLEAR is equivalent to I_FREEING for almost all code looking at either; it's there to keep track of having called clear_inode() exactly once per inode lifetime, at some point after having set I_FREEING. I_CLEAR and I_FREEING never get set at the same time with the current code, so we can switch to setting i_flags to I_FREEING \| I_CLEAR instead of I_CLEAR without loss of information. As the result of such change, checks become simpler and the amount of code that needs to know about I_CLEAR shrinks a lot. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-08-09 16:47:44 -04:00
Christoph Hellwig	fa9b227e90	xfs: new truncate sequence Convert XFS to the new truncate sequence. We still can have errors after updating the file size in xfs_setattr, but these are real I/O errors and lead to a transaction abort and filesystem shutdown, so they are not an issue. Errors from ->write_begin and write_end can now be handled correctly because we can actually get rid of the delalloc extents while previous the buffer state was stipped in block_invalidatepage. There is still no error handling for ->direct_IO, because doing so will need some major restructuring given that we only have the iolock shared and do not hold i_mutex at all. Fortunately leaving the normally allocated blocks behind there is not a major issue and this will get cleaned up by xfs_free_eofblock later. Note: the patch is against Al's vfs.git tree as that contains the nessecary preparations. I'd prefer to get it applied there so that we can get some testing in linux-next. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-08-09 16:47:42 -04:00
Christoph Hellwig	155130a4f7	get rid of block_write_begin_newtrunc Move the call to vmtruncate to get rid of accessive blocks to the callers in preparation of the new truncate sequence and rename the non-truncating version to block_write_begin. While we're at it also remove several unused arguments to block_write_begin. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-08-09 16:47:33 -04:00
Christoph Hellwig	eafdc7d190	sort out blockdev_direct_IO variants Move the call to vmtruncate to get rid of accessive blocks to the callers in prepearation of the new truncate calling sequence. This was only done for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant was not needed anyway. Get rid of blockdev_direct_IO_no_locking and its _newtrunc variant while at it as just opencoding the two additional paramters is shorted than the name suffix. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-08-09 16:47:29 -04:00
Linus Torvalds	90e0c22596	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: ext3: Fix dirtying of journalled buffers in data=journal mode ext3: default to ordered mode quota: Use mark_inode_dirty_sync instead of mark_inode_dirty quota: Change quota error message to print out disk and function name MAINTAINERS: Update entries of ext2 and ext3 MAINTAINERS: Update address of Andreas Dilger ext3: Avoid filesystem corruption after a crash under heavy delete load ext3: remove vestiges of nobh support ext3: Fix set but unused variables quota: clean up quota active checks quota: Clean up the namespace in dqblk_xfs.h quota: check quota reservation on remove_dquot_ref	2010-08-07 12:57:07 -07:00
Christoph Hellwig	209fb87a25	xfs simplify and speed up direct I/O completions Our current handling of direct I/O completions is rather suboptimal, because we defer it to a workqueue more often than needed, and we perform a much to aggressive flush of the workqueue in case unwritten extent conversions happen. This patch changes the direct I/O reads to not even use a completion handler, as we don't bother to use it at all, and to perform the unwritten extent conversions in caller context for synchronous direct I/O. For a small I/O size direct I/O workload on a consumer grade SSD, such as the untar of a kernel tree inside qemu this patch gives speedups of about 5%. Getting us much closer to the speed of a native block device, or a fully allocated XFS file. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-07-26 16:09:19 -05:00
Christoph Hellwig	fb511f2150	xfs: move aio completion after unwritten extent conversion If we write into an unwritten extent using AIO we need to complete the AIO request after the extent conversion has finished. Without that a read could race to see see the extent still unwritten and return zeros. For synchronous I/O we already take care of that by flushing the xfsconvertd workqueue (which might be a bit of overkill). To do that add iocb and result fields to struct xfs_ioend, so that we can call aio_complete from xfs_end_io after the extent conversion has happened. Note that we need a new result field as io_error is used for positive errno values, while the AIO code can return negative error values and positive transfer sizes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-07-26 16:09:10 -05:00
Christoph Hellwig	40e2e97316	direct-io: move aio_complete into ->end_io Filesystems with unwritten extent support must not complete an AIO request until the transaction to convert the extent has been commited. That means the aio_complete calls needs to be moved into the ->end_io callback so that the filesystem can control when to call it exactly. This makes a bit of a mess out of dio_complete and the ->end_io callback prototype even more complicated. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-07-26 16:09:02 -05:00
Dave Chinner	696123fca8	xfs: fix big endian build Commit 0fd7275cc42ab734eaa1a2c747e65479bd1e42af ("xfs: fix gcc 4.6 set but not read and unused statement warnings") failed to convert some code inside XFS_NATIVE_HOST (big endian host code only) and hence fails to build on such machines. Fix it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-07-26 16:07:38 -05:00
Christoph Hellwig	ecd7f082d6	xfs: clean up xfs_bmap_get_bp Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:53 -05:00
Christoph Hellwig	5d18898b20	xfs: simplify xfs_truncate_file xfs_truncate_file is only used for truncating quota files. Move it to xfs_qm_syscalls.c so it can be marked static and take advatange of the fact by removing the unused page cache validation and taking the iget into the helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:52 -05:00
Christoph Hellwig	939d723b72	xfs: kill the b_strat callback in xfs_buf The b_strat callback is used by xfs_buf_iostrategy to perform additional checks before submitting a buffer. It is used in xfs_bwrite and when writing out delayed buffers. In xfs_bwrite it we can de-virtualize the call easily as b_strat is set a few lines above the call to xfs_buf_iostrategy. For the delayed buffers the rationale is a bit more complicated: - there are three callers of xfs_buf_delwri_queue, which places buffers on the delwri list: (1) xfs_bdwrite - this sets up b_strat, so it's fine (2) xfs_buf_iorequest. None of the callers can have XBF_DELWRI set: - xlog_bdstrat is only used for log buffers, which are never delwri - _xfs_buf_read explicitly clears the delwri flag - xfs_buf_iodone_work retries log buffers only - xfsbdstrat - only used for reads, superblock writes without the delwri flag, log I/O and file zeroing with explicitly allocated buffers. - xfs_buf_iostrategy - only calls xfs_buf_iorequest if b_strat is not set (3) xfs_buf_unlock - only puts the buffer on the delwri list if the DELWRI flag is already set. The DELWRI flag is only ever set in xfs_bwrite, xfs_buf_iodone_callbacks, or xfs_trans_log_buf. For xfs_buf_iodone_callbacks and xfs_trans_log_buf we require an initialized buf item, which means b_strat was set to xfs_bdstrat_cb in xfs_buf_item_init. Conclusion: we can just get rid of the callback and replace it with explicit calls to xfs_bdstrat_cb. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:52 -05:00
Christoph Hellwig	a64afb057b	xfs: remove obsolete osyncisosync mount option Since Linux 2.6.33 the kernel has support for real O_SYNC, which made the osyncisosync option a no-op. Warn the users about this and remove the mount flag for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:51 -05:00
Christoph Hellwig	0664ce8d0f	xfs: clean up filestreams helpers Move xfs_filestream_peek_ag, xxfs_filestream_get_ag and xfs_filestream_put_ag from xfs_filestream.h to xfs_filestream.c where it's only callers are, and remove the inline marker while we're at it to let the compiler decide on the inlining. Also don't return a value from xfs_filestream_put_ag because we don't need it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:51 -05:00
Christoph Hellwig	73523a2ecf	xfs: fix gcc 4.6 set but not read and unused statement warnings [hch: dropped a few hunks that need structural changes instead] Signed-off-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:51 -05:00
Tony Luck	0f1a932f5d	xfs: Fix build when CONFIG_XFS_POSIX_ACL=n When CONFIG_XFS_POSIX_ACL is not set "xfs_check_acl" is #defined to NULL - which breaks the code attempting to add a tracepoint on this function. Only define the tracepoint when the function exists. Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:50 -05:00
Kulikov Vasiliy	3f34885cd7	xfs: fix unsigned underflow in xfs_free_eofblocks map_len is unsigned. Checking map_len <= 0 is buggy when it should be below zero. So, check exact expression instead of map_len. Signed-off-by: Kulikov Vasiliy <segooon@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:50 -05:00
Dave Chinner	aea1b95321	xfs: use GFP_NOFS for page cache allocation Avoid a lockdep warning by preventing page cache allocation from recursing back into the filesystem during memory reclaim. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:49 -05:00
Dave Chinner	4a7edddcb5	xfs: fix memory reclaim recursion deadlock on locked inode buffer Calling into memory reclaim with a locked inode buffer can deadlock if memory reclaim tries to lock the inode buffer during inode teardown. Convert the relevant memory allocations to use KM_NOFS to avoid this deadlock condition. Reported-by: Peter Watkins <treestem@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:49 -05:00
Dave Chinner	438697064a	xfs: fix xfs_trans_add_item() lockdep warnings xfs_trans_add_item() is called with ip->i_ilock held, which means it is unsafe for memory reclaim to recurse back into the filesystem (ilock is required in writeback). Hence the allocation needs to be KM_NOFS to avoid recursion. Lockdep report indicating memory allocation being called with the ip->i_ilock held is as follows: [ 1749.866796] ================================= [ 1749.867788] [ INFO: inconsistent lock state ] [ 1749.868327] 2.6.35-rc3-dgc+ #25 [ 1749.868741] --------------------------------- [ 1749.868741] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage. [ 1749.868741] dd/2835 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 1749.868741] (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190 [ 1749.868741] {IN-RECLAIM_FS-W} state was registered at: [ 1749.868741] [<ffffffff810b3a97>] __lock_acquire+0x437/0x1450 [ 1749.868741] [<ffffffff810b4b56>] lock_acquire+0xa6/0x160 [ 1749.868741] [<ffffffff810a20b5>] down_write_nested+0x65/0xb0 [ 1749.868741] [<ffffffff813170fb>] xfs_ilock+0x10b/0x190 [ 1749.868741] [<ffffffff8134e819>] xfs_reclaim_inode+0x99/0x310 [ 1749.868741] [<ffffffff8134f56b>] xfs_inode_ag_walk+0x8b/0x150 [ 1749.868741] [<ffffffff8134f6bb>] xfs_inode_ag_iterator+0x8b/0xf0 [ 1749.868741] [<ffffffff8134f7a8>] xfs_reclaim_inode_shrink+0x88/0x90 [ 1749.868741] [<ffffffff81119d07>] shrink_slab+0x137/0x1a0 [ 1749.868741] [<ffffffff8111bbe1>] balance_pgdat+0x421/0x6a0 [ 1749.868741] [<ffffffff8111bf7d>] kswapd+0x11d/0x320 [ 1749.868741] [<ffffffff8109ce56>] kthread+0x96/0xa0 [ 1749.868741] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10 [ 1749.868741] irq event stamp: 4234335 [ 1749.868741] hardirqs last enabled at (4234335): [<ffffffff81147d25>] kmem_cache_free+0x115/0x220 [ 1749.868741] hardirqs last disabled at (4234334): [<ffffffff81147c4d>] kmem_cache_free+0x3d/0x220 [ 1749.868741] softirqs last enabled at (4233112): [<ffffffff81084dd2>] __do_softirq+0x142/0x260 [ 1749.868741] softirqs last disabled at (4233095): [<ffffffff81035edc>] call_softirq+0x1c/0x50 [ 1749.868741] [ 1749.868741] other info that might help us debug this: [ 1749.868741] 2 locks held by dd/2835: [ 1749.868741] #0: (&(&ip->i_iolock)->mr_lock#2){+.+.+.}, at: [<ffffffff81316edd>] xfs_ilock_nowait+0xed/0x200 [ 1749.868741] #1: (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190 [ 1749.868741] [ 1749.868741] stack backtrace: [ 1749.868741] Pid: 2835, comm: dd Not tainted 2.6.35-rc3-dgc+ #25 [ 1749.868741] Call Trace: [ 1749.868741] [<ffffffff810b1faa>] print_usage_bug+0x18a/0x190 [ 1749.868741] [<ffffffff8104264f>] ? save_stack_trace+0x2f/0x50 [ 1749.868741] [<ffffffff810b2400>] ? check_usage_backwards+0x0/0xf0 [ 1749.868741] [<ffffffff810b2f11>] mark_lock+0x331/0x400 [ 1749.868741] [<ffffffff810b3047>] mark_held_locks+0x67/0x90 [ 1749.868741] [<ffffffff810b3111>] lockdep_trace_alloc+0xa1/0xe0 [ 1749.868741] [<ffffffff81147419>] kmem_cache_alloc+0x39/0x1e0 [ 1749.868741] [<ffffffff8133f954>] kmem_zone_alloc+0x94/0xe0 [ 1749.868741] [<ffffffff8133f9be>] kmem_zone_zalloc+0x1e/0x50 [ 1749.868741] [<ffffffff81335f02>] xfs_trans_add_item+0x72/0xb0 [ 1749.868741] [<ffffffff81339e41>] xfs_trans_ijoin+0xa1/0xd0 [ 1749.868741] [<ffffffff81319f82>] xfs_itruncate_finish+0x312/0x5d0 [ 1749.868741] [<ffffffff8133cb87>] xfs_free_eofblocks+0x227/0x280 [ 1749.868741] [<ffffffff8133cd18>] xfs_release+0x138/0x190 [ 1749.868741] [<ffffffff813464c5>] xfs_file_release+0x15/0x20 [ 1749.868741] [<ffffffff81150ebf>] fput+0x13f/0x260 [ 1749.868741] [<ffffffff8114d8c2>] filp_close+0x52/0x80 [ 1749.868741] [<ffffffff8114d9a9>] sys_close+0xb9/0x120 [ 1749.868741] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:49 -05:00
Dave Chinner	2f11feabb1	xfs: simplify and remove xfs_ireclaim xfs_ireclaim has to get and put te pag structure because it is only called with the inode to reclaim. The one caller of this function already has a reference on the pag and a pointer to is, so move the radix tree delete to the caller and remove xfs_ireclaim completely. This avoids a xfs_perag_get/put on every inode being reclaimed. The overhead was noticed in a bug report at: https://bugzilla.kernel.org/show_bug.cgi?id=16348 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:48 -05:00
Dave Chinner	ec53d1dbb3	xfs: don't block on buffer read errors xfs_buf_read() fails to detect dispatch errors before attempting to wait on sychronous IO. If there was an error, it will get stuck forever, waiting for an I/O that was never started. Make sure the error is detected correctly. Further, such a failure can leave locked pages in the page cache which will cause a later operation to hang on the page. Ensure that we correctly process pages in the buffers when we get a dispatch error. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:48 -05:00
Dave Chinner	a4190f90b4	xfs: move inode shrinker unregister even earlier I missed Dave Chinner's second revision of this change, and pushed his first version out to the repository instead. commit a476c59ebb279d738718edc0e3fb76aab3687114 Author: Dave Chinner <dchinner@redhat.com> This commit compensates for that by moving a block of code up a bit further, with a result that matches the the effect of Dave's second version. Dave's first version was: Reviewed-by: Eric Sandeen <sandeen@redhat.com> Dave's second version was: Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>	2010-07-26 13:16:47 -05:00
Christoph Hellwig	fa17b25e9f	xfs: remove a dmapi leftover The open_exec file operation is only added by the external dmapi patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-07-26 13:16:47 -05:00
Christoph Hellwig	78558fe8d8	xfs: writepage always has buffers These days we always have buffers thanks to ->page_mkwrite. And we already have an assert a few lines above tripping in case that was not true due to a bug. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-07-26 13:16:46 -05:00
Christoph Hellwig	d4f7a5cbd5	xfs: allow writeback from kswapd We only need disable I/O from direct or memcg reclaim. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-07-26 13:16:46 -05:00
Christoph Hellwig	651701d71d	xfs: remove incorrect log write optimization We do need a barrier for the first buffer of a split log write. Otherwise we might incorrectly stamp the tail LSN into transactions in the first part of the split write, or not flush data I/O before updating the inode size. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-07-26 13:16:45 -05:00
Dave Chinner	2727ccc950	xfs: unregister inode shrinker before freeing filesystem structures Currently we don't remove the XFS mount from the shrinker list until late in the unmount path. By this time, we have already torn down the internals of the filesystem (e.g. the per-ag structures), and hence if the shrinker is executed between the teardown and the unregistering, the shrinker will get NULL per-ag structure pointers and panic trying to dereference them. Fix this by removing the xfs mount from the shrinker list before tearing down it's internal structures. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-07-26 13:16:45 -05:00
Christoph Hellwig	cca28fb83d	xfs: split xfs_itrace_entry Replace the xfs_itrace_entry catchall with specific trace points. For most simple callers we now use the simple inode class, which used to be the iget class, but add more details tracing for namespace events, which now includes the name of the directory entries manipulated. Remove the xfs_inactive trace point, which is a duplicate of the clear_inode one, and the xfs_change_file_space trace point, which is immediately followed by the more specific alloc/free space trace points. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:44 -05:00
Christoph Hellwig	f2d6761433	xfs: remove xfs_iput xfs_iput is just a small wrapper for xfs_iunlock + IRELE. Having this out of line wrapper means the trace events in those two can't track their caller properly. So just remove the wrapper and opencode the unlock + rele in the few callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:44 -05:00
Christoph Hellwig	ef35e9255d	xfs: remove xfs_iput_new We never get an i_mode of 0 or a locked VFS inode until we pass in the XFS_IGET_CREATE flag to xfs_iget, which makes xfs_iput_new equivalent to xfs_iput for the only caller. In addition to that xfs_nfs_get_inode does not even need to lock the inode given that the generation never changes for a life inode, so just pass a 0 lock_flags to xfs_iget and release the inode using IRELE in the error path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:44 -05:00
Christoph Hellwig	d2e078c33c	xfs: some iget tracing cleanups / fixes The xfs_iget_alloc/found tracepoints are a bit misnamed and misplaced. Rename them to xfs_iget_hit/xfs_iget_miss and move them to the beggining of the xfs_iget_cache_hit/miss functions. Add a new xfs_iget_reclaim_fail tracepoint for the case where we fail to re-initialize a VFS inode, and add a second instance of the xfs_iget_skip tracepoint for the case of a failed igrab() call. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:43 -05:00
Christoph Hellwig	807cbbdb43	xfs: do not use emums for flags used in tracing The tracing code can't print flags defined as enums. Most flags that we want to print are defines as macros already, but move the few remaining ones over to make the trace output more useful. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:43 -05:00
Christoph Hellwig	64c8614941	xfs: remove explicit xfs_sync_data/xfs_sync_attr calls on umount On the final put of a superblock the VFS already calls sync_filesystem for us to write out all data and wait for it. No need to start another asynchronous writeback inside ->put_super. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:42 -05:00
Christoph Hellwig	f2bde9b89b	xfs: small cleanups for xfs_iomap / __xfs_get_blocks Remove the flags argument to __xfs_get_blocks as we can easily derive it from the direct argument, and remove the unused BMAPI_MMAP flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:42 -05:00
Christoph Hellwig	3070451eea	xfs: reduce stack usage in xfs_iomap xfs_iomap passes a xfs_bmbt_irec pointer to xfs_iomap_write_direct and xfs_iomap_write_allocate to give them the results of our read-only xfs_bmapi query. Instead of allocating a new xfs_bmbt_irec on stack for the next call to xfs_bmapi re use the one we got passed as it's not used after this point. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:42 -05:00
Christoph Hellwig	7a36c8a98a	xfs: avoid synchronous transaction in xfs_fs_write_inode We already rely on the fact that the sync code will cause a synchronous log force later on (currently via xfs_fs_sync_fs -> xfs_quiesce_data -> xfs_sync_data), so no need to do this here. This allows us to avoid a lot of synchronous log forces during sync, which pays of especially with delayed logging enabled. Some compilebench numbers that show this: xfs (delayed logging, 256k logbufs) =================================== intial create 25.94 MB/s 25.75 MB/s 25.64 MB/s create 8.54 MB/s 9.12 MB/s 9.15 MB/s patch 2.47 MB/s 2.47 MB/s 3.17 MB/s compile 29.65 MB/s 30.51 MB/s 27.33 MB/s clean 90.92 MB/s 98.83 MB/s 128.87 MB/s read tree 11.90 MB/s 11.84 MB/s 8.56 MB/s read compiled 28.75 MB/s 29.96 MB/s 24.25 MB/s delete tree 8.39 seconds 8.12 seconds 8.46 seconds delete compiled 8.35 seconds 8.44 seconds 5.11 seconds stat tree 6.03 seconds 5.59 seconds 5.19 seconds stat compiled tree 9.00 seconds 9.52 seconds 8.49 seconds xfs + write_inode log_force removal =================================== intial create 25.87 MB/s 25.76 MB/s 25.87 MB/s create 15.18 MB/s 14.80 MB/s 14.94 MB/s patch 3.13 MB/s 3.14 MB/s 3.11 MB/s compile 36.74 MB/s 37.17 MB/s 36.84 MB/s clean 226.02 MB/s 222.58 MB/s 217.94 MB/s read tree 15.14 MB/s 15.02 MB/s 15.14 MB/s read compiled tree 29.30 MB/s 29.31 MB/s 29.32 MB/s delete tree 6.22 seconds 6.14 seconds 6.15 seconds delete compiled tree 5.75 seconds 5.92 seconds 5.81 seconds stat tree 4.60 seconds 4.51 seconds 4.56 seconds stat compiled tree 4.07 seconds 3.87 seconds 3.96 seconds In addition to that also remove the delwri inode flush that is unessecary now that bulkstat is always coherent. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:41 -05:00
Christoph Hellwig	20cb52ebd1	xfs: simplify xfs_vm_writepage The writepage implementation in XFS still tries to deal with dirty but unmapped buffers which used to caused by writes through shared mmaps. Since the introduction of ->page_mkwrite these can't happen anymore, so remove the code dealing with them. Note that the all_bh variable which causes us to start I/O on all buffers on the pages was controlled by the count of unmapped buffers, which also included those not actually dirty. It's now unconditionally initialized to 0 but set to 1 for the case of small file size extensions. It probably can be removed entirely, but that's left for another patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:41 -05:00
Christoph Hellwig	89f3b36396	xfs: simplify xfs_vm_releasepage Currently the xfs releasepage implementation has code to deal with converting delayed allocated and unwritten space. But we never get called for those as we always convert delayed and unwritten space when cleaning a page, or drop the state from the buffers in block_invalidatepage. We still keep a WARN_ON on those cases for now, but remove all the case dealing with it, which allows to fold xfs_page_state_convert into xfs_vm_writepage and remove the !startio case from the whole writeback path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:40 -05:00
Eric Sandeen	3d9b02e3c7	xfs: fix corruption case for block size < page size xfstests 194 first truncats a file back and then extends it again by truncating it to a larger size. This causes discard_buffer to drop the mapped, but not the uptodate bit and thus creates something that xfs_page_state_convert takes for unmapped space created by mmap because it doesn't check for the dirty bit, which also gets cleared by discard_buffer and checked by other ->writepage implementations like block_write_full_page. Handle this kind of buffers early, and unlike Eric's first version of the patch simply ASSERT that the buffers is dirty, given that the mmap write case can't happen anymore since the introduction of ->page_mkwrite. The now dead code dealing with that will be deleted in a follow on patch. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:40 -05:00
Christoph Hellwig	b4e9181e77	xfs: remove unused delta tracking code in xfs_bmapi This code was introduced four years ago in commit `3e57ecf640` without any review and has been unused since. Remove it just as the rest of the code introduced in that commit to reduce that stack usage and complexity in this central piece of code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:39 -05:00
Christoph Hellwig	cd8b0bb3c4	xfs: remove unused XFS_BMAPI_ flags Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:39 -05:00
Christoph Hellwig	a59f55703c	xfs: remove the unused XFS_TRANS_NOSLEEP/XFS_TRANS_WAIT flags Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:38 -05:00
Christoph Hellwig	9134c2332e	xfs: remove the unused XFS_LOG_SLEEP and XFS_LOG_NOSLEEP flags Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:38 -05:00
Christoph Hellwig	dbb2f6529f	xfs: kill the unused xlog_debug variable Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:37 -05:00
Christoph Hellwig	4e0d5f926b	xfs: fix the xfs_log_iovec i_addr type By making this member a void pointer we can get rid of a lot of pointless casts. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:36 -05:00
Christoph Hellwig	898621d5a7	xfs: simplify inode to transaction joining Currently we need to either call IHOLD or xfs_trans_ihold on an inode when joining it to a transaction via xfs_trans_ijoin. This patches instead makes xfs_trans_ijoin usable on it's own by doing an implicity xfs_trans_ihold, which also allows us to drop the third argument. For the case where we want to hold a reference on the inode a xfs_trans_ijoin_ref wrapper is added which does the IHOLD and marks the inode for needing an xfs_iput. In addition to the cleaner interface to the caller this also simplifies the implementation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:36 -05:00
Christoph Hellwig	4d16e9246f	xfs: simplify buffer pinning Get rid of the xfs_buf_pin/xfs_buf_unpin/xfs_buf_ispin helpers and opencode them in their only callers, just like we did for the inode pinning a while ago. Also remove duplicate trace points - the bufitem tracepoints cover all the information that is present in a buffer tracepoint. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:36 -05:00
Christoph Hellwig	ca30b2a7b7	xfs: give li_cb callbacks the correct prototype Stop the function pointer casting madness and give all the li_cb instances correct prototype. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:35 -05:00
Christoph Hellwig	7bfa31d8e0	xfs: give xfs_item_ops methods the correct prototypes Stop the function pointer casting madness and give all the xfs_item_ops the correct prototypes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:35 -05:00
Christoph Hellwig	9412e3181c	xfs: merge iop_unpin_remove into iop_unpin The unpin_remove item operation instances always share most of the implementation with the respective unpin implementation. So instead of keeping two different entry points add a remove flag to the unpin operation and share the code more easily. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:34 -05:00
Christoph Hellwig	e98c414f9a	xfs: simplify log item descriptor tracking Currently we track log item descriptor belonging to a transaction using a complex opencoded chunk allocator. This code has been there since day one and seems to work around the lack of an efficient slab allocator. This patch replaces it with dynamically allocated log item descriptors from a dedicated slab pool, linked to the transaction by a linked list. This allows to greatly simplify the log item descriptor tracking to the point where it's just a couple hundred lines in xfs_trans.c instead of a separate file. The external API has also been simplified while we're at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/ delete items from a transaction have been simplified to the bare minium, and the xfs_trans_find_item function is replaced with a direct dereference of the li_desc field. All debug code walking the list of log items in a transaction is down to a simple list_for_each_entry. Note that we could easily use a singly linked list here instead of the double linked list from list.h as the fastpath only does deletion from sequential traversal. But given that we don't have one available as a library function yet I use the list.h functions for simplicity. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:34 -05:00
Christoph Hellwig	3400777ff0	xfs: remove unneeded #include statements Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-07-26 13:16:33 -05:00
Christoph Hellwig	288699feca	xfs: drop dmapi hooks Dmapi support was never merged upstream, but we still have a lot of hooks bloating XFS for it, all over the fast pathes of the filesystem. This patch drops over 700 lines of dmapi overhead. If we'll ever get HSM support in mainline at least the namespace events can be done much saner in the VFS instead of the individual filesystem, so it's not like this is much help for future work. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-07-26 13:16:33 -05:00
Christoph Hellwig	ade7ce31c2	quota: Clean up the namespace in dqblk_xfs.h Almost all identifiers use the FS_* namespace, so rename the missing few XFS_* ones to FS_* as well. Without this some people might get upset about having too many XFS names in generic code. Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-07-21 16:01:46 +02:00
Dave Chinner	16fd536737	xfs: track AGs with reclaimable inodes in per-ag radix tree https://bugzilla.kernel.org/show_bug.cgi?id=16348 When the filesystem grows to a large number of allocation groups, the summing of recalimable inodes gets expensive. In many cases, most AGs won't have any reclaimable inodes and so we are wasting CPU time aggregating over these AGs. This is particularly important for the inode shrinker that gets called frequently under memory pressure. To avoid the overhead, track AGs with reclaimable inodes in the per-ag radix tree so that we can find all the AGs with reclaimable inodes via a simple gang tag lookup. This involves setting the tag when the first reclaimable inode is tracked in the AG, and removing the tag when the last reclaimable inode is removed from the tree. Then the summation process becomes a loop walking the radix tree summing AGs with the reclaim tag set. This significantly reduces the overhead of scanning - a 6400 AG filesystea now only uses about 25% of a cpu in kswapd while slab reclaim progresses instead of being permanently stuck at 100% CPU and making little progress. Clean filesystems filesystems will see no overhead and the overhead only increases linearly with the number of dirty AGs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-07-20 09:43:39 +10:00
Dave Chinner	70e60ce715	xfs: convert inode shrinker to per-filesystem contexts Now the shrinker passes us a context, wire up a shrinker context per filesystem. This allows us to remove the global mount list and the locking problems that introduced. It also means that a shrinker call does not need to traverse clean filesystems before finding a filesystem with reclaimable inodes. This significantly reduces scanning overhead when lots of filesystems are present. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-07-20 08:07:02 +10:00
Dave Chinner	7f8275d0d6	mm: add context argument to shrinker callback The current shrinker implementation requires the registered callback to have global state to work from. This makes it difficult to shrink caches that are not global (e.g. per-filesystem caches). Pass the shrinker structure to the callback so that users can embed the shrinker structure in the context the shrinker needs to operate on and get back to it in the callback via container_of(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-07-19 14:56:17 +10:00
Dave Chinner	7b6259e7a8	xfs: remove block number from inode lookup code The block number comes from bulkstat based inode lookups to shortcut the mapping calculations. We ar enot able to trust anything from bulkstat, so drop the block number as well so that the correct lookups and mappings are always done. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-06-24 11:35:17 +10:00
Dave Chinner	1920779e67	xfs: rename XFS_IGET_BULKSTAT to XFS_IGET_UNTRUSTED Inode numbers may come from somewhere external to the filesystem (e.g. file handles, bulkstat information) and so are inherently untrusted. Rename the flag we use for these lookups to make it obvious we are doing a lookup of an untrusted inode number and need to verify it completely before trying to read it from disk. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-06-24 11:15:47 +10:00
Dave Chinner	7124fe0a5b	xfs: validate untrusted inode numbers during lookup When we decode a handle or do a bulkstat lookup, we are using an inode number we cannot trust to be valid. If we are deleting inode chunks from disk (default noikeep mode), then we cannot trust the on disk inode buffer for any given inode number to correctly reflect whether the inode has been unlinked as the di_mode nor the generation number may have been updated on disk. This is due to the fact that when we delete an inode chunk, we do not write the clusters back to disk when they are removed - instead we mark them stale to avoid them being written back potentially over the top of something that has been subsequently allocated at that location. The result is that we can have locations of disk that look like they contain valid inodes but in reality do not. Hence we cannot simply convert the inode number to a block number and read the location from disk to determine if the inode is valid or not. As a result, and XFS_IGET_BULKSTAT lookup needs to actually look the inode up in the inode allocation btree to determine if the inode number is valid or not. It should be noted even on ikeep filesystems, there is the possibility that blocks on disk may look like valid inode clusters. e.g. if there are filesystem images hosted on the filesystem. Hence even for ikeep filesystems we really need to validate that the inode number is valid before issuing the inode buffer read. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-06-24 11:15:33 +10:00
Christoph Hellwig	7dce11dbac	xfs: always use iget in bulkstat The non-coherent bulkstat versionsthat look directly at the inode buffers causes various problems with performance optimizations that make increased use of just logging inodes. This patch makes bulkstat always use iget, which should be fast enough for normal use with the radix-tree based inode cache introduced a while ago. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-06-23 18:11:11 +10:00
Dan Rosenberg	1817176a86	xfs: prevent swapext from operating on write-only files This patch prevents user "foo" from using the SWAPEXT ioctl to swap a write-only file owned by user "bar" into a file owned by "foo" and subsequently reading it. It does so by checking that the file descriptors passed to the ioctl are also opened for reading. Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-06-24 12:07:47 +10:00
Dave Chinner	254c8c2dbf	xfs: remove nr_to_write writeback windup. Now that the background flush code has been fixed, we shouldn't need to silently multiply the wbc->nr_to_write to get good writeback. Remove that code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-08 18:12:44 -07:00
Alex Elder	1bf7dbfde8	Merge branch 'master' into for-linus	2010-06-04 13:22:30 -05:00
Christoph Hellwig	f936972949	xfs: improve xfs_isilocked Use rwsem_is_locked to make the assertations for shared locks work. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-06-03 16:22:29 +10:00
Christoph Hellwig	070ecdca54	xfs: skip writeback from reclaim context Allowing writeback from reclaim context causes massive problems with stack overflows as we can call into the writeback code which tends to be a heavy stack user both in the generic code and XFS from random contexts that perform memory allocations. Follow the example of btrfs (and in slightly different form ext4) and refuse to write out data from reclaim context. This issue should really be handled by the VM so that we can tune better for this case, but until we get it sorted out there we have to hack around this in each filesystem with a complex writeback path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-06-03 16:22:29 +10:00
Dave Chinner	5b257b4a1f	xfs: fix race in inode cluster freeing failing to stale inodes When an inode cluster is freed, it needs to mark all inodes in memory as XFS_ISTALE before marking the buffer as stale. This is eeded because the inodes have a different life cycle to the buffer, and once the buffer is torn down during transaction completion, we must ensure none of the inodes get written back (which is what XFS_ISTALE does). Unfortunately, xfs_ifree_cluster() has some bugs that lead to inodes not being marked with XFS_ISTALE. This shows up when xfs_iflush() is called on these inodes either during inode reclaim or tail pushing on the AIL. The buffer is read back, but no longer contains inodes and so triggers assert failures and shutdowns. This was reproducable with at run.dbench10 invocation from xfstests. There are two main causes of xfs_ifree_cluster() failing. The first is simple - it checks in-memory inodes it finds in the per-ag icache to see if they are clean without holding the flush lock. if they are clean it skips them completely. However, If an inode is flushed delwri, it will appear clean, but is not guaranteed to be written back until the flush lock has been dropped. Hence we may have raced on the clean check and the inode may actually be dirty. Hence always mark inodes found in memory stale before we check properly if they are clean. The second is more complex, and makes the first problem easier to hit. Basically the in-memory inode scan is done with full knowledge it can be racing with inode flushing and AIl tail pushing, which means that inodes that it can't get the flush lock on might not be attached to the buffer after then in-memory inode scan due to IO completion occurring. This is actually documented in the code as "needs better interlocking". i.e. this is a zero-day bug. Effectively, the in-memory scan must be done while the inode buffer is locked and Io cannot be issued on it while we do the in-memory inode scan. This ensures that inodes we couldn't get the flush lock on are guaranteed to be attached to the cluster buffer, so we can then catch all in-memory inodes and mark them stale. Now that the inode cluster buffer is locked before the in-memory scan is done, there is no need for the two-phase update of the in-memory inodes, so simplify the code into two loops and remove the allocation of the temporary buffer used to hold locked inodes across the phases. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-06-03 16:22:29 +10:00
Christoph Hellwig	fb3b504ade	xfs: fix access to upper inodes without inode64 If a filesystem is mounted without the inode64 mount option we should still be able to access inodes not fitting into 32 bits, just not created new ones. For this to work we need to make sure the inode cache radix tree is initialized for all allocation groups, not just those we plan to allocate inodes from. This patch makes sure we initialize the inode cache radix tree for all allocation groups, and also cleans xfs_initialize_perag up a bit to separate the inode32 logical from the general perag structure setup. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:56 -05:00
Dave Chinner	9b98b6f3e1	xfs: fix might_sleep() warning when initialising per-ag tree The use of radix_tree_preload() only works if the radix tree was initialised without the __GFP_WAIT flag. The per-ag tree uses GFP_NOFS, so does not trigger allocation of new tree nodes from the preloaded array. Hence it enters the allocator with a spinlock held and triggers the might_sleep() warnings. Reported-by; Chris Mason <chris.mason@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:50 -05:00
Julia Lawall	38e712ab3d	fs/xfs/quota: Add missing mutex_unlock Add a mutex_unlock missing on the error path. The use of this lock is balanced elsewhere in the file. The semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression E1; @@ * mutex_lock(E1,...); <+... when != E1 if (...) { ... when != E1 * return ...; } ...+> * mutex_unlock(E1,...); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:41 -05:00
Huang Weiyi	3bd0946eb1	xfs: remove duplicated #include Remove duplicated #include('s) in fs/xfs/linux-2.6/xfs_quotaops.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:36 -05:00
Li Zefan	f8adb4d574	xfs: convert more trace events to DEFINE_EVENT Use DECLARE_EVENT_CLASS, and save ~15K: text data bss dec hex filename 171949 43028 48 215025 347f1 fs/xfs/linux-2.6/xfs_trace.o.orig 156521 43028 36 199585 30ba1 fs/xfs/linux-2.6/xfs_trace.o No change in functionality. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:31 -05:00
Huang Weiyi	292ec4cf35	xfs: xfs_trace.c: remove duplicated #include Remove duplicated #include('s) in fs/xfs/linux-2.6/xfs_trace.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:24 -05:00
Dave Chinner	07f1a4f5e8	xfs: Check new inode size is OK before preallocating The new xfsqa test 228 tries to preallocate more space than the filesystem contains. it should fail, but instead triggers an assert about lock flags. The failure is due to the size extension failing in vmtruncate() due to rlimit being set. Check this before we start the preallocation to avoid allocating space that will never be used. Also the path through xfs_vn_allocate already holds the IO lock, so it should not be present in the lock flags when the setattr fails. Hence the assert needs to take this into account. This will prevent other such callers from hitting this incorrect ASSERT. (Fixed a reference to "newsize" to read "new_size". -Alex) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:12 -05:00
Christoph Hellwig	fdc07f44c8	xfs: clean up xlog_align Add suggested cleanups to commit 29db3370a1369541d58d692fbfb168b8a0bd7f41 from review that didn't end up being commited. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 14:58:36 -05:00
Christoph Hellwig	025101dca4	xfs: cleanup log reservation calculactions Instead of having small helper functions calling big macros do the calculations for the log reservations directly in the functions. These are mostly 1:1 from the macros execept that the macros kept the quota calculations in their callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 14:58:30 -05:00
Eric Sandeen	32891b292d	xfs: be more explicit if RT mount fails due to config Recent testers were slightly confused that a realtime mount failed due to missing CONFIG_XFS_RT; we can make that a little more obvious. V2: drop the else as suggested by Christoph Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 14:58:24 -05:00
Eric Sandeen	657a4cffde	xfs: replace E2BIG with EFBIG where appropriate Many places in the xfs code return E2BIG when they really mean EFBIG; trying to grow past 16T on a 32 bit machine, for example, says "Argument list too long" rather than "File too large" which is not particularly helpful. Some of these don't make perfect sense as EFBIG either, but still better than E2BIG IMHO. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 14:58:16 -05:00
Christoph Hellwig	7ea8085910	drop unused dentry argument to ->fsync Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:05:02 -04:00
Alex Elder	88e88374ee	Merge branch 'delayed-logging-for-2.6.35' into for-linus	2010-05-24 11:57:36 -05:00
Dave Chinner	ccf7c23fc1	xfs: Ensure inode allocation buffers are fully replayed With delayed logging, we can get inode allocation buffers in the same transaction inode unlink buffers. We don't currently mark inode allocation buffers in the log, so inode unlink buffers take precedence over allocation buffers. The result is that when they are combined into the same checkpoint, only the unlinked inode chain fields are replayed, resulting in uninitialised inode buffers being detected when the next inode modification is replayed. To fix this, we need to ensure that we do not set the inode buffer flag in the buffer log item format flags if the inode allocation has not already hit the log. To avoid requiring a change to log recovery, we really need to make this a modification that relies only on in-memory sate. We can do this by checking during buffer log formatting (while the CIL cannot be flushed) if we are still in the same sequence when we commit the unlink transaction as the inode allocation transaction. If we are, then we do not add the inode buffer flag to the buffer log format item flags. This means the entire buffer will be replayed, not just the unlinked fields. We do this while CIL flusheѕ are locked out to ensure that we don't race with the sequence numbers changing and hence fail to put the inode buffer flag in the buffer format flags when we really need to. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:41:22 -05:00
Dave Chinner	df806158b0	xfs: enable background pushing of the CIL If we let the CIL grow without bound, it will grow large enough to violate recovery constraints (must be at least one complete transaction in the log at all times) or take forever to write out through the log buffers. Hence we need a check during asynchronous transactions as to whether the CIL needs to be pushed. We track the amount of log space the CIL consumes, so it is relatively simple to limit it on a pure size basis. Make the limit the minimum of just under half the log size (recovery constraint) or 8MB of log space (which is an awful lot of metadata). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:38:20 -05:00
Dave Chinner	9da1ab181a	xfs: forced unmounts need to push the CIL If the filesystem is being shut down and the there is no log error, the current code forces out the current log buffers. This code now needs to push the CIL before it forces out the log buffers to acheive the same result. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:38:14 -05:00
Dave Chinner	71e330b593	xfs: Introduce delayed logging core code The delayed logging code only changes in-memory structures and as such can be enabled and disabled with a mount option. Add the mount option and emit a warning that this is an experimental feature that should not be used in production yet. We also need infrastructure to track committed items that have not yet been written to the log. This is what the Committed Item List (CIL) is for. The log item also needs to be extended to track the current log vector, the associated memory buffer and it's location in the Commit Item List. Extend the log item and log vector structures to enable this tracking. To maintain the current log format for transactions with delayed logging, we need to introduce a checkpoint transaction and a context for tracking each checkpoint from initiation to transaction completion. This includes adding a log ticket for tracking space log required/used by the context checkpoint. To track all the changes we need an io vector array per log item, rather than a single array for the entire transaction. Using the new log vector structure for this requires two passes - the first to allocate the log vector structures and chain them together, and the second to fill them out. This log vector chain can then be passed to the CIL for formatting, pinning and insertion into the CIL. Formatting of the log vector chain is relatively simple - it's just a loop over the iovecs on each log vector, but it is made slightly more complex because we re-write the iovec after the copy to point back at the memory buffer we just copied into. This code also needs to pin log items. If the log item is not already tracked in this checkpoint context, then it needs to be pinned. Otherwise it is already pinned and we don't need to pin it again. The only other complexity is calculating the amount of new log space the formatting has consumed. This needs to be accounted to the transaction in progress, and the accounting is made more complex becase we need also to steal space from it for log metadata in the checkpoint transaction. Calculate all this at insert time and update all the tickets, counters, etc correctly. Once we've formatted all the log items in the transaction, attach the busy extents to the checkpoint context so the busy extents live until checkpoint completion and can be processed at that point in time. Transactions can then be freed at this point in time. Now we need to issue checkpoints - we are tracking the amount of log space used by the items in the CIL, so we can trigger background checkpoints when the space usage gets to a certain threshold. Otherwise, checkpoints need ot be triggered when a log synchronisation point is reached - a log force event. Because the log write code already handles chained log vectors, writing the transaction is trivial, too. Construct a transaction header, add it to the head of the chain and write it into the log, then issue a commit record write. Then we can release the checkpoint log ticket and attach the context to the log buffer so it can be called during Io completion to complete the checkpoint. We also need to allow for synchronising multiple in-flight checkpoints. This is needed for two things - the first is to ensure that checkpoint commit records appear in the log in the correct sequence order (so they are replayed in the correct order). The second is so that xfs_log_force_lsn() operates correctly and only flushes and/or waits for the specific sequence it was provided with. To do this we need a wait variable and a list tracking the checkpoint commits in progress. We can walk this list and wait for the checkpoints to change state or complete easily, an this provides the necessary synchronisation for correct operation in both cases. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:38:03 -05:00
Dave Chinner	ed3b4d6cdc	xfs: Improve scalability of busy extent tracking When we free a metadata extent, we record it in the per-AG busy extent array so that it is not re-used before the freeing transaction hits the disk. This array is fixed size, so when it overflows we make further allocation transactions synchronous because we cannot track more freed extents until those transactions hit the disk and are completed. Under heavy mixed allocation and freeing workloads with large log buffers, we can overflow this array quite easily. Further, the array is sparsely populated, which means that inserts need to search for a free slot, and array searches often have to search many more slots that are actually used to check all the busy extents. Quite inefficient, really. To enable this aspect of extent freeing to scale better, we need a structure that can grow dynamically. While in other areas of XFS we have used radix trees, the extents being freed are at random locations on disk so are better suited to being indexed by an rbtree. So, use a per-AG rbtree indexed by block number to track busy extents. This incures a memory allocation when marking an extent busy, but should not occur too often in low memory situations. This should scale to an arbitrary number of extents so should not be a limitation for features such as in-memory aggregation of transactions. However, there are still situations where we can't avoid allocating busy extents (such as allocation from the AGFL). To minimise the overhead of such occurences, we need to avoid doing a synchronous log force while holding the AGF locked to ensure that the previous transactions are safely on disk before we use the extent. We can do this by marking the transaction doing the allocation as synchronous rather issuing a log force. Because of the locking involved and the ordering of transactions, the synchronous transaction provides the same guarantees as a synchronous log force because it ensures that all the prior transactions are already on disk when the synchronous transaction hits the disk. i.e. it preserves the free->allocate order of the extent correctly in recovery. By doing this, we avoid holding the AGF locked while log writes are in progress, hence reducing the length of time the lock is held and therefore we increase the rate at which we can allocate and free from the allocation group, thereby increasing overall throughput. The only problem with this approach is that when a metadata buffer is marked stale (e.g. a directory block is removed), then buffer remains pinned and locked until the log goes to disk. The issue here is that if that stale buffer is reallocated in a subsequent transaction, the attempt to lock that buffer in the transaction will hang waiting the log to go to disk to unlock and unpin the buffer. Hence if someone tries to lock a pinned, stale, locked buffer we need to push on the log to get it unlocked ASAP. Effectively we are trading off a guaranteed log force for a much less common trigger for log force to occur. Ideally we should not reallocate busy extents. That is a much more complex fix to the problem as it involves direct intervention in the allocation btree searches in many places. This is left to a future set of modifications. Finally, now that we track busy extents in allocated memory, we don't need the descriptors in the transaction structure to point to them. We can replace the complex busy chunk infrastructure with a simple linked list of busy extents. This allows us to remove a large chunk of code, making the overall change a net reduction in code size. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:34:00 -05:00
Dave Chinner	955833cf2a	xfs: make the log ticket ID available outside the log infrastructure The ticket ID is needed to uniquely identify transactions when doing busy extent matching. Delayed logging changes the lifecycle of busy extents with respect to the transaction structure lifecycle. Hence we can no longer use the transaction structure as a means of determining the owner of the busy extent as it may be freed and reused while the busy extent is still active. This commit provides the infrastructure to access the xlog_tid_t held in the ticket from a transaction handle. This avoids the need for callers to peek into the transaction and log structures to find this out. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:52 -05:00
Dave Chinner	169a7b078e	xfs: clean up log ticket overrun debug output Push the error message output when a ticket overrun is detected into the ticket printing functions. Also remove the debug version of the code as the production version will still panic just as effectively on a debug kernel via the panic mask being set. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:46 -05:00
Dave Chinner	c11554104f	xfs: Clean up XFS_BLI_* flag namespace Clean up the buffer log format (XFS_BLI_) flags because they have a polluted namespace. They XFS_BLI_ prefix is used for both in-memory and on-disk flag feilds, but have overlapping values for different flags. Rename the buffer log format flags to use the XFS_BLF_ prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed flags. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:39 -05:00
Dave Chinner	64fc35de60	xfs: modify buffer item reference counting The buffer log item reference counts used to take referenceѕ for every transaction, similar to the pin counting. This is symmetric (like the pin/unpin) with respect to transaction completion, but with dleayed logging becomes assymetric as the pinning becomes assymetric w.r.t. transaction completion. To make both cases the same, allow the buffer pinning to take a reference to the buffer log item and always drop the reference the transaction has on it when being unlocked. This is balanced correctly because the unpin operation always drops a reference to the log item. Hence reference counting becomes symmetric w.r.t. item pinning as well as w.r.t active transactions and as a result the reference counting model remain consistent between normal and delayed logging. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:31 -05:00
Dave Chinner	3383ca5780	xfs: allow log ticket allocation to take allocation flags Delayed logging currently requires ticket allocation to succeed, so we need to be able to sleep on allocation. It also should not allow memory allocation to recurse into the filesystem. hence we need to pass allocation flags directing the type of allocation the caller requires. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:17 -05:00
Dave Chinner	524ee36fa4	xfs: Don't reuse the same transaction ID for duplicated transactions. The transaction ID is written into the log as the unique identifier for transactions during recover. When duplicating a transaction, we reuse the log ticket, which means it has the same transaction ID as the previous transaction. Rather than regenerating a random transaction ID for the duplicated transaction, just add one to the current ID so that duplicated transaction can be easily spotted in the log and during recovery during problem diagnosis. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:10 -05:00
Linus Torvalds	e8bebe2f71	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits) fix handling of offsets in cris eeprom.c, get rid of fake on-stack files get rid of home-grown mutex in cris eeprom.c switch ecryptfs_write() to struct inode , kill on-stack fake files switch ecryptfs_get_locked_page() to struct inode simplify access to ecryptfs inodes in ->readpage() and friends AFS: Don't put struct file on the stack Ban ecryptfs over ecryptfs logfs: replace inode uid,gid,mode initialization with helper function ufs: replace inode uid,gid,mode initialization with helper function udf: replace inode uid,gid,mode init with helper ubifs: replace inode uid,gid,mode initialization with helper function sysv: replace inode uid,gid,mode initialization with helper function reiserfs: replace inode uid,gid,mode initialization with helper function ramfs: replace inode uid,gid,mode initialization with helper function omfs: replace inode uid,gid,mode initialization with helper function bfs: replace inode uid,gid,mode initialization with helper function ocfs2: replace inode uid,gid,mode initialization with helper function nilfs2: replace inode uid,gid,mode initialization with helper function minix: replace inode uid,gid,mode init with helper ext4: replace inode uid,gid,mode init with helper ... Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)	2010-05-21 19:37:45 -07:00
Stephen Hemminger	46e58764f0	xfs: constify xattr_handler Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:19 -04:00
Jens Axboe	ee9a3607fb	Merge branch 'master' into for-2.6.35 Conflicts: fs/ext3/fsync.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-21 21:27:26 +02:00
Christoph Hellwig	c472b43275	quota: unify ->set_dqblk Pass the larger struct fs_disk_quota to the ->set_dqblk operation so that the Q_SETQUOTA and Q_XSETQUOTA operations can be implemented with a single filesystem operation and we can retire the ->set_xquota operation. The additional information (RT-subvolume accounting and warn counts) are left zero for the VFS quota implementation. Add new fieldmask values for setting the numer of blocks and inodes values which is required for the VFS quota, but wasn't for XFS. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-21 19:30:44 +02:00
Christoph Hellwig	b9b2dd36c1	quota: unify ->get_dqblk Pass the larger struct fs_disk_quota to the ->get_dqblk operation so that the Q_GETQUOTA and Q_XGETQUOTA operations can be implemented with a single filesystem operation and we can retire the ->get_xquota operation. The additional information (RT-subvolume accounting and warn counts) are left zero for the VFS quota implementation. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-21 19:30:43 +02:00
Christoph Hellwig	b4ed4626a9	xfs: mark xfs_iomap_write_ helpers static And also drop a useless argument to xfs_iomap_write_direct. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:20 -05:00
Christoph Hellwig	bd1556a146	xfs: clean up end index calculation in xfs_page_state_convert Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:20 -05:00
Christoph Hellwig	2b8f12b7e4	xfs: clean up mapping size calculation in __xfs_get_blocks Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:19 -05:00
Christoph Hellwig	558e689169	xfs: clean up xfs_iomap_valid Rename all iomap_valid identifiers to imap_valid to fit the new world order, and clean up xfs_iomap_valid to convert the passed in offset to blocks instead of the imap values to bytes. Use the simpler inode->i_blkbits instead of the XFS macros for this. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:19 -05:00
Christoph Hellwig	34a52c6c06	xfs: move I/O type flags into xfs_aops.c The IOMAP_ flags are now only used inside xfs_aops.c for extent probing and I/O completion tracking, so more them here, and rename them to IO_* as there's no mapping involved at all. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:17 -05:00
Christoph Hellwig	207d041602	xfs: kill struct xfs_iomap Now that struct xfs_iomap contains exactly the same units as struct xfs_bmbt_irec we can just use the latter directly in the aops code. Replace the missing IOMAP_NEW flag with a new boolean output parameter to xfs_iomap. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:17 -05:00
Christoph Hellwig	e513182d4d	xfs: report iomap_bn in block base Report the iomap_bn field of struct xfs_iomap in terms of filesystem blocks instead of in terms of bytes. Shift the byte conversions into the caller, and replace the IOMAP_DELAY and IOMAP_HOLE flag checks with checks for HOLESTARTBLOCK and DELAYSTARTBLOCK. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:17 -05:00
Christoph Hellwig	8699bb0a48	xfs: report iomap_offset and iomap_bsize in block base Report the iomap_offset and iomap_bsize fields of struct xfs_iomap in terms of fsblocks instead of in terms of disk blocks. Shift the byte conversions into the callers temporarily, but they will disappear or get cleaned up later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:17 -05:00
Christoph Hellwig	9563b3d899	xfs: remove iomap_delta The iomap_delta field in struct xfs_iomap just contains the difference between the offset passed to xfs_iomap and the iomap_offset. Just calculate it in the only caller that cares. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:17 -05:00
Christoph Hellwig	046f1685bb	xfs: remove iomap_target Instead of using the iomap_target field in struct xfs_iomap and the IOMAP_REALTIME flag just use the already existing xfs_find_bdev_for_inode helper. There's some fallout as we need to pass the inode in a few more places, which we also use to sanitize some calling conventions. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:16 -05:00
Christoph Hellwig	826bf0adce	xfs: limit xfs_imap_to_bmap to a single mapping We only call xfs_iomap for single mappings anyway, so remove all code dealing with multiple mappings from xfs_imap_to_bmap and add asserts that we never get results that we do not expect. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:16 -05:00
Christoph Hellwig	4a5224d7b1	xfs: simplify buffer to transaction matching We currenly have a routine xfs_trans_buf_item_match_all which checks if any log item in a transaction contains a given buffer, and a second one that only does this check for the first, embedded chunk of log items. We only use the second routine if we know we only have that log item chunk, so get rid of the limited routine and always use the more complete one. Also rename the old xfs_trans_buf_item_match_all to xfs_trans_buf_item_match and update various surrounding comments, and move the remaining xfs_trans_buf_item_match on top of the file to avoid a forward prototype. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:16 -05:00
Tao Ma	2d1ff3c75a	xfs: Make fiemap work in query mode. According to Documentation/filesystems/fiemap.txt, If fm_extent_count is zero, then the fm_extents[] array is ignored (no extents will be returned), and the fm_mapped_extents count will hold the number of extents needed. But as the commit `97db39a1f6` has changed bmv_count to the caller's input buffer, this number query function can't work any more. As this commit is written to change bmv_count from MAXEXTNUM because of ENOMEM. This patch just try to set bm.bmv_count to something sane. Thanks to Dave Chinner <david@fromorbit.com> for the suggestion. Cc: Eric Sandeen <sandeen@redhat.com> Cc: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Tao Ma <tao.ma@oracle.com>	2010-05-19 09:58:16 -05:00
Alex Elder	48389ef175	xfs: kill off l_sectbb_mask There remains only one user of the l_sectbb_mask field in the log structure. Just kill it off and compute the mask where needed from the power-of-2 sector size. (Only update from last post is to accomodate the changes in the previous patch in the series.) Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:16 -05:00
Alex Elder	69ce58f08a	xfs: record log sector size rather than log2(that) Change struct log so it keeps track of the size (in basic blocks) of a log sector in l_sectBBsize rather than the log-base-2 of that value (previously, l_sectbb_log). The name was chosen for consistency with the other fields in the structure that represent a number of basic blocks. (Updated so that a variable used in computing and verifying a log's sector size is named "log2_size". Also added the "BB" to the structure field name, based on feedback from Eric Sandeen. Also dropped some superfluous parentheses.) Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2010-05-19 09:58:15 -05:00
Christoph Hellwig	1414a6046a	xfs: remove dead XFS_LOUD_RECOVERY code This can't be enabled through the build system and has been dead for ages. Note that the CRC patches add back log checksumming, but the code is quite different from the version removed here anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-05-19 09:58:15 -05:00
Christoph Hellwig	8112e9dc6d	xfs: removed unused XFS_QMOPT_ flags Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-05-19 09:58:15 -05:00
Christoph Hellwig	191f8488f9	xfs: remove a few macro indirections in the quota code Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-05-19 09:58:15 -05:00
Christoph Hellwig	8a7b8a89a3	xfs: access quotainfo structure directly Access fields in m_quotainfo directly instead of hiding them behind the XFS_QI_* macros. Add local variables for the quotainfo pointer in places where we have lots of them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-05-19 09:58:14 -05:00
Christoph Hellwig	37bc5743fd	xfs: wait for direct I/O to complete in fsync and write_inode We need to wait for all pending direct I/O requests before taking care of metadata in fsync and write_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-05-19 09:58:14 -05:00
Andrea Gelmini	fce1cad651	xfs: xfs_trace.c: duplicated include fs/xfs/linux-2.6/xfs_trace.c: xfs_attr_sf.h is included more than once. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:14 -05:00
Alex Elder	3f943d853d	xfs: minor odds and ends in xfs_log_recover.c Odds and ends in "xfs_log_recover.c". This patch just contains some minor things that didn't seem to warrant their own individual patches: - In xlog_bread_noalign(), drop an assertion that a pointer is non-null (the crash will tell us it was a bad pointer). - Add a more descriptive header comment for xlog_find_verify_cycle(). - Make a few additions to the comments in xlog_find_head(). Also rearrange some expressions in a few spots to produce the same result, but in a way that seems more clear what's being computed. (Updated in response to Dave's review comments. Note I did not split this patch like I said I would.) Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:14 -05:00
Alex Elder	e3bb2e30d5	xfs: avoid repeated pointer dereferences In xlog_find_cycle_start() use a local variable for some repeated operations rather than constantly accessing the memory location whose address is passed in. (This version drops an assertion that a pointer is non-null.) Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:14 -05:00
Alex Elder	9db127edb5	xfs: change a few labels in xfs_log_recover.c Rename a label used in xlog_find_head() that I thought was poorly chosen. Also combine two adjacent labels xlog_find_tail() into a single label, and give it a more generic name. (Now using Dave's suggested "validate_head" name for first label.) Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:13 -05:00
Christoph Hellwig	8c38366f99	xfs: enforce synchronous writes in xfs_bwrite xfs_bwrite is used with the intention of synchronously writing out buffers, but currently it does not actually clear the async flag if that's left from previous writes but instead implements async behaviour if it finds it. Remove the code handling asynchronous writes as we've got rid of those entirely outside of the log and delwri buffers, and make sure that we clear the async and read flags before writing the buffer. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:13 -05:00
Christoph Hellwig	df308bcfec	xfs: remove periodic superblock writeback All modifications to the superblock are done transactional through xfs_trans_log_buf, so there is no reason to initiate periodic asynchronous writeback. This only removes the superblock from the delwri list and will lead to sub-optimal I/O scheduling. Cut down xfs_sync_fsdata now that it's only used for synchronous superblock writes and move the log coverage checks into the two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:13 -05:00
Dave Chinner	f983710758	xfs: make the log ticket transaction id random The transaction ID that is written to the log for a transaction is currently set by taking the lower 32 bits of the memory address of the ticket structure. This is not guaranteed to be unique as tickets comes from a slab and slots can be reallocated immediately after being freed. As a result, there is no guarantee of uniqueness in the ticket ID value. Fix this by assigning a random number to the ticket ID field so that it is extremely unlikely that duplicates will occur and remove the possibility of transactions being mixed up during recovery due to duplicate IDs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:13 -05:00
Alex Elder	36adecff50	xfs: nothing special about 1-block log sector There are a number of places where a log sector size of 1 uses special case code. The round_up() and round_down() macros produce the correct result even when the log sector size is 1, and this eliminates the need for treating this as a special case. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:12 -05:00
Alex Elder	ff30a6221d	xfs: encapsulate bbcount validity checking Define a function that encapsulates checking the validity of a log block count. (Updated from previous version--no longer includes error reporting in the encapsulated validation function.) Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com>	2010-05-19 09:58:12 -05:00
Alex Elder	5c17f5339f	xfs: kill XLOG_SECTOR_ROUND*() XLOG_SECTOR_ROUNDUP_BBCOUNT() and XLOG_SECTOR_ROUNDDOWN_BLKNO() are now fairly simple macro translations. Just get rid of them in favor of the round_up() and round_down() macro calls they represent. Also, in spots in xlog_get_bp() and xlog_write_log_records(), round_up() was being called with value 1, which just evaluates to the macro's second argument; so just use that instead. In the latter case, make use of that value, as long as it's already been computed. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com>	2010-05-19 09:58:12 -05:00
Alex Elder	8511998baa	xfs: simplify XLOG_SECTOR_ROUND*() XLOG_SECTOR_ROUNDUP_BBCOUNT() is defined in "fs/xfs/xfs_log_recover.c" in an overly-complicated way. It is basically roundup(), but that is not at all clear from its definition. (Actually, there is another macro round_up() that applies for power-of-two-based masks which I'll be using here.) The operands in XLOG_SECTOR_ROUNDUP_BBCOUNT() are basically the block number (bbs) and the log sector basic block mask (log->l_sectbb_mask). I'll call them B and M for this discussion. The macro computes is value this way: M && (B & M) ? (B + M + 1) & ~M : B Put another way, we can break it into 3 cases: 1) ! M -> B # 0 mask, no effect 2) ! (B & M) -> B # sector aligned 3) M && (B & M) -> (B + M + 1) & ~M # round up otherwise The round_up() macro is cleverly defined using a value, v, and a power-of-2, p, and the result is the nearest multiple of p greater than or equal to v. Its value is computed something like this: ((v - 1) \| (p - 1)) + 1 Let's consider using this in the context of the 3 cases above. When p = 2^0 = 1, the result boils down to ((v - 1) \| 0) + 1, so it just translates any value v to itself. That handles case (1) above. When p = 2^n, n > 0, we know that (p - 1) will be a mask with all n bits 0..n-1 set. The condition in this case occurs when none of those mask bits is set in the value v provided. If that is the case, subtracting 1 from v will have 1's in all those lower bits (at least). Therefore, OR-ing the mask with that decremented value has no effect, so adding the 1 back again will just translate the v to itself. This handles case (2). Otherwise, the value v is greater than some multiple of p, and decrementing it will produce a result greater than or equal to that multiple. OR-ing in the mask will produce a value 1 less than the next multiple of p, so finally adding 1 back will result in the desired rounded-up value. This handles case (3). Hopefully this is convincing. While I was at it, I converted XLOG_SECTOR_ROUNDDOWN_BLKNO() to use the round_down() macro. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com>	2010-05-19 09:58:12 -05:00
Alex Elder	6881a229f6	xfs: fix min bufsize bugs in two places This fixes a bug in two places that I found by inspection. In xlog_find_verify_cycle() and xlog_write_log_records(), the code attempts to allocate a buffer to hold as many blocks as possible. It gives up if the number of blocks to be allocated gets too small. Right now it uses log->l_sectbb_log as that lower bound, but I'm sure it's supposed to be the actual log sector size instead. That is, the lower bound should be (1 << log->l_sectbb_log). Also define a simple macro xlog_sectbb(log) to represent the number of basic blocks in a sector for the given log. (No change from original submission; I have implemented Christoph's suggestion about storing l_sectsize rather than l_sectbb_log in a new, separate patch in this series.) Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com>	2010-05-19 09:58:12 -05:00
Alex Elder	a0e856b0b4	xfs: add const qualifiers to xfs error function args Change the tag and file name arguments to xfs_error_report() and xfs_corruption_error() to use a const qualifier. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com>	2010-05-19 09:58:11 -05:00
Christoph Hellwig	74457cf4a3	xfs: remove xfs_dqmarker The xfs_dqmarker structure does not need to exist anymore. Move the remaining flags field out of it and remove the structure altogether. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-05-19 09:58:11 -05:00
Dave Chinner	3a8406f6d6	xfs: convert the dquot free list to use list heads Convert the dquot free list on the filesystem to use listhead infrastructure rather than the roll-your-own in the quota code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:11 -05:00
Dave Chinner	e6a81f13aa	xfs: convert the dquot hash list to use list heads Convert the dquot hash list on the filesystem to use listhead infrastructure rather than the roll-your-own in the quota code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:11 -05:00
Dave Chinner	368e136174	xfs: remove duplicate code from dquot reclaim The dquot shaker and the free-list reclaim code use exactly the same algorithm but the code is duplicated and slightly different in each case. Make the shaker code use the single dquot reclaim code to remove the code duplication. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:11 -05:00
Dave Chinner	3a25404b3f	xfs: convert the per-mount dquot list to use list heads Convert the dquot list on the filesytesm to use listhead infrastructure rather than the roll-your-own in the quota code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:10 -05:00
Dave Chinner	9abbc539bf	xfs: add log item recovery tracing Currently there is no tracing in log recovery, so it is difficult to determine what is going on when something goes wrong. Add tracing for log item recovery to provide visibility into the log recovery process. The tracing added shows regions being extracted from the log transactions and added to the transaction hash forming recovery items, followed by the reordering, cancelling and finally recovery of the items. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:10 -05:00
Christoph Hellwig	e6b1f27370	xfs: clean up xlog_write_adv_cnt Replace the awkward xlog_write_adv_cnt with an inline helper that makes it more obvious that it's modifying it's paramters, and replace the use of an integer type for "ptr" with a real void pointer. Also move xlog_write_adv_cnt to xfs_log_priv.h as it will be used outside of xfs_log.c in the delayed logging series. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-05-19 09:58:10 -05:00
Dave Chinner	55b66332d0	xfs: introduce new internal log vector structure The current log IO vector structure is a flat array and not extensible. To make it possible to keep separate log IO vectors for individual log items, we need a method of chaining log IO vectors together. Introduce a new log vector type that can be used to wrap the existing log IO vectors on use that internally to the log. This means that the existing external interface (xfs_log_write) does not change and hence no changes to the transaction commit code are required. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-05-19 09:58:10 -05:00
Christoph Hellwig	99428ad0f6	xfs: reindent xlog_write Reindent xlog_write to normal one tab indents and move all variable declarations into the closest enclosing block. Split from a bigger patch by Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-05-19 09:58:10 -05:00
Dave Chinner	b5203cd0a4	xfs: factor xlog_write xlog_write is a mess that takes a lot of effort to understand. It is a mass of nested loops with 4 space indents to get it to fit in 80 columns and lots of funky variables that aren't obvious what they mean or do. Break it down into understandable chunks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-05-19 09:58:09 -05:00
Dave Chinner	9b9fc2b760	xfs: log ticket reservation underestimates the number of iclogs When allocation a ticket for a transaction, the ticket is initialised with the worst case log space usage based on the number of bytes the transaction may consume. Part of this calculation is the number of log headers required for the iclog space used up by the transaction. This calculation makes an undocumented assumption that if the transaction uses the log header space reservation on an iclog, then it consumes either the entire iclog or it completes. That is - the transaction that is first in an iclog is the transaction that the log header reservation is accounted to. If the transaction is larger than the iclog, then it will use the entire iclog itself. Document this assumption. Further, the current calculation uses the rule that we can fit iclog_size bytes of transaction data into an iclog. This is in correct - the amount of space available in an iclog for transaction data is the size of the iclog minus the space used for log record headers. This means that the calculation is out by 512 bytes per 32k of log space the transaction can consume. This is rarely an issue because maximally sized transactions are extremely uncommon, and for 4k block size filesystems maximal transaction reservations are about 400kb. Hence the error in this case is less than the size of an iclog, so that makes it even harder to hit. However, anyone using larger directory blocks (16k directory blocks push the maximum transaction size to approx. 900k on a 4k block size filesystem) or larger block size (e.g. 64k blocks push transactions to the 3-4MB size) could see the error grow to more than an iclog and at this point the transaction is guaranteed to get a reservation underrun and shutdown the filesystem. Fix this by adjusting the calculation to calculate the correct number of iclogs required and account for them all up front. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:09 -05:00
Dave Chinner	b1c1b5b610	xfs: Clean up xfs_trans_committed code after factoring Now that the code has been factored, clean up all the remaining style cruft, simplify the code and re-order functions so that it doesn't need forward declarations. Also move the remaining functions that require forward declarations (xfs_trans_uncommit, xfs_trans_free) so that all the forward declarations can be removed from the file. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:09 -05:00
Dave Chinner	8e646a55ac	xfs: update and factor xfs_trans_committed() The function header to xfs-trans_committed has long had this comment: * THIS SHOULD BE REWRITTEN TO USE xfs_trans_next_item() To prepare for different methods of committing items, convert the code to use xfs_trans_next_item() and factor the code into smaller, more digestible chunks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:09 -05:00
Christoph Hellwig	a3ccd2ca43	xfs: clean up xfs_trans_commit logic even more > +shut_us_down: > + shutdown = XFS_FORCED_SHUTDOWN(mp) ? EIO : 0; > + if (!(tp->t_flags & XFS_TRANS_DIRTY) \|\| shutdown) { > + xfs_trans_unreserve_and_mod_sb(tp); > + /* This whole area in _xfs_trans_commit is still a complete mess. So while touching this code, unravel this mess as well to make the whole flow of the function simpler and clearer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-05-19 09:58:08 -05:00
Dave Chinner	0924378a68	xfs: split out iclog writing from xfs_trans_commit() Split the the part of xfs_trans_commit() that deals with writing the transaction into the iclog into a separate function. This isolates the physical commit process from the logical commit operation and makes it easier to insert different transaction commit paths without affecting the existing algorithm adversely. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:08 -05:00
Dave Chinner	713bf88bba	xfs: fix reservation release commit flag in xfs_bmap_add_attrfork() xfs_bmap_add_attrfork() passes XFS_TRANS_PERM_LOG_RES to xfs_trans_commit() to indicate that the commit should release the permanent log reservation as part of the commit. This is wrong - the correct flag is XFS_TRANS_RELEASE_LOG_RES - and it is only by the chance that both these flags have the value of 0x4 that the code is doing the right thing. Fix it by changing to use the correct flag. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:08 -05:00
Dave Chinner	8e12385086	xfs: remove stale parameter from ->iop_unpin method The staleness of a object being unpinned can be directly derived from the object itself - there is no need to extract it from the object then pass it as a parameter into IOP_UNPIN(). This means we can kill the XFS_LID_BUF_STALE flag - it is set, checked and cleared in the same places XFS_BLI_STALE flag in the xfs_buf_log_item so it is now redundant and hence safe to remove. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:08 -05:00
Dave Chinner	4aaf15d1aa	xfs: Add inode pin counts to traces We don't record pin counts in inode events right now, and this makes it difficult to track down problems related to pinning inodes. Add the pin count to the inode trace class and add trace events for pinning and unpinning inodes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:08 -05:00
Dave Chinner	43f5efc5b5	xfs: factor log item initialisation Each log item type does manual initialisation of the log item. Delayed logging introduces new fields that need initialisation, so factor all the open coded initialisation into a common function first. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:07 -05:00
Jan Engelhardt	e2a07812e9	xfs: add blockdev name to kthreads This allows to see in `ps` and similar tools which kthreads are allotted to which block device/filesystem, similar to what jbd2 does. As the process name is a fixed 16-char array, no extra space is needed in tasks. PID TTY STAT TIME COMMAND 2 ? S 0:00 [kthreadd] 197 ? S 0:00 \_ [jbd2/sda2-8] 198 ? S 0:00 \_ [ext4-dio-unwrit] 204 ? S 0:00 \_ [flush-8:0] 2647 ? S 0:00 \_ [xfs_mru_cache] 2648 ? S 0:00 \_ [xfslogd/0] 2649 ? S 0:00 \_ [xfsdatad/0] 2650 ? S 0:00 \_ [xfsconvertd/0] 2651 ? S 0:00 \_ [xfsbufd/ram0] 2652 ? S 0:00 \_ [xfsaild/ram0] 2653 ? S 0:00 \_ [xfssyncd/ram0] Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-05-19 09:58:07 -05:00
Zhitong Wang	fda168c245	xfs: Fix integer overflow in fs/xfs/linux-2.6/xfs_ioctl*.c The am_hreq.opcount field in the xfs_attrmulti_by_handle() interface is not bounded correctly. The opcount is used to determine the size of the buffer required. The size is bounded, but can overflow and so the size checks may not be sufficient to catch invalid opcounts. Fix it by catching opcount values that would cause overflows before calculating the size. Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-05-19 09:58:07 -05:00
Dave Chinner	9bf729c0af	xfs: add a shrinker to background inode reclaim On low memory boxes or those with highmem, kernel can OOM before the background reclaims inodes via xfssyncd. Add a shrinker to run inode reclaim so that it inode reclaim is expedited when memory is low. This is more complex than it needs to be because the VM folk don't want a context added to the shrinker infrastructure. Hence we need to add a global list of XFS mount structures so the shrinker can traverse them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-04-29 16:22:13 -05:00
Jens Axboe	7407cf355f	Merge branch 'master' into for-2.6.35 Conflicts: fs/block_dev.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-04-29 09:36:24 +02:00
Dmitry Monakhov	fbd9b09a17	blkdev: generalize flags for blkdev_issue_fn functions The patch just convert all blkdev_issue_xxx function to common set of flags. Wait/allocation semantics preserved. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-04-28 19:47:36 +02:00
Dave Chinner	dd77ef924c	xfs: more swap extent fixes for dynamic fork offsets A new xfsqa test (226) with a prototype xfs_fsr change to try to handle dynamic fork offsets better triggers an assertion failure where the inode data fork is in btree format, yet there is room in the inode for it to be in extent format. The two inodes look like: before: ino 0x101 (target), num_extents 11, Max in-fork extents 6, broot size 40, fork offset 96 before: ino 0x115 (temp), num_extents 5, Max in-fork extents 3, broot size 40, fork offset 56 after: ino 0x101 (target), num_extents 5, Max in-fork extents 6, broot size 40, fork offset 96 after: ino 0x115 (temp), num_extents 11, Max in-fork extents 3, broot size 40, fork offset 56 Basically the target inode ends up with 5 extents in btree format, but it had space for 6 extents in extent format, so ends up incorrect. Notably here the broot size is the same, and that is where the kernel code is going wrong - the btree root will fit, so it lets the swap go ahead. The check should not allow the swap to take place if the number of extents while in btree format is less than the number of extents that can fit in the inode in extent format. Adding that check will prevent this swap and corruption from occurring. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-04-26 12:38:51 -05:00
Dave Chinner	f1d486a361	xfs: don't warn on EAGAIN in inode reclaim Any inode reclaim flush that returns EAGAIN will result in the inode reclaim being attempted again later. There is no need to issue a warning into the logs about this situation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-04-16 13:51:44 -05:00
Dave Chinner	b6f8dd49db	xfs: ensure that sync updates the log tail correctly Updates to the VFS layer removed an extra ->sync_fs call into the filesystem during the sync process (from the quota code). Unfortunately the sync code was unknowingly relying on this call to make sure metadata buffers were flushed via a xfs_buftarg_flush() call to move the tail of the log forward in memory before the final transactions of the sync process were issued. As a result, the old code would write a very recent log tail value to the log by the end of the sync process, and so a subsequent crash would leave nothing for log recovery to do. Hence in qa test 182, log recovery only replayed a small handle for inode fsync transactions in this case. However, with the removal of the extra ->sync_fs call, the log tail was now not moved forward with the inode fsync transactions near the end of the sync procese the first (and only) buftarg flush occurred after these transactions went to disk. The result is that log recovery now sees a large number of transactions for metadata that is already on disk. This usually isn't a problem, but when the transactions include inode chunk allocation, the inode create transactions and all subsequent changes are replayed as we cannt rely on what is on disk is valid. As a result, if the inode was written and contains unlogged changes, the unlogged changes are lost, thereby violating sync semantics. The fix is to always issue a transaction after the buftarg flush occurs is the log iѕ not idle or covered. This results in a dummy transaction being written that contains the up-to-date log tail value, which will be very recent. Indeed, it will be at least as recent as the old code would have left on disk, so log recovery will behave exactly as it used to in this situation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-04-16 13:51:23 -05:00
Tejun Heo	5a0e3ad6af	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>	2010-03-30 22:02:32 +09:00
Dave Chinner	e8c3753ce4	xfs: don't warn about page discards on shutdown If we are doing a forced shutdown, we can get lots of noise about delalloc pages being discarded. This is happens by design during a forced shutdown, so don't spam the logs with these messages. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-16 15:40:53 -05:00
Alex Elder	8a262e573d	xfs: use scalable vmap API Re-apply a commit that had been reverted due to regressions that have since been fixed. From `95f8e302c0` Mon Sep 17 00:00:00 2001 From: Nick Piggin <npiggin@suse.de> Date: Tue, 6 Jan 2009 14:43:09 +1100 Implement XFS's large buffer support with the new vmap APIs. See the vmap rewrite (`db64fe02`) for some numbers. The biggest improvement that comes from using the new APIs is avoiding the global KVA allocation lock on every call. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Only modifications here were a minor reformat, plus making the patch apply given the new use of xfs_buf_is_vmapped(). Modified-by: Alex Elder <aelder@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-16 15:40:36 -05:00
Alex Elder	cd9640a70d	xfs: remove old vmap cache Re-apply a commit that had been reverted due to regressions that have since been fixed. Original commit: `d2859751cd` Author: Nick Piggin <npiggin@suse.de> Date: Tue, 6 Jan 2009 14:40:44 +1100 XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps track of them in a list. To purge the batch, it just goes through the list and calls vunamp on each one. This is pretty poor: a global TLB flush is generally still performed on each vunmap, with the most expensive parts of the operation being the broadcast IPIs and locking involved in the SMP callouts, and the locking involved in the vmap management -- none of these are avoided by just batching up the calls. I'm actually surprised it ever made much difference. (Now that the lazy vmap allocator is upstream, this description is not quite right, but the vunmap batching still doesn't seem to do much). Rip all this logic out of XFS completely. I will improve vmap performance and scalability directly in subsequent patch. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> The only change I made was to use the "new" xfs_buf_is_vmapped() function in a place it had been open-coded in the original. Modified-by: Alex Elder <aelder@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-16 15:40:19 -05:00
Linus Torvalds	66ce3cf84d	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (21 commits) xfs: return inode fork offset in bulkstat for fsr xfs: Increase the default size of the reserved blocks pool xfs: truncate delalloc extents when IO fails in writeback xfs: check for more work before sleeping in xfssyncd xfs: Fix a build warning in xfs_aops.c xfs: fix locking for inode cache radix tree tag updates xfs: remove xfs_ipin/xfs_iunpin xfs: cleanup xfs_iunpin_wait/xfs_iunpin_nowait xfs: kill xfs_lrw.h xfs: factor common xfs_trans_bjoin code xfs: stop passing opaque handles to xfs_log.c routines xfs: split xfs_bmap_btalloc xfs: fix xfs_fsblock_t tracing xfs: fix inode pincount check in fsync xfs: Non-blocking inode locking in IO completion xfs: implement optimized fdatasync xfs: remove wrapper for the fsync file operation xfs: remove wrappers for read/write file operations xfs: merge xfs_lrw.c into xfs_file.c xfs: fix dquota trace format ...	2010-03-06 11:32:21 -08:00
Linus Torvalds	05c5cb31ec	Merge branch 'for-2.6.34' of git://linux-nfs.org/~bfields/linux * 'for-2.6.34' of git://linux-nfs.org/~bfields/linux: (22 commits) nfsd4: fix minor memory leak svcrpc: treat uid's as unsigned nfsd: ensure sockets are closed on error Revert "sunrpc: move the close processing after do recvfrom method" Revert "sunrpc: fix peername failed on closed listener" sunrpc: remove unnecessary svc_xprt_put NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN xfs_export_operations.commit_metadata commit_metadata export operation replacing nfsd_sync_dir lockd: don't clear sm_monitored on nsm_reboot_lookup lockd: release reference to nsm_handle in nlm_host_rebooted nfsd: Use vfs_fsync_range() in nfsd_commit NFSD: Create PF_INET6 listener in write_ports SUNRPC: NFS kernel APIs shouldn't return ENOENT for "transport not found" SUNRPC: Bury "#ifdef IPV6" in svc_create_xprt() NFSD: Support AF_INET6 in svc_addsock() function SUNRPC: Use rpc_pton() in ip_map_parse() nfsd: 4.1 has an rfc number nfsd41: Create the recovery entry for the NFSv4.1 client nfsd: use vfs_fsync for non-directories ...	2010-03-06 11:31:38 -08:00
Linus Torvalds	e213e26ab3	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits) quota: stop using QUOTA_OK / NO_QUOTA dquot: cleanup dquot initialize routine dquot: move dquot initialization responsibility into the filesystem dquot: cleanup dquot drop routine dquot: move dquot drop responsibility into the filesystem dquot: cleanup dquot transfer routine dquot: move dquot transfer responsibility into the filesystem dquot: cleanup inode allocation / freeing routines dquot: cleanup space allocation / freeing routines ext3: add writepage sanity checks ext3: Truncate allocated blocks if direct IO write fails to update i_size quota: Properly invalidate caches even for filesystems with blocksize < pagesize quota: generalize quota transfer interface quota: sb_quota state flags cleanup jbd: Delay discarding buffers in journal_unmap_buffer ext3: quota_write cross block boundary behaviour quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota quota: split out compat_sys_quotactl support from quota.c quota: split out netlink notification support from quota.c quota: remove invalid optimization from quota_sync_all ... Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c	2010-03-05 13:20:53 -08:00
Christoph Hellwig	a9185b41a4	pass writeback_control to ->write_inode This gives the filesystem more information about the writeback that is happening. Trond requested this for the NFS unstable write handling, and other filesystems might benefit from this too by beeing able to distinguish between the different callers in more detail. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-03-05 13:25:52 -05:00
Christoph Hellwig	26821ed40b	make sure data is on disk before calling ->write_inode Similar to the fsync issue fixed a while ago in commit `2daea67e96` we need to write for data to actually hit the disk before writing out the metadata to guarantee data integrity for filesystems that modify the inode in the data I/O completion path. Currently XFS and NFS handle this manually, and AFS has a write_inode method that does nothing but waiting for data, while others are possibly missing out on this. Fortunately this change has a lot less impact than the fsync change as none of the write_inode methods starts data writeout of any form by itself. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-03-05 13:25:10 -05:00
Alex Elder	9b1f56d60a	Merge branch 'for-2.6.34-rc1-batch2' into for-linus	2010-03-05 11:45:03 -06:00
Dave Chinner	07000ee686	xfs: return inode fork offset in bulkstat for fsr So that fsr can attempt to get the fork offset of the temporary inode it uses the same as the inode it is defragmenting, pass the fork offset out in the bulkstat information. The bulkstat structure has padding that has always been zeroed, so userspace can tell if this field is set or not by use of the xattr present flag and a non-zero value for the fork offset. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-05 11:02:07 -06:00
Dave Chinner	8babd8a2e7	xfs: Increase the default size of the reserved blocks pool The current default size of the reserved blocks pool is easy to deplete with certain workloads, in particular workloads that do lots of concurrent delayed allocation extent conversions. If enough transactions are running in parallel and the entire pool is consumed then subsequent calls to xfs_trans_reserve() will fail with ENOSPC. Also add a rate limited warning so we know if this starts happening again. This is an updated version of an old patch from Lachlan McIlroy. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-05 11:01:59 -06:00
Dave Chinner	3ed3a4343b	xfs: truncate delalloc extents when IO fails in writeback We currently use block_invalidatepage() to clean up pages where I/O fails in ->writepage(). Unfortunately, if the page has delalloc regions on it, we fail to remove the delalloc regions when we invalidate the page. This can result in tripping a BUG() in xfs_get_blocks() later on if a direct IO read is done on that same region - the delalloc extent is returned when none is supposed to be there. Fix this by truncating away the delalloc regions on the page before invalidating it. Because they are delalloc, we can do this without needing a transaction. Indeed - if we get ENOSPC errors, we have to be able to do this truncation without a transaction as there is no space left for block reservation (typically why we see a ENOSPC in writeback). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-05 11:01:53 -06:00
Dave Chinner	20f6b2c785	xfs: check for more work before sleeping in xfssyncd xfssyncd processes a queue of work by detaching the queue and then iterating over all the work items. It then sleeps for a time period or until new work comes in. If new work is queued while xfssyncd is actively processing the detached work queue, it will not process that new work until after a sleep timeout or the next work event queued wakes it. Fix this by checking the work queue again before going to sleep. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-05 11:01:45 -06:00
Dave Chinner	694189328a	xfs: Fix a build warning in xfs_aops.c Fix a build warning that slipped through. Dave Chinner had posted an updated version of his patch but the previous version--without this fix--was what got committed. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-05 11:01:22 -06:00
Christoph Hellwig	ac0e773718	quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota We already do these checks in the generic code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-03-05 00:20:25 +01:00
Christoph Hellwig	8c4e4acd66	quota: clean up Q_XQUOTASYNC Currently Q_XQUOTASYNC calls into the quota_sync method, but XFS does something entirely different in it than the rest of the filesystems. xfs_quota which calls Q_XQUOTASYNC expects an asynchronous data writeout to flush delayed allocations, while the "VFS" quota support wants to flush changes to the quota file. So make Q_XQUOTASYNC call into the writeback code directly and make the quota_sync method optional as XFS doesn't need in the sense expected by the rest of the quota code. GFS2 was using limited XFS-style quota and has a quota_sync method fitting neither the style used by vfs_quota_sync nor xfs_fs_quota_sync. I left it in for now as per discussion with Steve it expects to be called from the sync path this way. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-03-05 00:20:24 +01:00
J. Bruce Fields	4ea41e2de5	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs into for-2.6.34-incoming Resolve merge conflict in fs/xfs/linux-2.6/xfs_export.c.	2010-03-04 12:04:51 -05:00
Linus Torvalds	0a135ba14d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: add __percpu sparse annotations to what's left percpu: add __percpu sparse annotations to fs percpu: add __percpu sparse annotations to core kernel subsystems local_t: Remove leftover local.h this_cpu: Remove pageset_notifier this_cpu: Page allocator conversion percpu, x86: Generic inc / dec percpu instructions local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c module: Use this_cpu_xx to dynamically allocate counters local_t: Remove cpu_local_xx macros percpu: refactor the code in pcpu_[de]populate_chunk() percpu: remove compile warnings caused by __verify_pcpu_ptr() percpu: make accessors check for percpu pointer in sparse percpu: add __percpu for sparse. percpu: make access macros universal percpu: remove per_cpu__ prefix.	2010-03-03 07:34:18 -08:00
Christoph Hellwig	f1f724e4b5	xfs: fix locking for inode cache radix tree tag updates The radix-tree code requires it's users to serialize tag updates against other updates to the tree. While XFS protects tag updates against each other it does not serialize them against updates of the tree contents, which can lead to tag corruption. Fix the inode cache to always take pag_ici_lock in exclusive mode when updating radix tree tags. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Patrick Schreurs <patrick@news-service.com> Tested-by: Patrick Schreurs <patrick@news-service.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 19:14:36 -06:00
Christoph Hellwig	a14a5ab58f	xfs: remove xfs_ipin/xfs_iunpin Inodes are only pinned/unpinned via the inode item methods, and lots of code relies on that fact. So remove the separate xfs_ipin/xfs_iunpin helpers and merge them into their only callers. This also fixes up various duplicate and/or incorrect comments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:56 -06:00
Christoph Hellwig	60ec678371	xfs: cleanup xfs_iunpin_wait/xfs_iunpin_nowait Remove the inode item pointer and ili_last_lsn checks in __xfs_iunpin_wait as any pinned inode is guaranteed to have them valid. After this the xfs_iunpin_nowait case is nothing more than a xfs_log_force_lsn, as we know that the caller has already checked the pincount. Make xfs_iunpin_nowait the new low-level routine just doing the log force and rewrite xfs_iunpin_wait around it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:50 -06:00
Christoph Hellwig	d7658d487f	xfs: kill xfs_lrw.h Move the two declarations to better fitting headers now that xfs_lrw.c is gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:44 -06:00
Christoph Hellwig	d7e84f4137	xfs: factor common xfs_trans_bjoin code Most of xfs_trans_bjoin is duplicated in xfs_trans_get_buf, xfs_trans_getsb and xfs_trans_read_buf. Add a new _xfs_trans_bjoin which can be called by all four functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:37 -06:00
Christoph Hellwig	35a8a72f06	xfs: stop passing opaque handles to xfs_log.c routines Currenly we pass opaque xfs_log_ticket_t handles instead of struct xlog_ticket pointers, and void pointers instead of struct xlog_in_core pointers to various log manager functions. Instead pass properly typed pointers after adding forward declarations for them to xfs_log.h, and adjust the touched function prototypes to the standard XFS style while at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:32 -06:00
Christoph Hellwig	c467c049e7	xfs: split xfs_bmap_btalloc Split out the nullfb case into a separate function to reduce the stack footprint and make the code more readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:25 -06:00
Christoph Hellwig	f7008d0aeb	xfs: fix xfs_fsblock_t tracing Using a static buffer in xfs_fmtfsblock means we can corrupt traces if multiple CPUs hit this code path at the same. Just remove xfs_fmtfsblock for now and print the block number purely numerical. If we want the NULLFSBLOCK and NULLSTARTBLOCK formatting back the best way would be a decoding plugin in the trace-cmd userspace command. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:17 -06:00
Christoph Hellwig	024910cbac	xfs: fix inode pincount check in fsync We need to hold the ilock to check the inode pincount safely. While we're at it also remove the check for ip->i_itemp->ili_last_lsn, a pinned inode always has it set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:10 -06:00
Dave Chinner	77d7a0c2ee	xfs: Non-blocking inode locking in IO completion The introduction of barriers to loop devices has created a new IO order completion dependency that XFS does not handle. The loop device implements barriers using fsync and so turns a log IO in the XFS filesystem on the loop device into a data IO in the backing filesystem. That is, the completion of log IOs in the loop filesystem are now dependent on completion of data IO in the backing filesystem. This can cause deadlocks when a flush daemon issues a log force with an inode locked because the IO completion of IO on the inode is blocked by the inode lock. This in turn prevents further data IO completion from occuring on all XFS filesystems on that CPU (due to the shared nature of the completion queues). This then prevents the log IO from completing because the log is waiting for data IO completion as well. The fix for this new completion order dependency issue is to make the IO completion inode locking non-blocking. If the inode lock can't be grabbed, simply requeue the IO completion back to the work queue so that it can be processed later. This prevents the completion queue from being blocked and allows data IO completion on other inodes to proceed, hence avoiding completion order dependent deadlocks. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:52 -06:00
Christoph Hellwig	66d834ea60	xfs: implement optimized fdatasync Allow us to track the difference between timestamp and size updates by using mark_inode_dirty from the I/O completion code, and checking the VFS inode flags in xfs_file_fsync. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:45 -06:00
Christoph Hellwig	fd3200bef7	xfs: remove wrapper for the fsync file operation Currently the fsync file operation is divided into a low-level routine doing all the work and one that implements the Linux file operation and does minimal argument wrapping. This is a leftover from the days of the vnode operations layer and can be removed to simplify the code a bit, as well as preparing for the implementation of an optimized fdatasync which needs to look at the Linux inode state. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:38 -06:00
Christoph Hellwig	00258e36b2	xfs: remove wrappers for read/write file operations Currently the aio_read, aio_write, splice_read and splice_write file operations are divided into a low-level routine doing all the work and one that implements the Linux file operations and does minimal argument wrapping. This is a leftover from the days of the vnode operations layer and can be removed to simplify the code a lot. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:29 -06:00
Christoph Hellwig	dda35b8f84	xfs: merge xfs_lrw.c into xfs_file.c Currently the code to implement the file operations is split over two small files. Merge the content of xfs_lrw.c into xfs_file.c to have it in one place. Note that I haven't done various cleanups that are possible after this yet, they will follow in the next patch. Also the function xfs_dev_is_read_only which was in xfs_lrw.c before really doesn't fit in here at all and was moved to xfs_mount.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:18 -06:00
Christoph Hellwig	b262e5dfd9	xfs: fix dquota trace format The be32_to_cpu in the TP_printk output breaks automatic parsing of the trace format by the trace-cmd tools, so we have to move it into the TP_assign block. While we're at it also fix the format for the quota limits to more regular and easier parseable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:11 -06:00
Eric Sandeen	a9cc799eca	xfs: increase readdir buffer size While doing some testing of readdir perf a while back, I noticed that the buffer size we're using internally is smaller than what glibc gives us by default. Upping this size helped a bit, and seems safe. glibc's __alloc_dir() does: const size_t default_allocation = (4 * BUFSIZ < sizeof (struct dirent64) ? sizeof (struct dirent64) : 4 * BUFSIZ); const size_t small_allocation = (BUFSIZ < sizeof (struct dirent64) ? sizeof (struct dirent64) : BUFSIZ); size_t allocation = default_allocation; #ifdef _STATBUF_ST_BLKSIZE if (statp != NULL && default_allocation < statp->st_blksize) allocation = statp->st_blksize; #endif and #define _G_BUFSIZ 8192 #define _IO_BUFSIZ _G_BUFSIZ # define BUFSIZ _IO_BUFSIZ so the default buffer is 4 * 8192 = 32768 (except in the unlikely case of blocks > 32k....) Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:33:41 -06:00
Linus Torvalds	b305956abc	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (52 commits) fs/xfs: Correct NULL test xfs: optimize log flushing in xfs_fsync xfs: only clear the suid bit once in xfs_write xfs: kill xfs_bawrite xfs: log changed inodes instead of writing them synchronously xfs: remove invalid barrier optimization from xfs_fsync xfs: kill the unused XFS_QMOPT_* flush flags V2 xfs: Use delay write promotion for dquot flushing xfs: Sort delayed write buffers before dispatch xfs: Don't issue buffer IO direct from AIL push V2 xfs: Use delayed write for inodes rather than async V2 xfs: Make inode reclaim states explicit xfs: more reserved blocks fixups xfs: turn off sign warnings xfs: don't hold onto reserved blocks on remount,ro xfs: quota limit statvfs available blocks xfs: replace KM_LARGE with explicit vmalloc use xfs: cleanup up xfs_log_force calling conventions xfs: kill XLOG_VEC_SET_TYPE xfs: remove duplicate buffer flags ...	2010-02-26 17:18:52 -08:00
Linus Torvalds	f24407d2bd	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/xfs-vipt * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/xfs-vipt: xfs: fix xfs to work with Virtually Indexed architectures sh: add mm API for DMA to vmalloc/vmap areas arm: add mm API for DMA to vmalloc/vmap areas parisc: add mm API for DMA to vmalloc/vmap areas mm: add coherence API for DMA to vmalloc/vmap areas	2010-02-26 17:05:10 -08:00
Ben Myers	978ebd97d1	xfs_export_operations.commit_metadata This is the commit_metadata export operation for XFS. - Takes one inode to be committed. - Forces the log up to the lsn of the inode. - Doesn't force the log if the inode doesn't have a pincount. Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> [bfields@citi.umich.edu: trivial whitespace fix] Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2010-02-20 13:14:50 -08:00
Tejun Heo	003cb608a2	percpu: add __percpu sparse annotations to fs Add __percpu sparse annotations to fs. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Alex Elder <aelder@sgi.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk>	2010-02-17 11:17:38 +09:00
Julia Lawall	d67b1b0325	fs/xfs: Correct NULL test Test the value that was just allocated rather than the previously tested one. A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r@ expression x; expression e; identifier l; @@ if (x == NULL \|\| ...) { ... when forall return ...; } ... when != goto l; when != x = e when != &x x == NULL // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-02-13 13:22:53 -06:00
Christoph Hellwig	180040b89e	xfs: optimize log flushing in xfs_fsync If we have a pinned inode it must have a log item attached to it. Usually that log item will have ili_last_lsn already set, in which case we only need to flush the log up to that LSN instead of doing a full log force. This gives speedups of about 5% in some fsync heavy workloads. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-02-12 13:45:14 -06:00
Christoph Hellwig	87185517de	xfs: only clear the suid bit once in xfs_write file_remove_suid already calls into ->setattr to clear the suid and sgid bits if needed, no need to start a second transaction to do it ourselves. Note that xfs_write_clear_setuid issues a sync transaction while the path through ->setattr doesn't, but that is consistant with the other filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-02-12 13:43:57 -06:00
James Bottomley	73c77e2ccc	xfs: fix xfs to work with Virtually Indexed architectures xfs_buf.c includes what is essentially a hand rolled version of blk_rq_map_kern(). In order to work properly with the vmalloc buffers that xfs uses, this hand rolled routine must also implement the flushing API for vmap/vmalloc areas. [style updates from hch@lst.de] Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2010-02-05 12:32:35 -06:00
Dave Chinner	5322892d86	xfs: kill xfs_bawrite There are no more users of this function left in the XFS code now that we've switched everything to delayed write flushing. Remove it. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-02-04 10:09:14 +11:00
Christoph Hellwig	07fec73625	xfs: log changed inodes instead of writing them synchronously When an inode has already be flushed delayed write, xfs_inode_clean() returns true and hence xfs_fs_write_inode() can return on a synchronous inode write without having written the inode. Currently these sycnhronous writes only come sync(1), unmount, a sycnhronous NFS export and cachefiles so should be relatively rare and out of common performance paths. Realistically, a synchronous inode write is not necessary here; we can avoid writing the inode by logging any non-transactional changes that are pending. This needs to be done with synchronous transactions, but it avoids seeking between the log and inode clusters as we do now. We don't force the log if the inode is pinned, though, so this differs from the fsync case. For normal sys_sync and unmount behaviour this is fine because we do a synchronous log force in xfs_sync_data which is called from the ->sync_fs code. It does however break the NFS synchronous export guarantees for now, but work is under way to fix this at a higher level or for the higher level to provide an additional flag in the writeback control to tell us that a log force is needed. Portions of this patch are based on work from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-02-09 11:43:49 +11:00
Christoph Hellwig	e8b217e753	xfs: remove invalid barrier optimization from xfs_fsync We always need to flush the disk write cache and can't skip it just because the no inode attributes have changed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-02-02 10:16:26 +11:00
Dave Chinner	20026d9201	xfs: kill the unused XFS_QMOPT_* flush flags V2 dquots are never flushed asynchronously. Remove the flag and the async write support from the flush function. Make the default flush a delwri flush to make the inode flush code, which leaves the XFS_QMOPT_SYNC the only flag remaining. Convert that to use SYNC_WAIT instead, just like the inode flush code. V2: - just pass flush flags straight through Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-02-04 09:48:58 +11:00
Dave Chinner	7d6a7bde52	xfs: Use delay write promotion for dquot flushing xfs_qm_dqflock_pushbuf_wait() does a very similar trick to item pushing used to do to flush out delayed write dquot buffers. Change it to use the new promotion method rather than an async flush. Also, xfs_qm_dqflock_pushbuf_wait() can return without the flush lock held, yet the callers make the assumption that after this call the flush lock is held. Always return with the flush lock held. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-26 15:13:41 +11:00
Dave Chinner	089716aa14	xfs: Sort delayed write buffers before dispatch Currently when the xfsbufd writes delayed write buffers, it pushes them to disk in the order they come off the delayed write list. If there are lots of buffers ѕpread widely over the disk, this results in overwhelming the elevator sort queues in the block layer and we end up losing the posibility of merging adjacent buffers to minimise the number of IOs. Use the new generic list_sort function to sort the delwri dispatch queue before issue to ensure that the buffers are pushed in the most friendly order possible to the lower layers. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-26 15:13:25 +11:00
Dave Chinner	d808f617ad	xfs: Don't issue buffer IO direct from AIL push V2 All buffers logged into the AIL are marked as delayed write. When the AIL needs to push the buffer out, it issues an async write of the buffer. This means that IO patterns are dependent on the order of buffers in the AIL. Instead of flushing the buffer, promote the buffer in the delayed write list so that the next time the xfsbufd is run the buffer will be flushed by the xfsbufd. Return the state to the xfsaild that the buffer was promoted so that the xfsaild knows that it needs to cause the xfsbufd to run to flush the buffers that were promoted. Using the xfsbufd for issuing the IO allows us to dispatch all buffer IO from the one queue. This means that we can make much more enlightened decisions on what order to flush buffers to disk as we don't have multiple places issuing IO. Optimisations to xfsbufd will be in a future patch. Version 2 - kill XFS_ITEM_FLUSHING as it is now unused. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-02-02 10:13:42 +11:00
Dave Chinner	c854363e80	xfs: Use delayed write for inodes rather than async V2 We currently do background inode flush asynchronously, resulting in inodes being written in whatever order the background writeback issues them. Not only that, there are also blocking and non-blocking asynchronous inode flushes, depending on where the flush comes from. This patch completely removes asynchronous inode writeback. It removes all the strange writeback modes and replaces them with either a synchronous flush or a non-blocking delayed write flush. That is, inode flushes will only issue IO directly if they are synchronous, and background flushing may do nothing if the operation would block (e.g. on a pinned inode or buffer lock). Delayed write flushes will now result in the inode buffer sitting in the delwri queue of the buffer cache to be flushed by either an AIL push or by the xfsbufd timing out the buffer. This will allow accumulation of dirty inode buffers in memory and allow optimisation of inode cluster writeback at the xfsbufd level where we have much greater queue depths than the block layer elevators. We will also get adjacent inode cluster buffer IO merging for free when a later patch in the series allows sorting of the delayed write buffers before dispatch. This effectively means that any inode that is written back by background writeback will be seen as flush locked during AIL pushing, and will result in the buffers being pushed from there. This writeback path is currently non-optimal, but the next patch in the series will fix that problem. A side effect of this delayed write mechanism is that background inode reclaim will no longer directly flush inodes, nor can it wait on the flush lock. The result is that inode reclaim must leave the inode in the reclaimable state until it is clean. Hence attempts to reclaim a dirty inode in the background will simply skip the inode until it is clean and this allows other mechanisms (i.e. xfsbufd) to do more optimal writeback of the dirty buffers. As a result, the inode reclaim code has been rewritten so that it no longer relies on the ambiguous return values of xfs_iflush() to determine whether it is safe to reclaim an inode. Portions of this patch are derived from patches by Christoph Hellwig. Version 2: - cleanup reclaim code as suggested by Christoph - log background reclaim inode flush errors - just pass sync flags to xfs_iflush Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-02-06 12:39:36 +11:00
Dave Chinner	777df5afdb	xfs: Make inode reclaim states explicit A.K.A.: don't rely on xfs_iflush() return value in reclaim We have gradually been moving checks out of the reclaim code because they are duplicated in xfs_iflush(). We've had a history of problems in this area, and many of them stem from the overloading of the return values from xfs_iflush() and interaction with inode flush locking to determine if the inode is safe to reclaim. With the desire to move to delayed write flushing of inodes and non-blocking inode tree reclaim walks, the overloading of the return value of xfs_iflush makes it very difficult to determine the correct thing to do next. This patch explicitly re-adds the checks to the inode reclaim code, removing the reliance on the return value of xfs_iflush() to determine what to do next. It also means that we can clearly document all the inode states that reclaim must handle and hence we can easily see that we handled all the necessary cases. This also removes the need for the xfs_inode_clean() check in xfs_iflush() as all callers now check this first (safely). Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-02-06 12:37:26 +11:00
Eric Sandeen	d5db0f97fb	xfs: more reserved blocks fixups This mangles the reserved blocks counts a little more. 1) add a helper function for the default reserved count 2) add helper functions to save/restore counts on ro/rw 3) save/restore reserved blocks on freeze/thaw 4) disallow changing reserved count while readonly V2: changed field name to match Dave's changes Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-02-08 17:41:48 -06:00
Dave Chinner	388f1f0c34	xfs: turn off sign warnings Because they cause warnings in static inline functions conditionally compiled into XFS from the VFS (e.g. fsnotify). Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-26 15:10:15 +11:00
Dave Chinner	cbe132a8bd	xfs: don't hold onto reserved blocks on remount,ro If we hold onto reserved blocks when doing a remount,ro we end up writing the blocks used count to disk that includes the reserved blocks. Reserved blocks are not actually used, so this results in the values in the superblock being incorrect. Hence if we run xfs_check or xfs_repair -n while the filesystem is mounted remount,ro we end up with an inconsistent filesystem being reported. Also, running xfs_copy on the remount,ro filesystem will result in an inconsistent image being generated. To fix this, unreserve the blocks when doing the remount,ro, and reserved them again on remount,rw. This way a remount,ro filesystem will appear consistent on disk to all utilities. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-26 15:08:49 +11:00
Christoph Hellwig	9b00f30762	xfs: quota limit statvfs available blocks A "df" run on an NFS client of an exported XFS file system reports the wrong information for "available" blocks. When a block quota is enforced, the amount reported as free is limited by the quota, but the amount reported available is not (and should be). Reported-by: Guk-Bong, Kwon <gbkwon@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-21 16:34:23 -06:00
Christoph Hellwig	bdfb04301f	xfs: replace KM_LARGE with explicit vmalloc use We use the KM_LARGE flag to make kmem_alloc and friends use vmalloc if necessary. As we only need this for a few boot/mount time allocations just switch to explicit vmalloc calls there. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-21 13:44:56 -06:00
Christoph Hellwig	a14a348bff	xfs: cleanup up xfs_log_force calling conventions Remove the XFS_LOG_FORCE argument which was always set, and the XFS_LOG_URGE define, which was never used. Split xfs_log_force into a two helpers - xfs_log_force which forces the whole log, and xfs_log_force_lsn which forces up to the specified LSN. The underlying implementations already were entirely separate, as were the users. Also re-indent the new _xfs_log_force/_xfs_log_force which previously had a weird coding style. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-21 13:44:49 -06:00
Christoph Hellwig	4139b3b337	xfs: kill XLOG_VEC_SET_TYPE This macro only obsfucates the log item type assignments, so kill it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-21 13:44:43 -06:00
Christoph Hellwig	0cadda1c5f	xfs: remove duplicate buffer flags Currently we define aliases for the buffer flags in various namespaces, which only adds confusion. Remove all but the XBF_ flags to clean this up a bit. Note that we still abuse XFS_B_ASYNC/XBF_ASYNC for some non-buffer uses, but I'll clean that up later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-21 13:44:36 -06:00
Christoph Hellwig	a210c1aa7f	xfs: implement quota warnings via netlink Wire up quota_send_warning to send quota warnings over netlink. This is used by various desktops to show user quota warnings. Tested by running the quota_nld daemon while running the xfstest quota tests and observing the warnings. I'll see how I can get a more formal testcase for it written. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-21 13:44:28 -06:00
Christoph Hellwig	4d1f88d75b	xfs: clean up error handling in xfs_trans_dqresv Move the error code selection after the goto label and fold the xfs_quota_error helper into it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-21 13:44:22 -06:00
Christoph Hellwig	512dd1abd9	xfs: kill XFS_QMOPT_ASYNC The option is unused and one of the few remaining users of xfs_bawrite, so let's get rid of it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-21 13:44:10 -06:00
Dave Chinner	587aa0feb7	xfs: rearrange xfs_mod_sb() to avoid array subscript warning gcc warns of an array subscript out of bounds in xfs_mod_sb(). The code is written in such a way that if the array subscript is out of bounds, then it will assert fail. Rearrange the code to avoid the bounds check warning. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-20 12:04:53 +11:00
Dave Chinner	f0a0eaa8da	xfs: suppress spurious uninitialised var warning in xfs_bmapi() Initialise the xfs_bmalloca_t structure to zero to avoid uninitialised variable warnings. This is done by zeroing the arg structure rather than using the uninitialised_var() trick so we know for certain that the structure is correctly initialised as xfs_bmapi is a very complex function and it is difficult to prove warnings are spurious. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-20 10:50:06 +11:00
Dave Chinner	58c75cfb51	xfs: make compile warn about char sign mismatches again The -fno-unsigned-char directive has no effect anymore as the XFs build is clean. However, the kernel build hides pointer sign differences so turn that back on so that we can clean up all the mismatches prior to a userspace code resync. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-20 10:49:18 +11:00
Dave Chinner	4a24cb7140	xfs: clean up sign warnings in dir2 code We are now consistently using unsigned char strings for names so fix up the remaining warnings in the dir2 code to complete the cleanup. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-20 10:48:05 +11:00
Dave Chinner	a9273ca5c6	xfs: convert attr to use unsigned names To be consistent with the directory code, the attr code should use unsigned names. Convert the names from the vfs at the highest level to unsigned, and ænsure they are consistenly used as unsigned down to disk. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-20 10:47:48 +11:00
Dave Chinner	b9c4864957	xfs: xfs_buf_iomove() doesn't care about signedness xfs_buf_iomove() uses xfs_caddr_t as it's parameter types, but it doesn't care about the signedness of the variables as it is just copying the data. Change the prototype to use void * so that we don't get sign warnings at call sites. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-20 10:47:39 +11:00
Dave Chinner	a3380ae39f	xfs: make xfs_dir_cilookup_result use unsigned char For consistency with the result of the code. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-20 10:47:25 +11:00
Dave Chinner	2bc754213d	xfs: convert dirnameops to unsigned char names To be consistent across the codebase, convert the dirnameops to pass the directory names by unsigned char strings. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-20 10:47:17 +11:00
Dave Chinner	046ea75313	xfs: convert DM ops to use unsigned char names dmops uses a signed char for it's namespace event. To be consistent with the rest of the code, convert them to unsigned char for the namespace string. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-20 10:47:08 +11:00
Dave Chinner	e2bcd936eb	xfs: directory names are unsigned Convert the struct xfs_name to use unsigned chars for the name strings to match both what is stored on disk (__uint8_t) and what the VFS expects (unsigned char). Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-20 10:44:58 +11:00
Christoph Hellwig	4e23471a3f	xfs: move more buffer helpers into xfs_buf.c Move xfsbdstrat and xfs_bdstrat_cb from xfs_lrw.c and xfs_bioerror and xfs_bioerror_relse from xfs_rw.c into xfs_buf.c. This also means xfs_bioerror and xfs_bioerror_relse can be marked static now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:35:17 -06:00
Christoph Hellwig	64e0bc7d2a	xfs: clean up xfs_bwrite Fold XFS_bwrite into it's only caller, xfs_bwrite and move it into xfs_buf.c instead of leaving it as a fairly large inline function. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:35:07 -06:00
Christoph Hellwig	873ff5501d	xfs: clean up log buffer writes Don't bother using XFS_bwrite as it doesn't provide much code for our use case. Instead opencode it and fold xlog_bdstrat_cb into the new xlog_bdstrat helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:34:54 -06:00
Dave Chinner	e57336ff7f	xfs: embed the pagb_list array in the perag structure Now that the perag structure is allocated memory rather than held in an array, we don't need to have the busy extent array external to the structure. Embed it into the perag structure to avoid needing an extra allocation when setting up. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:34:39 -06:00
Dave Chinner	8b26c5825e	xfs: handle ENOMEM correctly during initialisation of perag structures Add proper error handling in case an error occurs while initializing new perag structures for a mount point. The mount structure is restored to its previous state by deleting and freeing any perag structures added during the call. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:34:30 -06:00
Dave Chinner	b657fc82a3	xfs: Kill filestreams cache flush The filestreams cache flush is not needed in the sync code as it does not affect data writeback, and it is now not used by the growfs code, either, so kill it. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:34:22 -06:00
Dave Chinner	0fa800fbd5	xfs: Add trace points for per-ag refcount debugging. Uninline xfs_perag_{get,put} so that tracepoints can be inserted into them to speed debugging of reference count problems. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:34:12 -06:00
Dave Chinner	aed3bb90ab	xfs: Reference count per-ag structures Reference count the per-ag structures to ensure that we keep get/put pairs balanced. Assert that the reference counts are zero at unmount time to catch leaks. In future, reference counts will enable us to safely remove perag structures by allowing us to detect when they are no longer in use. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:34:04 -06:00
Dave Chinner	1c1c6ebcf5	xfs: Replace per-ag array with a radix tree The use of an array for the per-ag structures requires reallocation of the array when growing the filesystem. This requires locking access to the array to avoid use after free situations, and the locking is difficult to get right. To avoid needing to reallocate an array, change the per-ag structures to an allocated object per ag and index them using a tree structure. The AGs are always densely indexed (hence the use of an array), but the number supported is 2^32 and lookups tend to be random and hence indexing needs to scale. A simple choice is a radix tree - it works well with this sort of index. This change also removes another large contiguous allocation from the mount/growfs path in XFS. The growing process now needs to change to only initialise the new AGs required for the extra space, and as such only needs to exclusively lock the tree for inserts. The rest of the code only needs to lock the tree while doing lookups, and hence this will remove all the deadlocks that currently occur on the m_perag_lock as it is now an innermost lock. The lock is also changed to a spinlock from a read/write lock as the hold time is now extremely short. To complete the picture, the per-ag structures will need to be reference counted to ensure that we don't free/modify them while they are still in use. This will be done in subsequent patch. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:33:52 -06:00
Dave Chinner	44b56e0a1a	xfs: convert remaining direct references to m_perag Convert the remaining direct lookups of the per ag structures to use get/put accesses. Ensure that the loops across AGs and prior users of the interface balance gets and puts correctly. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:33:39 -06:00
Dave Chinner	4196ac08c0	xfs: Convert filestreams code to use per-ag get/put routines Use xfs_perag_get() and xfs_perag_put() in the filestreams code. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:33:22 -06:00
Dave Chinner	a862e0fdcb	xfs: Don't directly reference m_perag in allocation code Start abstracting the perag references so that the indexing of the structures is not directly coded into all the places that uses the perag structures. This will allow us to separate the use of the perag structure and the way it is indexed and hence avoid the known deadlocks related to growing a busy filesystem. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:33:12 -06:00
Dave Chinner	5017e97d52	xfs: rename xfs_get_perag xfs_get_perag is really getting the perag that an inode belongs to based on it's inode number. Convert the use of this function to just get the perag from a provided ag number. Use this new function to obtain the per-ag structure when traversing the per AG inode trees for sync and reclaim. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:33:02 -06:00
Dave Chinner	c9c129714e	xfs: Don't wake xfsbufd when idle The xfsbufd wakes every xfsbufd_centisecs (once per second by default) for each filesystem even when the filesystem is idle. If the xfsbufd has nothing to do, put it into a long term sleep and only wake it up when there is work pending (i.e. dirty buffers to flush soon). This will make laptop power misers happy. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:32:54 -06:00
Dave Chinner	453eac8a9a	xfs: Don't wake the aild once per second Now that the AIL push algorithm is traversal safe, we don't need a watchdog function in the xfsaild to catch pushes that fail to make progress. Remove the watchdog timeout and make pushes purely driven by demand. This will remove the once-per-second wakeup that is seen when the filesystem is idle and make laptop power misers happy. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:32:46 -06:00
Dave Chinner	f0a7695380	xfs: Use list_heads for log recovery item lists Remove the roll-your-own linked list operations. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:31:51 -06:00
Eric Sandeen	5d77c0dc0c	xfs: make several more functions static Just minor housekeeping, a lot more functions can be trivially made static; others could if we reordered things a bit... Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:31:38 -06:00
Dave Chinner	6bded0f383	xfs: clean up inconsistent variable naming in xfs_swap_extent The swap extent ioctl passes in a target inode and a temporary inode which are clearly named in the ioctl structure. The code then assigns temp to target and vice versa, making it extremely difficult to work out which inode is which later in the code. Make this consistent throughout the code. Also make xfs_swap_extent static as there are no external users of the function. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:31:23 -06:00
Dave Chinner	3a85cd96d3	xfs: add tracing to xfs_swap_extents To be able to diagnose whether the swap extents function is detecting compatible inode data fork configurations for swapping extents, add tracing points to the code to allow us to see the format of the inode forks before and after the swap. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:20:06 -06:00
Dave Chinner	e09f98606d	xfs: xfs_swap_extents needs to handle dynamic fork offsets When swapping extents, we can corrupt inodes by swapping data forks that are in incompatible formats. This is caused by the two indoes having different fork offsets due to the presence of an attribute fork on an attr2 filesystem. xfs_fsr tries to be smart about setting the fork offset, but the trick it plays only works on attr1 (old fixed format attribute fork) filesystems. Changing the way xfs_fsr sets up the attribute fork will prevent this situation from ever occurring, so in the kernel code we can get by with a preventative fix - check that the data fork in the defragmented inode is in a format valid for the inode it is being swapped into. This will lead to files that will silently and potentially repeatedly fail defragmentation, so issue a warning to the log when this particular failure occurs to let us know that xfs_fsr needs updating/fixing. To help identify how to improve xfs_fsr to avoid this issue, add trace points for the inodes being swapped so that we can determine why the swap was rejected and to confirm that the code is making the right decisions and modifications when swapping forks. A further complication is even when the swap is allowed to proceed when the fork offset is different between the two inodes then value for the maximum number of extents the data fork can hold can be wrong. Make sure these are also set correctly after the swap occurs. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 13:49:07 -06:00
Dave Chinner	3daeb42c13	xfs: fix missing error check in xfs_rtfree_range When xfs_rtfind_forw() returns an error, the block is returned uninitialised. xfs_rtfree_range() is not checking the error return, so could be using an uninitialised block number for modifying bitmap summary info. The problem was found by gcc when compiling the userspace libxfs code - it is an copy of the kernel code with the exact same bug. gcc gives an uninitialised variable warning on the userspace code but not on the kernel code. You gotta love the consistency (Mmmm, slightly chewy today!). Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 13:46:19 -06:00
Dave Chinner	4b6a46882c	xfs: fix stale inode flush avoidance When reclaiming stale inodes, we need to guarantee that inodes are unpinned before returning with a "clean" status. If we don't we can reclaim inodes that are pinned, leading to use after free in the transaction subsystem as transactions complete. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 13:46:02 -06:00
Dave Chinner	126976c7c1	xfs: Remove inode iolock held check during allocation lockdep complains about a the lock not being initialised as we do an ASSERT based check that the lock is not held before we initialise it to catch inodes freed with the lock held. lockdep does this check for us in the lock initialisation code, so remove the ASSERT to stop the lockdep warning. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 13:45:33 -06:00
Dave Chinner	57817c6822	xfs: reclaim all inodes by background tree walks We cannot do direct inode reclaim without taking the flush lock to ensure that we do not reclaim an inode under IO. We check the inode is clean before doing direct reclaim, but this is not good enough because the inode flush code marks the inode clean once it has copied the in-core dirty state to the backing buffer. It is the flush lock that determines whether the inode is still under IO, even though it is marked clean, and the inode is still required at IO completion so we can't reclaim it even though it is clean in core. Hence the requirement that we need to take the flush lock even on clean inodes because this guarantees that the inode writeback IO has completed and it is safe to reclaim the inode. With delayed write inode flushing, we coul dend up waiting a long time on the flush lock even for a clean inode. The background reclaim already handles this efficiently, so avoid all the problems by killing the direct reclaim path altogether. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 13:44:44 -06:00
Dave Chinner	018027be90	xfs: Avoid inodes in reclaim when flushing from inode cache The reclaim code will handle flushing of dirty inodes before reclaim occurs, so avoid them when determining whether an inode is a candidate for flushing to disk when walking the radix trees. This is based on a test patch from Christoph Hellwig. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 13:44:21 -06:00
Dave Chinner	c8e20be020	xfs: reclaim inodes under a write lock Make the inode tree reclaim walk exclusive to avoid races with concurrent sync walkers and lookups. This is a version of a patch posted by Christoph Hellwig that avoids all the code duplication. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 13:43:55 -06:00
Dave Chinner	fd45e47841	xfs: Ensure we force all busy extents in range to disk When we search for and find a busy extent during allocation we force the log out to ensure the extent free transaction is on disk before the allocation transaction. The current implementation has a subtle bug in it--it does not handle multiple overlapping ranges. That is, if we free lots of little extents into a single contiguous extent, then allocate the contiguous extent, the busy search code stops searching at the first extent it finds that overlaps the allocated range. It then uses the commit LSN of the transaction to force the log out to. Unfortunately, the other busy ranges might have more recent commit LSNs than the first busy extent that is found, and this results in xfs_alloc_search_busy() returning before all the extent free transactions are on disk for the range being allocated. This can lead to potential metadata corruption or stale data exposure after a crash because log replay won't replay all the extent free transactions that cover the allocation range. Modified-by: Alex Elder <aelder@sgi.com> (Dropped the "found" argument from the xfs_alloc_busysearch trace event.) Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-10 12:22:02 -06:00
Dave Chinner	44e08c45cc	xfs: Don't flush stale inodes Because inodes remain in cache much longer than inode buffers do under memory pressure, we can get the situation where we have stale, dirty inodes being reclaimed but the backing storage has been freed. Hence we should never, ever flush XFS_ISTALE inodes to disk as there is no guarantee that the backing buffer is in cache and still marked stale when the flush occurs. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-10 12:22:00 -06:00
Christoph Hellwig	d6d59bada3	xfs: fix timestamp handling in xfs_setattr We currently have some rather odd code in xfs_setattr for updating the a/c/mtime timestamps: - first we do a non-transaction update if all three are updated together - second we implicitly update the ctime for various changes instead of relying on the ATTR_CTIME flag - third we set the timestamps to the current time instead of the arguments in the iattr structure in many cases. This patch makes sure we update it in a consistent way: - always transactional - ctime is only updated if ATTR_CTIME is set or we do a size update, which is a special case - always to the times passed in from the caller instead of the current time The only non-size caller of xfs_setattr that doesn't come from the VFS is updated to set ATTR_CTIME and pass in a valid ctime value. Reported-by: Eric Blake <ebb9@byu.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-10 12:21:58 -06:00
Christoph Hellwig	ea9a48881e	xfs: use DECLARE_EVENT_CLASS Using DECLARE_EVENT_CLASS allows us to to use trace event code instead of duplicating it in the binary. This was not available before 2.6.33 so it had to be done as a separate step once the prerequisite was merged. This only requires changes to xfs_trace.h and the results are rather impressive: hch@brick:~/work/linux-2.6/obj-kvm$ size fs/xfs/xfs.o* text data bss dec hex filename 607732 41884 3616 653232 9f7b0 fs/xfs/xfs.o 1026732 41884 3808 1072424 105d28 fs/xfs/xfs.o.old Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-10 12:21:56 -06:00
Dave Chinner	a539bd8c86	xfs: kill some warnings on i386 builds Randy Dunlap Reported printk() format-related warnings reported on i386 builds in his environment. Dave Chinner provided this patch to eliminate them. Signed-off by: Dave Chinner <david@fromorbit.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-08 13:32:29 -06:00
Christoph Hellwig	eaff8079d4	kill I_LOCK After I_SYNC was split from I_LOCK the leftover is always used together with I_NEW and thus superflous. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-12-17 11:03:25 -05:00
Linus Torvalds	bea4c899f2	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: XFS: Free buffer pages array unconditionally xfs: kill xfs_bmbt_rec_32/64 types xfs: improve metadata I/O merging in the elevator xfs: check for not fully initialized inodes in xfs_ireclaim	2009-12-16 13:29:39 -08:00
Dave Chinner	3fc98b1ac0	XFS: Free buffer pages array unconditionally The code in xfs_free_buf() only attempts to free the b_pages array if the buffer is a page cache backed or page allocated buffer. The extra log buffer that is used when the log wraps uses pages that are allocated to a different log buffer, but it still has a b_pages array allocated when those pages are associated to with the extra buffer in xfs_buf_associate_memory. Hence we need to always attempt to free the b_pages array when tearing down a buffer, not just on buffers that are explicitly marked as page bearing buffers. This fixes a leak detected by the kernel memory leak code. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-16 13:41:20 -06:00
Christoph Hellwig	a5f9be58c2	xfs: kill xfs_bmbt_rec_32/64 types For a long time we've always stored bmap btree records in the 64bit format, so kill off the dead 32bit type, and make sure the 64bit type is named just xfs_bmbt_rec everywhere, without any size postfix. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-16 13:41:20 -06:00
Dave Chinner	2ee1abad73	xfs: improve metadata I/O merging in the elevator Change all async metadata buffers to use [READ\|WRITE]_META I/O types so that the I/O doesn't get issued immediately. This allows merging of adjacent metadata requests but still prioritises them over bulk data. This shows a 10-15% improvement in sequential create speed of small files. Don't include the log buffers in this classification - leave them as sync types so they are issued immediately. Signed-off-by: Dave Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-16 13:41:19 -06:00
Christoph Hellwig	b44b112627	xfs: check for not fully initialized inodes in xfs_ireclaim Add an assert for inodes not added to the inode cache in xfs_ireclaim, to make sure we're not going to introduce something like the famous nfsd inode cache bug again. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-16 13:20:15 -06:00
Christoph Hellwig	1e431f5ce7	cleanup blockdev_direct_IO locking Currently the locking in blockdev_direct_IO is a mess, we have three different locking types and very confusing checks for some of them. The most complicated one is DIO_OWN_LOCKING for reads, which happens to not actually be used. This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read case is unused anyway, and the write side is almost identical to DIO_NO_LOCKING. The difference is that DIO_NO_LOCKING always sets the create argument for the get_blocks callback to zero, but we can easily move that to the actual get_blocks callbacks. There are four users of the DIO_NO_LOCKING mode: gfs already ignores the create argument and thus is fine with the new version, ocfs2 only errors out if create were ever set, and we can remove this dead code now, the block device code only ever uses create for an error message if we are fully beyond the device which can never happen, and last but not least XFS will need the new behavour for writes. Now we can replace the lock_type variable with a flags one, where no flag means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first flag. Separate out the check for not allowing to fill holes into a separate flag, although for now both flags always get set at the same time. Also revamp the documentation of the locking scheme to actually make sense. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-12-16 12:16:49 -05:00
Christoph Hellwig	431547b3c4	sanitize xattr handler prototypes Add a flags argument to struct xattr_handler and pass it to all xattr handler methods. This allows using the same methods for multiple handlers, e.g. for the ACL methods which perform exactly the same action for the access and default ACLs, just using a different underlying attribute. With a little more groundwork it'll also allow sharing the methods for the regular user/trusted/secure handlers in extN, ocfs2 and jffs2 like it's already done for xfs in this patch. Also change the inode argument to the handlers to a dentry to allow using the handlers mechnism for filesystems that require it later, e.g. cifs. [with GFS2 bits updated by Steven Whitehouse <swhiteho@redhat.com>] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jmorris@namei.org> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-12-16 12:16:49 -05:00
Christoph Hellwig	5fe878ae7f	direct-io: cleanup blockdev_direct_IO locking Currently the locking in blockdev_direct_IO is a mess, we have three different locking types and very confusing checks for some of them. The most complicated one is DIO_OWN_LOCKING for reads, which happens to not actually be used. This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read case is unused anyway, and the write side is almost identical to DIO_NO_LOCKING. The difference is that DIO_NO_LOCKING always sets the create argument for the get_blocks callback to zero, but we can easily move that to the actual get_blocks callbacks. There are four users of the DIO_NO_LOCKING mode: gfs already ignores the create argument and thus is fine with the new version, ocfs2 only errors out if create were ever set, and we can remove this dead code now, the block device code only ever uses create for an error message if we are fully beyond the device which can never happen, and last but not least XFS will need the new behavour for writes. Now we can replace the lock_type variable with a flags one, where no flag means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first flag. Separate out the check for not allowing to fill holes into a separate flag, although for now both flags always get set at the same time. Also revamp the documentation of the locking scheme to actually make sense. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alex Elder <aelder@sgi.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:13 -08:00
Linus Torvalds	d180ec5d34	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: event tracing support xfs: change the xfs_iext_insert / xfs_iext_remove xfs: cleanup bmap extent state macros	2009-12-15 09:12:43 -08:00
Joe Perches	03daa57cdb	fs/xfs/xfs_log_recover.c: use %pU to print UUIDs Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Alex Elder <aelder@sgi.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:33 -08:00
Christoph Hellwig	0b1b213fcf	xfs: event tracing support Convert the old xfs tracing support that could only be used with the out of tree kdb and xfsidbg patches to use the generic event tracer. To use it make sure CONFIG_EVENT_TRACING is enabled and then enable all xfs trace channels by: echo 1 > /sys/kernel/debug/tracing/events/xfs/enable or alternatively enable single events by just doing the same in one event subdirectory, e.g. echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable or set more complex filters, etc. In Documentation/trace/events.txt all this is desctribed in more detail. To reads the events do a cat /sys/kernel/debug/tracing/trace Compared to the last posting this patch converts the tracing mostly to the one tracepoint per callsite model that other users of the new tracing facility also employ. This allows a very fine-grained control of the tracing, a cleaner output of the traces and also enables the perf tool to use each tracepoint as a virtual performance counter, allowing us to e.g. count how often certain workloads git various spots in XFS. Take a look at http://lwn.net/Articles/346470/ for some examples. Also the btree tracing isn't included at all yet, as it will require additional core tracing features not in mainline yet, I plan to deliver it later. And the really nice thing about this patch is that it actually removes many lines of code while adding this nice functionality: fs/xfs/Makefile \| 8 fs/xfs/linux-2.6/xfs_acl.c \| 1 fs/xfs/linux-2.6/xfs_aops.c \| 52 - fs/xfs/linux-2.6/xfs_aops.h \| 2 fs/xfs/linux-2.6/xfs_buf.c \| 117 +-- fs/xfs/linux-2.6/xfs_buf.h \| 33 fs/xfs/linux-2.6/xfs_fs_subr.c \| 3 fs/xfs/linux-2.6/xfs_ioctl.c \| 1 fs/xfs/linux-2.6/xfs_ioctl32.c \| 1 fs/xfs/linux-2.6/xfs_iops.c \| 1 fs/xfs/linux-2.6/xfs_linux.h \| 1 fs/xfs/linux-2.6/xfs_lrw.c \| 87 -- fs/xfs/linux-2.6/xfs_lrw.h \| 45 - fs/xfs/linux-2.6/xfs_super.c \| 104 --- fs/xfs/linux-2.6/xfs_super.h \| 7 fs/xfs/linux-2.6/xfs_sync.c \| 1 fs/xfs/linux-2.6/xfs_trace.c \| 75 ++ fs/xfs/linux-2.6/xfs_trace.h \| 1369 +++++++++++++++++++++++++++++++++++++++++ fs/xfs/linux-2.6/xfs_vnode.h \| 4 fs/xfs/quota/xfs_dquot.c \| 110 --- fs/xfs/quota/xfs_dquot.h \| 21 fs/xfs/quota/xfs_qm.c \| 40 - fs/xfs/quota/xfs_qm_syscalls.c \| 4 fs/xfs/support/ktrace.c \| 323 --------- fs/xfs/support/ktrace.h \| 85 -- fs/xfs/xfs.h \| 16 fs/xfs/xfs_ag.h \| 14 fs/xfs/xfs_alloc.c \| 230 +----- fs/xfs/xfs_alloc.h \| 27 fs/xfs/xfs_alloc_btree.c \| 1 fs/xfs/xfs_attr.c \| 107 --- fs/xfs/xfs_attr.h \| 10 fs/xfs/xfs_attr_leaf.c \| 14 fs/xfs/xfs_attr_sf.h \| 40 - fs/xfs/xfs_bmap.c \| 507 +++------------ fs/xfs/xfs_bmap.h \| 49 - fs/xfs/xfs_bmap_btree.c \| 6 fs/xfs/xfs_btree.c \| 5 fs/xfs/xfs_btree_trace.h \| 17 fs/xfs/xfs_buf_item.c \| 87 -- fs/xfs/xfs_buf_item.h \| 20 fs/xfs/xfs_da_btree.c \| 3 fs/xfs/xfs_da_btree.h \| 7 fs/xfs/xfs_dfrag.c \| 2 fs/xfs/xfs_dir2.c \| 8 fs/xfs/xfs_dir2_block.c \| 20 fs/xfs/xfs_dir2_leaf.c \| 21 fs/xfs/xfs_dir2_node.c \| 27 fs/xfs/xfs_dir2_sf.c \| 26 fs/xfs/xfs_dir2_trace.c \| 216 ------ fs/xfs/xfs_dir2_trace.h \| 72 -- fs/xfs/xfs_filestream.c \| 8 fs/xfs/xfs_fsops.c \| 2 fs/xfs/xfs_iget.c \| 111 --- fs/xfs/xfs_inode.c \| 67 -- fs/xfs/xfs_inode.h \| 76 -- fs/xfs/xfs_inode_item.c \| 5 fs/xfs/xfs_iomap.c \| 85 -- fs/xfs/xfs_iomap.h \| 8 fs/xfs/xfs_log.c \| 181 +---- fs/xfs/xfs_log_priv.h \| 20 fs/xfs/xfs_log_recover.c \| 1 fs/xfs/xfs_mount.c \| 2 fs/xfs/xfs_quota.h \| 8 fs/xfs/xfs_rename.c \| 1 fs/xfs/xfs_rtalloc.c \| 1 fs/xfs/xfs_rw.c \| 3 fs/xfs/xfs_trans.h \| 47 + fs/xfs/xfs_trans_buf.c \| 62 - fs/xfs/xfs_vnodeops.c \| 8 70 files changed, 2151 insertions(+), 2592 deletions(-) Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-14 23:08:16 -06:00
Christoph Hellwig	6ef3554422	xfs: change the xfs_iext_insert / xfs_iext_remove Change the xfs_iext_insert / xfs_iext_remove prototypes to pass more information which will allow pushing the trace points from the callers into those functions. This includes folding the whichfork information into the state variable to minimize the addition stack footprint. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-14 23:08:15 -06:00
Christoph Hellwig	7574aa92f9	xfs: cleanup bmap extent state macros Cleanup the extent state macros in the bmap code to use one common set of flags that we can pass to the tracing code later and remove a lot of the macro obsfucation. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-14 23:08:15 -06:00
Linus Torvalds	d0316554d3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits) m68k: rename global variable vmalloc_end to m68k_vmalloc_end percpu: add missing per_cpu_ptr_to_phys() definition for UP percpu: Fix kdump failure if booted with percpu_alloc=page percpu: make misc percpu symbols unique percpu: make percpu symbols in ia64 unique percpu: make percpu symbols in powerpc unique percpu: make percpu symbols in x86 unique percpu: make percpu symbols in xen unique percpu: make percpu symbols in cpufreq unique percpu: make percpu symbols in oprofile unique percpu: make percpu symbols in tracer unique percpu: make percpu symbols under kernel/ and mm/ unique percpu: remove some sparse warnings percpu: make alloc_percpu() handle array types vmalloc: fix use of non-existent percpu variable in put_cpu_var() this_cpu: Use this_cpu_xx in trace_functions_graph.c this_cpu: Use this_cpu_xx for ftrace this_cpu: Use this_cpu_xx in nmi handling this_cpu: Use this_cpu operations in RCU this_cpu: Use this_cpu ops for VM statistics ... Fix up trivial (famous last words) global per-cpu naming conflicts in arch/x86/kvm/svm.c mm/slab.c	2009-12-14 09:58:24 -08:00
Linus Torvalds	3126c136bc	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (21 commits) ext3: PTR_ERR return of wrong pointer in setup_new_group_blocks() ext3: Fix data / filesystem corruption when write fails to copy data ext4: Support for 64-bit quota format ext3: Support for vfsv1 quota format quota: Implement quota format with 64-bit space and inode limits quota: Move definition of QFMT_OCFS2 to linux/quota.h ext2: fix comment in ext2_find_entry about return values ext3: Unify log messages in ext3 ext2: clear uptodate flag on super block I/O error ext2: Unify log messages in ext2 ext3: make "norecovery" an alias for "noload" ext3: Don't update the superblock in ext3_statfs() ext3: journal all modifications in ext3_xattr_set_handle ext2: Explicitly assign values to on-disk enum of filetypes quota: Fix WARN_ON in lookup_one_len const: struct quota_format_ops ubifs: remove manual O_SYNC handling afs: remove manual O_SYNC handling kill wait_on_page_writeback_range vfs: Implement proper O_SYNC semantics ...	2009-12-11 15:31:13 -08:00
Linus Torvalds	f4d544ee57	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: Fix error return for fallocate() on XFS xfs: cleanup dmapi macros in the umount path xfs: remove incorrect sparse annotation for xfs_iget_cache_miss xfs: kill the STATIC_INLINE macro xfs: uninline xfs_get_extsz_hint xfs: rename xfs_attr_fetch to xfs_attr_get_int xfs: simplify xfs_buf_get / xfs_buf_read interfaces xfs: remove IO_ISAIO xfs: Wrapped journal record corruption on read at recovery xfs: cleanup data end I/O handlers xfs: use WRITE_SYNC_PLUG for synchronous writeout xfs: reset the i_iolock lock class in the reclaim path xfs: I/O completion handlers must use NOFS allocations xfs: fix mmap_sem/iolock inversion in xfs_free_eofblocks xfs: simplify inode teardown	2009-12-11 15:30:29 -08:00
Jason Gunthorpe	44a743f687	xfs: Fix error return for fallocate() on XFS Noticed that through glibc fallocate would return 28 rather than -1 and errno = 28 for ENOSPC. The xfs routines uses XFS_ERROR format positive return error codes while the syscalls use negative return codes. Fixup the two cases in xfs_vn_fallocate syscall to convert to negative. Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:23 -06:00
Christoph Hellwig	30ac0683dd	xfs: cleanup dmapi macros in the umount path Stop the flag saving as we never mangle those in the unmount path, and hide all the weird arguents to the dmapi code inside the XFS_SEND_PREUNMOUNT / XFS_SEND_UNMOUNT macros. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:23 -06:00
Christoph Hellwig	0c3dc2b02a	xfs: remove incorrect sparse annotation for xfs_iget_cache_miss xfs_iget_cache_miss does not get called with the pag_ici_lock held, so the __releases annotation is incorrect. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:23 -06:00
Christoph Hellwig	b8f82a4a6f	xfs: kill the STATIC_INLINE macro Remove our own STATIC_INLINE macro. For small function inside implementation files just use STATIC and let gcc inline it, and for those in headers do the normal static inline - they are all small enough to be inlined for debug builds, too. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:22 -06:00
Christoph Hellwig	5683f53e36	xfs: uninline xfs_get_extsz_hint This function is too large to efficiently be inlined. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:22 -06:00
Christoph Hellwig	e82fa0c7ca	xfs: rename xfs_attr_fetch to xfs_attr_get_int Using a totally different name for the low-level get operation does not fit the _int convention used in the rest of the attr code, so rename it. While we're at it also fix the prototype to use the normal convention and mark it static as it's never used outside of xfs_attr.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:22 -06:00
Christoph Hellwig	6ad112bfb5	xfs: simplify xfs_buf_get / xfs_buf_read interfaces Currently the low-level buffer cache interfaces are highly confusing as we have a _flags variant of each that does actually respect the flags, and one without _flags which has a flags argument that gets ignored and overriden with a default set. Given that very few places use the default arguments get rid of the duplication and convert all callers to pass the flags explicitly. Also remove the now confusing _flags postfix. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:21 -06:00
Christoph Hellwig	c355c656fe	xfs: remove IO_ISAIO We set the IO_ISAIO flag for all read/write I/O since early Linux 2.6.x. Remove it as it has lost it's purpose long ago. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:21 -06:00
Andy Poling	fc5bc4c85c	xfs: Wrapped journal record corruption on read at recovery Summary of problem: If a journal record wraps at the physical end of the journal, it has to be read in two parts in xlog_do_recovery_pass(): a read at the physical end and a read at the physical beginning. If xlog_bread() has to re-align the first read, the second read request does not take that re-alignment into account. If the first read was re-aligned, the second read over-writes the end of the data from the first read, effectively corrupting it. This can happen either when reading the record header or reading the record data. The first sanity check in xlog_recover_process_data() is to check for a valid clientid, so that is the error reported. Summary of fix: If there was a first read at the physical end, XFS_BUF_PTR() returns where the data was requested to begin. Conversely, because it is the result of xlog_align(), offset indicates where the requested data for the first read actually begins - whether or not xlog_bread() has re-aligned it. Using offset as the base for the calculation of where to place the second read data ensures that it will be correctly placed immediately following the data from the first read instead of sometimes over-writing the end of it. The attached patch has resolved the reported problem of occasional inability to recover the journal (reporting "bad clientid"). Signed-off-by: Andy Poling <andy@realbig.com> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:21 -06:00
Christoph Hellwig	5ec4fabb02	xfs: cleanup data end I/O handlers Currently we have different end I/O handlers for read vs the different types of write I/O. But they are all very similar so we could just use one with a few conditionals and reduce code size a lot. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:20 -06:00
Christoph Hellwig	06342cf8ad	xfs: use WRITE_SYNC_PLUG for synchronous writeout The VM and I/O schedulers now expect us to use WRITE_SYNC_PLUG for synchronous writeout. Right now I can't see any changes in performance numbers with this, but we're getting some beating for not using it, and the knowledge definitely could help the block code to make better decisions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:20 -06:00
Christoph Hellwig	033da48fda	xfs: reset the i_iolock lock class in the reclaim path The iolock is used for protecting reads, writes and block truncates against each other. We have two classes of callers, the first one is induced by a file operation and requires a reference to the inode be held and not dropped after the operation is done: - xfs_vm_vmap, xfs_vn_fallocate, xfs_read, xfs_write, xfs_splice_read, xfs_splice_write and xfs_setattr are all implementations of VFS methods that require a live inode - xfs_getbmap and xfs_swap_extents are ioctl subcommand for which the same is true - xfs_truncate_file is only called on quota inodes just returned from xfs_iget - xfs_sync_inode_data does the lock just after an igrab() - xfs_filestream_associate and xfs_filestream_new_ag take the iolock on the parent inode of an inode which by VFS rules must be referenced And we have various calls to truncate blocks past EOF or the whole file when dropping the last reference to an inode. Unfortunately lockdep complains when we do memory allocations that can recurse into the filesystem in the first class because the second class happens to take the same lock. To avoid this re-init the iolock in the beginning of xfs_fs_clear_inode to get a new lock class. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:20 -06:00
Christoph Hellwig	80641dc66a	xfs: I/O completion handlers must use NOFS allocations When completing I/O requests we must not allow the memory allocator to recurse into the filesystem, as we might deadlock on waiting for the I/O completion otherwise. The only thing currently allocating normal GFP_KERNEL memory is the allocation of the transaction structure for the unwritten extent conversion. Add a memflags argument to _xfs_trans_alloc to allow controlling the allocator behaviour. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Thomas Neumann <tneumann@users.sourceforge.net> Tested-by: Thomas Neumann <tneumann@users.sourceforge.net> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:20 -06:00
Christoph Hellwig	c56c9631cb	xfs: fix mmap_sem/iolock inversion in xfs_free_eofblocks When xfs_free_eofblocks is called from ->release the VM might already hold the mmap_sem, but in the write path we take the iolock before taking the mmap_sem in the generic write code. Switch xfs_free_eofblocks to only trylock the iolock if called from ->release and skip trimming the prellocated blocks in that case. We'll still free them later on the final iput. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:19 -06:00
Christoph Hellwig	848ce8f731	xfs: simplify inode teardown Currently the reclaim code for the case where we don't reclaim the final reclaim is overly complicated. We know that the inode is clean but instead of just directly reclaiming the clean inode we go through the whole process of marking the inode reclaimable just to directly reclaim it from the calling context. Besides being overly complicated this introduces a race where iget could recycle an inode between marked reclaimable and actually being reclaimed leading to panics. This patch gets rid of the existing reclaim path, and replaces it with a simple call to xfs_ireclaim if the inode was clean. While we're at it we also use the slightly more lax xfs_inode_clean check we'd use later to determine if we need to flush the inode here. Finally get rid of xfs_reclaim function and place the remaining small bits of reclaim code directly into xfs_fs_destroy_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Patrick Schreurs <patrick@news-service.com> Reported-by: Tommy van Leeuwen <tommy@news-service.com> Tested-by: Patrick Schreurs <patrick@news-service.com> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-12-11 15:11:19 -06:00
Christoph Hellwig	6b2f3d1f76	vfs: Implement proper O_SYNC semantics While Linux provided an O_SYNC flag basically since day 1, it took until Linux 2.4.0-test12pre2 to actually get it implemented for filesystems, since that day we had generic_osync_around with only minor changes and the great "For now, when the user asks for O_SYNC, we'll actually give O_DSYNC" comment. This patch intends to actually give us real O_SYNC semantics in addition to the O_DSYNC semantics. After Jan's O_SYNC patches which are required before this patch it's actually surprisingly simple, we just need to figure out when to set the datasync flag to vfs_fsync_range and when not. This patch renames the existing O_SYNC flag to O_DSYNC while keeping it's numerical value to keep binary compatibility, and adds a new real O_SYNC flag. To guarantee backwards compatiblity it is defined as expanding to both the O_DSYNC and the new additional binary flag (__O_SYNC) to make sure we are backwards-compatible when compiled against the new headers. This also means that all places that don't care about the differences can just check O_DSYNC and get the right behaviour for O_SYNC, too - only places that actuall care need to check __O_SYNC in addition. Drivers and network filesystems have been updated in a fail safe way to always do the full sync magic if O_DSYNC is set. The few places setting O_SYNC for lower layers are kept that way for now to stay failsafe. We enforce that O_DSYNC is set when __O_SYNC is set early in the open path to make sure we always get these sane options. Note that parisc really screwed up their headers as they already define a O_DSYNC that has always been a no-op. We try to repair it by using it for the new O_DSYNC and redefinining O_SYNC to send both the traditional O_SYNC numerical value _and_ the O_DSYNC one. Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger@sun.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Acked-by: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz>	2009-12-10 15:02:50 +01:00
Linus Torvalds	4ef58d4e2a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits) tree-wide: fix misspelling of "definition" in comments reiserfs: fix misspelling of "journaled" doc: Fix a typo in slub.txt. inotify: remove superfluous return code check hdlc: spelling fix in find_pvc() comment doc: fix regulator docs cut-and-pasteism mtd: Fix comment in Kconfig doc: Fix IRQ chip docs tree-wide: fix assorted typos all over the place drivers/ata/libata-sff.c: comment spelling fixes fix typos/grammos in Documentation/edac.txt sysctl: add missing comments fs/debugfs/inode.c: fix comment typos sgivwfb: Make use of ARRAY_SIZE. sky2: fix sky2_link_down copy/paste comment error tree-wide: fix typos "couter" -> "counter" tree-wide: fix typos "offest" -> "offset" fix kerneldoc for set_irq_msi() spidev: fix double "of of" in comment comment typo fix: sybsystem -> subsystem ...	2009-12-09 19:43:33 -08:00
Linus Torvalds	6035ccd8e9	Merge branch 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits) cfq-iosched: Do not access cfqq after freeing it block: include linux/err.h to use ERR_PTR cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit blkio: Allow CFQ group IO scheduling even when CFQ is a module blkio: Implement dynamic io controlling policy registration blkio: Export some symbols from blkio as its user CFQ can be a module block: Fix io_context leak after failure of clone with CLONE_IO block: Fix io_context leak after clone with CLONE_IO cfq-iosched: make nonrot check logic consistent io controller: quick fix for blk-cgroup and modular CFQ cfq-iosched: move IO controller declerations to a header file cfq-iosched: fix compile problem with !CONFIG_CGROUP blkio: Documentation blkio: Wait on sync-noidle queue even if rq_noidle = 1 blkio: Implement group_isolation tunable blkio: Determine async workload length based on total number of queues blkio: Wait for cfq queue to get backlogged if group is empty blkio: Propagate cgroup weight updation to cfq groups blkio: Drop the reference to queue once the task changes cgroup blkio: Provide some isolation between groups ...	2009-12-08 08:19:16 -08:00
Linus Torvalds	1557d33007	Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits) security/tomoyo: Remove now unnecessary handling of security_sysctl. security/tomoyo: Add a special case to handle accesses through the internal proc mount. sysctl: Drop & in front of every proc_handler. sysctl: Remove CTL_NONE and CTL_UNNUMBERED sysctl: kill dead ctl_handler definitions. sysctl: Remove the last of the generic binary sysctl support sysctl net: Remove unused binary sysctl code sysctl security/tomoyo: Don't look at ctl_name sysctl arm: Remove binary sysctl support sysctl x86: Remove dead binary sysctl support sysctl sh: Remove dead binary sysctl support sysctl powerpc: Remove dead binary sysctl support sysctl ia64: Remove dead binary sysctl support sysctl s390: Remove dead sysctl binary support sysctl frv: Remove dead binary sysctl support sysctl mips/lasat: Remove dead binary sysctl support sysctl drivers: Remove dead binary sysctl support sysctl crypto: Remove dead binary sysctl support sysctl security/keys: Remove dead binary sysctl support sysctl kernel: Remove binary sysctl logic ...	2009-12-08 07:38:50 -08:00
Jiri Kosina	d014d04386	Merge branch 'for-next' into for-linus Conflicts: kernel/irq/chip.c	2009-12-07 18:36:35 +01:00
André Goddard Rosa	af901ca181	tree-wide: fix assorted typos all over the place That is "success", "unknown", "through", "performance", "[re\|un]mapping" , "access", "default", "reasonable", "[con]currently", "temperature" , "channel", "[un]used", "application", "example","hierarchy", "therefore" , "[over\|under]flow", "contiguous", "threshold", "enough" and others. Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-12-04 15:39:55 +01:00
Wu Fengguang	0d99519efe	writeback: remove unused nonblocking and congestion checks - no one is calling wb_writeback and write_cache_pages with wbc.nonblocking=1 any more - lumpy pageout will want to do nonblocking writeback without the congestion wait So remove the congestion checks as suggested by Chris. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Evgeniy Polyakov <zbr@ioremap.net> Cc: Alex Elder <aelder@sgi.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 13:54:25 +01:00
Eric W. Biederman	6d4561110a	sysctl: Drop & in front of every proc_handler. For consistency drop & in front of every proc_handler. Explicity taking the address is unnecessary and it prevents optimizations like stubbing the proc_handlers to NULL. Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joe Perches <joe@perches.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-18 08:37:40 -08:00
Nathaniel W. Turner	6c06f072c2	xfs: copy li_lsn before dropping AIL lock Access to log items on the AIL is generally protected by m_ail_lock; this is particularly needed when we're getting or setting the 64-bit li_lsn on a 32-bit platform. This patch fixes a couple places where we were accessing the log item after dropping the AIL lock on 32-bit machines. This can result in a partially-zeroed log->l_tail_lsn if xfs_trans_ail_delete is racing with xfs_trans_ail_update, and in at least some cases, this can leave the l_tail_lsn with a zero cycle number, which means xlog_space_left will think the log is full (unless CONFIG_XFS_DEBUG is set, in which case we'll trip an ASSERT), leading to processes stuck forever in xlog_grant_log_space. Thanks to Adrian VanderSpek for first spotting the race potential and to Dave Chinner for debug assistance. Signed-off-by: Nathaniel W. Turner <nate@houseofnate.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-11-17 10:26:49 -06:00
Jan Rekorajski	8ec6dba258	XFS bug in log recover with quota (bugzilla id 855) Hi, I was hit by a bug in linux 2.6.31 when XFS is not able to recover the log after a crash if fs was mounted with quotas. Gory details in XFS bugzilla: http://oss.sgi.com/bugzilla/show_bug.cgi?id=855. It looks like wrong struct is used in buffer length check, and the following patch should fix the problem. xfs_dqblk_t has a size of 104+32 bytes, while xfs_disk_dquot_t is 104 bytes long, and this is exactly what I see in system logs - "XFS: dquot too small (104) in xlog_recover_do_dquot_trans." Signed-off-by: Jan Rekorajski <baggins@sith.mimuw.edu.pl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-11-17 10:26:38 -06:00
Eric W. Biederman	ab09203e30	sysctl fs: Remove dead binary sysctl support Now that sys_sysctl is a generic wrapper around /proc/sys .ctl_name and .strategy members of sysctl tables are dead code. Remove them. Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-12 02:04:55 -08:00
Linus Torvalds	a80a66caf8	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/fs/xfs/xfs * 'for-linus' of git://git.kernel.org/pub/scm/fs/xfs/xfs: xfs: fix xfs_quota remove error xfs: free temporary cursor in xfs_dialloc	2009-10-31 12:12:49 -07:00
Ryota Yamauchi	c7ff91d722	xfs: fix xfs_quota remove error The xfs_quota returns ENOSYS when remove command is executed. Reproducable with following steps. # mount -t xfs -o uquota /dev/sda7 /mnt/mp1 # xfs_quota -x -c off -c remove XFS_QUOTARM: Function not implemented. The remove command is allowed during quotaoff, but xfs_fs_set_xstate() checks whether quota is running, and it leads to ENOSYS. To solve this problem, add a check for X_QUOTARM. Signed-off-by: Ryota Yamauchi <r-yamauchi@vf.jp.nec.com> Signed-off-by: Utako Kusaka <u-kusaka@wm.jp.nec.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2009-10-30 09:27:44 +01:00
Eric Sandeen	3b826386d3	xfs: free temporary cursor in xfs_dialloc Commit `bd16956599` seems to have a slight regression where this code path: if (!--searchdistance) { /* * Not in range - save last search * location and allocate a new inode */ ... goto newino; } doesn't free the temporary cursor (tcur) that got dup'd in this function. This leaks an item in the xfs_btree_cur zone, and it's caught on module unload: =========================================================== BUG xfs_btree_cur: Objects remaining on kmem_cache_close() ----------------------------------------------------------- It seems like maybe a single free at the end of the function might be cleaner, but for now put a del_cursor right in this code block similar to the handling in the rest of the function. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Christoph Hellwig <hch@lst.de>	2009-10-30 09:27:07 +01:00
Linus Torvalds	2375669214	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: fix double IRELE in xfs_dqrele_inode	2009-10-29 08:18:25 -07:00
Alex Elder	ba313e68fa	Merge branch 'master' of ssh://oss.sgi.com/oss/git/xfs/xfs into for-linus	2009-10-13 15:47:22 -05:00
Christoph Hellwig	05277c75f6	xfs: fix double IRELE in xfs_dqrele_inode xfs_dqrele_inode calls xfs_iput to release the ilock and a reference and then also calls IRELE which does a second decrement of the reference count. This leads to a premature freeing of inodes when quotas were turned off while the filesystem was mounted. Thanks to Utako Kusaka for reporting the bug and provinding a good testcase. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Utako Kusaka <u-kusaka@wm.jp.nec.com> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-13 13:16:36 -05:00
Linus Torvalds	a372bf8b6a	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: stop calling filemap_fdatawait inside ->fsync fix readahead calculations in xfs_dir2_leaf_getdents() xfs: make sure xfs_sync_fsdata covers the log xfs: mark inodes dirty before issuing I/O xfs: cleanup ->sync_fs xfs: fix xfs_quiesce_data xfs: implement ->dirty_inode to fix timestamp handling	2009-10-09 13:29:42 -07:00
Alex Elder	e09d39968b	Merge branch 'master' into for-linus	2009-10-08 13:53:44 -05:00
Christoph Hellwig	d0800703fe	xfs: stop calling filemap_fdatawait inside ->fsync Now that the VFS actually waits for the data I/O to complete before calling into ->fsync we can stop doing it ourselves. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:02:48 -05:00
Eric Sandeen	8e69ce1471	fix readahead calculations in xfs_dir2_leaf_getdents() This is for bug #850, http://oss.sgi.com/bugzilla/show_bug.cgi?id=850 XFS file system segfaults , repeatedly and 100% reproducable in 2.6.30 , 2.6.31 The above only showed up on a CONFIG_XFS_DEBUG=y kernel, because xfs_bmapi() ASSERTs that it has been asked for at least one map, and it was getting 0. The root cause is that our guesstimated "bufsize" from xfs_file_readdir was fairly small, and the bufsize -= length; in the loop was going negative - except bufsize is a size_t, so it was wrapping to a very large number. Then when we did ra_want = howmany(bufsize + mp->m_dirblksize, mp->m_sb.sb_blocksize) - 1; with that very large number, the (int) ra_want was coming out negative, and a subsequent compare: if (1 + ra_want > map_blocks ... was coming out -true- (negative int compare w/ uint) and we went back to xfs_bmapi() for more, even though we did not need more, and asked for 0 maps, and hit the ASSERT. We have kind of a type mess here, but just keeping bufsize from going negative is probably sufficient to avoid the problem. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:02:12 -05:00
Dave Chinner	dce5065a57	xfs: make sure xfs_sync_fsdata covers the log We want to always cover the log after writing out the superblock, and in case of a synchronous writeout make sure we actually wait for the log to be covered. That way a filesystem that has been sync()ed can be considered clean by log recovery. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:01:49 -05:00
Dave Chinner	932640e8ad	xfs: mark inodes dirty before issuing I/O To make sure they get properly waited on in sync when I/O is in flight and we latter need to update the inode size. Requires a new helper to check if an ioend structure is beyond the current EOF. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:01:26 -05:00
Christoph Hellwig	69961a26b8	xfs: cleanup ->sync_fs Sort out ->sync_fs to not perform a superblock writeback for the wait = 0 case as that is just an optional first pass and the superblock will be written back properly in the next call with wait = 1. Instead perform an opportunistic quota writeback to have less work later. Also remove the freeze special case as we do a proper wait = 1 call in the freeze code anyway. Also rename the function to xfs_fs_sync_fs to match the normal naming convention, update comments and avoid calling into the laptop_mode logic on an error. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:01:03 -05:00
Dave Chinner	c90b07e8dd	xfs: fix xfs_quiesce_data We need to do a synchronous xfs_sync_fsdata to make sure the superblock actually is on disk when we return. Also remove SYNC_BDFLUSH flag to xfs_sync_inodes because that particular flag is never checked. Move xfs_filestream_flush call later to only release inodes after they have been written out. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:00:36 -05:00
Christoph Hellwig	f9581b1443	xfs: implement ->dirty_inode to fix timestamp handling This is picking up on Felix's repost of Dave's patch to implement a .dirty_inode method. We really need this notification because the VFS keeps writing directly into the inode structure instead of going through methods to update this state. In addition to the long-known atime issue we now also have a caller in VM code that updates c/mtime that way for shared writeable mmaps. And I found another one that no one has noticed in practice in the FIFO code. So implement ->dirty_inode to set i_update_core whenever the inode gets externally dirtied, and switch the c/mtime handling to the same scheme we already use for atime (always picking up the value from the Linux inode). Note that this patch also removes the xfs_synchronize_atime call in xfs_reclaim it was superflous as we already synchronize the time when writing the inode via the log (xfs_inode_item_format) or the normal buffers (xfs_iflush_int). In addition also remove the I_CLEAR check before copying the Linux timestamps - now that we always have the Linux inode available we can always use the timestamps in it. Also switch to just using file_update_time for regular reads/writes - that will get us all optimization done to it for free and make sure we notice early when it breaks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:00:03 -05:00
Christoph Lameter	7a9e02d6bb	this_cpu: xfs_icsb_modify_counters does not need "cpu" variable The xfs_icsb_modify_counters() function no longer needs the cpu variable if we use this_cpu_ptr() and we can get rid of get/put_cpu(). Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Olaf Weber <olaf@sgi.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2009-10-03 19:48:23 +09:00
Alexey Dobriyan	f0f37e2f77	const: mark struct vm_struct_operations * mark struct vm_area_struct::vm_ops as const * mark vm_ops in AGP code But leave TTM code alone, something is fishy there with global vm_ops being used. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-27 11:39:25 -07:00
Linus Torvalds	db16826367	Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits) HWPOISON: Enable error_remove_page on btrfs HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs HWPOISON: Add madvise() based injector for hardware poisoned pages v4 HWPOISON: Enable error_remove_page for NFS HWPOISON: Enable .remove_error_page for migration aware file systems HWPOISON: The high level memory error handler in the VM v7 HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process HWPOISON: shmem: call set_page_dirty() with locked page HWPOISON: Define a new error_remove_page address space op for async truncation HWPOISON: Add invalidate_inode_page HWPOISON: Refactor truncate to allow direct truncating of page v2 HWPOISON: check and isolate corrupted free pages v2 HWPOISON: Handle hardware poisoned pages in try_to_unmap HWPOISON: Use bitmask/action code for try_to_unmap behaviour HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2 HWPOISON: Add poison check to page fault handling HWPOISON: Add basic support for poisoned pages in fault handler v3 HWPOISON: Add new SIGBUS error codes for hardware poison signals HWPOISON: Add support for poison swap entries v2 HWPOISON: Export some rmap vma locking to outside world ...	2009-09-24 07:53:22 -07:00
Alexey Dobriyan	8d65af789f	sysctl: remove "struct file *" argument of ->proc_handler It's unused. It isn't needed -- read or write flag is already passed and sysctl shouldn't care about the rest. It _was_ used in two places at arch/frv for some reason. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:04 -07:00
Linus Torvalds	342ff1a1b5	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) trivial: fix typo in aic7xxx comment trivial: fix comment typo in drivers/ata/pata_hpt37x.c trivial: typo in kernel-parameters.txt trivial: fix typo in tracing documentation trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c trivial: remove unnecessary semicolons trivial: Fix duplicated word "options" in comment trivial: kbuild: remove extraneous blank line after declaration of usage() trivial: improve help text for mm debug config options trivial: doc: hpfall: accept disk device to unload as argument trivial: doc: hpfall: reduce risk that hpfall can do harm trivial: SubmittingPatches: Fix reference to renumbered step trivial: fix typos "man[ae]g?ment" -> "management" trivial: media/video/cx88: add __init/__exit macros to cx88 drivers trivial: fix typo in CONFIG_DEBUG_FS in gcov doc trivial: fix missing printk space in amd_k7_smp_check trivial: fix typo s/ketymap/keymap/ in comment trivial: fix typo "to to" in multiple files trivial: fix typos in comments s/DGBU/DBGU/ ...	2009-09-22 07:51:45 -07:00
Alexey Dobriyan	b87221de6a	const: mark remaining super_operations const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:24 -07:00
Alexey Dobriyan	0d54b217a2	const: make struct super_block::s_qcop const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:24 -07:00
Anand Gadiyar	411c940385	trivial: fix typo "for for" in multiple files trivial: fix typo "for for" in multiple files Signed-off-by: Anand Gadiyar <gadiyar@ti.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-09-21 15:14:54 +02:00
Andi Kleen	aa261f549d	HWPOISON: Enable .remove_error_page for migration aware file systems Enable removing of corrupted pages through truncation for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs These should cover most server needs. I chose the set of migration aware file systems for this for now, assuming they have been especially audited. But in general it should be safe for all file systems on the data area that support read/write and truncate. Caveat: the hardware error handler does not take i_mutex for now before calling the truncate function. Is that ok? Cc: tytso@mit.edu Cc: hch@infradead.org Cc: mfasheh@suse.com Cc: aia21@cantab.net Cc: hugh.dickins@tiscali.co.uk Cc: swhiteho@redhat.com Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:16 +02:00
Alex Elder	fdec29c5fc	Merge branch 'master' of git://oss.sgi.com/xfs/xfs into for-linus Conflicts: fs/xfs/linux-2.6/xfs_lrw.c	2009-09-15 21:37:47 -05:00
Jaswinder Singh Rajput	9ef96da6ec	xfs: includecheck fix for fs/xfs/xfs_iops.c fix the following 'make includecheck' warning: fs/xfs/linux-2.6/xfs_iops.c: xfs_acl.h is included more than once. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-09-15 12:30:30 -05:00
Alexey Dobriyan	361735fd8f	xfs: switch to seq_file create_proc_read_entry() is getting deprecated. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-09-15 12:29:24 -05:00
Jan Kara	af0f4414f3	xfs: Convert sync_page_range() to simple filemap_write_and_wait_range() Christoph Hellwig says that it is enough for XFS to call filemap_write_and_wait_range() instead of sync_page_range() because we do all the metadata syncing when forcing the log. CC: Felix Blyakher <felixb@sgi.com> CC: xfs@oss.sgi.com CC: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2009-09-14 17:08:17 +02:00
Christoph Hellwig	4734d401d4	xfs: use correct log reservation when handling ENOSPC in xfs_create We added the ENOSPC handling patch in xfs_create just after it got mered with xfs_mkdir. Change the log reservation to the variable for either the create or mkdir value so it does the right thing if get here for creating a directory. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-09-09 18:19:02 -05:00
Linus Torvalds	18f4c64477	jffs2/jfs/xfs: switch over to 'check_acl' rather than 'permission()' This avoids an indirect call in the VFS for each path component lookup. Well, at least as long as you own the directory in question, and the ACL check is unnecessary. Reviewed-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-08 11:09:04 -07:00
Alex Elder	988abe4075	xfs: xfs_showargs() reports group and project quotas enabled If you enable group or project quotas on an XFS file system, then the mount table presented through /proc/self/mounts erroneously shows that both options are in effect for the file system. The root of the problem is some bad logic in the xfs_showargs() function, which is used to format the file system type-specific options in effect for a file system. The problem originated in this GIT commit: Move platform specific mount option parse out of core XFS code Date: 11/22/07 Author: Dave Chinner SHA1 ID: `a67d7c5f5d` For XFS quotas, project and group quota management are mutually exclusive--only one can be in effect at a time. There are two parts to managing quotas: aggregating usage information; and enforcing limits. It is possible to have a quota in effect (aggregating usage) but not enforced. These features are recorded on an XFS mount point using these flags: XFS_PQUOTA_ACCT - Project quotas are aggregated XFS_GQUOTA_ACCT - Group quotas are aggregated XFS_OQUOTA_ENFD - Project/group quotas are enforced The code in error is in fs/xfs/linux-2.6/xfs_super.c: if (mp->m_qflags & (XFS_PQUOTA_ACCT\|XFS_OQUOTA_ENFD)) seq_puts(m, "," MNTOPT_PRJQUOTA); else if (mp->m_qflags & XFS_PQUOTA_ACCT) seq_puts(m, "," MNTOPT_PQUOTANOENF); if (mp->m_qflags & (XFS_GQUOTA_ACCT\|XFS_OQUOTA_ENFD)) seq_puts(m, "," MNTOPT_GRPQUOTA); else if (mp->m_qflags & XFS_GQUOTA_ACCT) seq_puts(m, "," MNTOPT_GQUOTANOENF); The problem is that XFS_OQUOTA_ENFD will be set in mp->m_qflags if either group or project quotas are enforced, and as a result both MNTOPT_PRJQUOTA and MNTOPT_GRPQUOTA will be shown as mount options. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-09-02 17:02:24 -05:00
Christoph Hellwig	81e251766e	xfs: un-static xfs_inobt_lookup xfs_inobt_lookup is also used in xfs_itable.c, remove the STATIC modifier from it's declaration to fix non-debug builds. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-09-01 20:43:01 -05:00
Christoph Hellwig	3725867dcc	xfs: actually enable the swapext compat handler Fix a small typo in the compat ioctl handler that cause the swapext compat handler to never be called. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Torsten Kaiser <just.for.lkml@googlemail.com> Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-09-01 17:00:46 -05:00
Christoph Hellwig	f4378b6eaf	xfs: actually enable the swapext compat handler Fix a small typo in the compat ioctl handler that cause the swapext compat handler to never be called. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Torsten Kaiser <just.for.lkml@googlemail.com> Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-09-01 16:55:53 -05:00
Christoph Hellwig	aa72a5cf00	xfs: simplify xfs_trans_iget xfs_trans_iget is a wrapper for xfs_iget that adds the inode to the transaction after it is read. Except when the inode already is in the inode cache, in which case it returns the existing locked inode with increment lock recursion counts. Now, no one in the tree every decrements these lock recursion counts, so any user of this gets a potential double unlock when both the original owner of the inode and the xfs_trans_iget caller unlock it. When looking back in a git bisect in the historic XFS tree there was only one place that decremented these counts, xfs_trans_iput. Introduced in commit ca25df7a840f426eb566d52667b6950b92bb84b5 by Adam Sweeney in 1993, and removed in commit 19f899a3ab155ff6a49c0c79b06f2f61059afaf3 by Steve Lord in 2003. And as long as it didn't slip through git bisects cracks never actually used in that time frame. A quick audit of the callers of xfs_trans_iget shows that no caller really relies on this behaviour fortunately - xfs_ialloc allows this inode from disk so it must not be there before, and all the RT allocator routines only every add each RT bitmap inode once. In addition to removing lots of code and reducing the size of the inode item this patch also avoids the double inode cache lookup in each create/mkdir/mknod transaction. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-09-01 12:46:16 -05:00
Christoph Hellwig	13e6d5cdde	xfs: merge fsync and O_SYNC handling The guarantees for O_SYNC are exactly the same as the ones we need to make for an fsync call (and given that Linux O_SYNC is O_DSYNC the equivalent is fdadatasync, but we treat both the same in XFS), except with a range data writeout. Jan Kara has started unifying these two path for filesystems using the generic helpers, and I've started to look at XFS. The actual transaction commited by xfs_fsync and xfs_write_sync_logforce has a different transaction number, but actually is exactly the same. We'll only use the fsync transaction going forward. One major difference is that xfs_write_sync_logforce never issues a cache flush unless we commit a transaction causing that as a side-effect, which is an obvious bug in the O_SYNC handling. Second all the locking and i_update_size vs i_update_core changes from `978b723712` never made it to xfs_write_sync_logforce, so we add them back. To make xfs_fsync easily usable from the O_SYNC path, the filemap_fdatawait call is moved up to xfs_file_fsync, so that we don't wait on the whole file after we already waited for our portion in xfs_write. We'll also use a plain call to filemap_write_and_wait_range instead of the previous sync_page_rang which did it in two steps including an half-hearted inode write out that doesn't help us. Once we're done with this also remove the now useless i_update_size tracking. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-09-01 12:45:57 -05:00
Dave Chinner	bd16956599	xfs: speed up free inode search Don't search too far - abort if it is outside a certain radius and simply do a linear search for the first free inode. In AGs with a million inodes this can speed up allocation speed by 3-4x. [hch: ported to the new xfs_ialloc.c world order] Signed-off-by: Dave Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-09-01 12:45:48 -05:00
Christoph Hellwig	2187550525	xfs: rationalize xfs_inobt_lookup* Currenly we have a xfs_inobt_lookup* variant for each comparism direction, and all these get all three fields of the inobt records passed, while the common case is just looking for the inode number and we have only marginally more callers than xfs_inobt_lookup* variants. So opencode a direct call to xfs_btree_lookup for the single case where we need all fields, and replace xfs_inobt_lookup* with a xfs_inobt_looku that just takes the inode number and the direction for all other callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-09-01 12:45:39 -05:00
Christoph Hellwig	4254b0bbb1	xfs: untangle xfs_dialloc Clarify the control flow in xfs_dialloc. Factor out a helper to go to the next node from the current one and improve the control flow by expanding composite if statements and using gotos. The xfs_ialloc_next_rec helper is borrowed from Dave Chinners dynamic allocation policy patches. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-09-01 12:45:29 -05:00
Dave Chinner	0b48db80ba	xfs: factor out debug checks from xfs_dialloc and xfs_difree Factor out a common helper from repeated debug checks in xfs_dialloc and xfs_difree. [hch: split out from Dave's dynamic allocation policy patches] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-09-01 12:45:18 -05:00
Christoph Hellwig	afabc24a73	xfs: improve xfs_inobt_update prototype Both callers of xfs_inobt_update have the record in form of a xfs_inobt_rec_incore_t, so just pass a pointer to it instead of the individual variables. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-09-01 12:45:08 -05:00
Christoph Hellwig	2e287a731e	xfs: improve xfs_inobt_get_rec prototype Most callers of xfs_inobt_get_rec need to fill a xfs_inobt_rec_incore_t, and those who don't yet are fine with a xfs_inobt_rec_incore_t, instead of the three individual variables, too. So just change xfs_inobt_get_rec to write the output into a xfs_inobt_rec_incore_t directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-09-01 12:44:56 -05:00
Dave Chinner	85c0b2ab5e	xfs: factor out inode initialisation Factor out code to initialize new inode clusters into a function of it's own. This keeps xfs_ialloc_ag_alloc smaller and better structured and enables a future inode cluster initialization transaction. Also initialize the agno variable earlier in xfs_ialloc_ag_alloc to avoid repeated byte swaps. [hch: The original patch is from Dave from his unpublished inode create transaction patch series, with some modifcations by me to apply stand-alone] Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-09-01 12:44:27 -05:00
Julia Lawall	a0f7bfd342	fs/xfs: Correct redundant test bp was tested for NULL a few lines before, followed by a return, and there is no intervening modification of its value. A simplified version of the semantic match that finds this problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @r exists@ local idexpression x; expression E; position p1,p2; @@ if (x == NULL \|\| ...) { ... when forall return ...; } ... when != $x=E\\|x--\\|x++\\|--x\\|++x\\|x-=E\\|x+=E\\|x\|=E\\|x&=E\\|&x$ ( x == NULL \| x != NULL ) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Acked-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-31 14:46:22 -05:00
Eric Sandeen	eb00457d62	xfs: remove XFS_INO64_OFFSET Commit `a19d9f887d` removed the ino64 option but left the XFS_INO64_OFFSET define it used in place - just remove it. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-31 14:46:22 -05:00
Eric Sandeen	fef1111ecd	un-static xfs_read_agf CONFIG_XFS_DEBUG builds still need xfs_read_agf to be non-static, oops. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-31 14:46:21 -05:00
Eric Sandeen	d96f8f891f	xfs: add more statics & drop some unused functions A lot more functions could be made static, but they need forward declarations; this does some easy ones, and also found a few unused functions in the process. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-31 14:46:20 -05:00
Christoph Hellwig	bc990f5cb4	xfs: fix locking in xfs_iget_cache_hit The locking in xfs_iget_cache_hit currently has numerous problems: - we clear the reclaim tag without i_flags_lock which protects modifications to it - we call inode_init_always which can sleep with pag_ici_lock held (this is oss.sgi.com BZ #819) - we acquire and drop i_flags_lock a lot and thus provide no consistency between the various flags we set/clear under it This patch fixes all that with a major revamp of the locking in the function. The new version acquires i_flags_lock early and only drops it once we need to call into inode_init_always or before calling xfs_ilock. This patch fixes a bug seen in the wild where we race modifying the reclaim tag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-17 01:23:48 -05:00
Linus Torvalds	78efd1ddd9	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: fix spin_is_locked assert on uni-processor builds xfs: check for dinode realtime flag corruption use XFS_CORRUPTION_ERROR in xfs_btree_check_sblock xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_get xfs: switch to NOFS allocation under i_lock in xfs_readlink_bmap xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_set xfs: switch to NOFS allocation under i_lock in xfs_buf_associate_memory xfs: switch to NOFS allocation under i_lock in xfs_dir_cilookup_result xfs: switch to NOFS allocation under i_lock in xfs_da_buf_make xfs: switch to NOFS allocation under i_lock in xfs_da_state_alloc xfs: switch to NOFS allocation under i_lock in xfs_getbmap xfs: avoid memory allocation under m_peraglock in growfs code	2009-08-12 08:49:35 -07:00
Christoph Hellwig	a8914f3a6d	xfs: fix spin_is_locked assert on uni-processor builds Without SMP or preemption spin_is_locked always returns false, so we can't do an assert with it. Instead use assert_spin_locked, which does the right thing on all builds. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reported-by: Johannes Engel <jcnengel@googlemail.com> Tested-by: Johannes Engel <jcnengel@googlemail.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-12 01:08:27 -05:00
Christoph Hellwig	b89d4208de	xfs: check for dinode realtime flag corruption Ramon tested XFS with a modified version of fsfuzzer and hit a NULL pointer dereference in __xfs_get_blocks due to the RT device target pointer being NULL. To fix this reject inode with the realtime bit set on a a filesystem without an RT subvolume during inode read. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com> Reported-by: Ramon de Carvalho Valle <ramon@risesecurity.org> Tested-by: Ramon de Carvalho Valle <ramon@risesecurity.org> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-12 01:08:21 -05:00
Eric Sandeen	e0c222c411	use XFS_CORRUPTION_ERROR in xfs_btree_check_sblock In Red Hat Bug 512552 - Can't write to XFS mount during raid5 resync a user ran into corruption while resyncing a raid, and we failed a consistency test, but didn't get much more info; it'd be nice to call XFS_CORRUPTION_ERROR here so we can see the buffer contents. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-12 01:08:10 -05:00
Christoph Hellwig	ddd3a14e0f	xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_get xfs_attr_rmtval_get is always called with i_lock held, but i_lock is taken in reclaim context so all allocations under it must avoid recursions into the filesystem. Reported by the new reclaim context tracing in lockdep. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-12 01:08:01 -05:00
Christoph Hellwig	7b02ecb303	xfs: switch to NOFS allocation under i_lock in xfs_readlink_bmap xfs_readlink_bmap is called with i_lock held, but i_lock is taken in reclaim context so all allocations under it must avoid recursions into the filesystem. Reported by the new reclaim context tracing in lockdep. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-12 01:07:53 -05:00
Christoph Hellwig	10746e47e7	xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_set xfs_attr_rmtval_set is always called with i_lock held, and i_lock is taken in reclaim context so all allocations under it must avoid recursions into the filesystem. Reported by the new reclaim context tracing in lockdep. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-12 01:07:44 -05:00
Christoph Hellwig	36fae17a64	xfs: switch to NOFS allocation under i_lock in xfs_buf_associate_memory xfs_buf_associate_memory is used for setting up the spare buffer for the log wrap case in xlog_sync which can happen under i_lock when called from xfs_fsync. The i_lock mutex is taken in reclaim context so all allocations under it must avoid recursions into the filesystem. There are a couple more uses of xfs_buf_associate_memory in the log recovery code that are also affected by this, but I'd rather keep the code simple than passing on a gfp_mask argument. Longer term we should just stop requiring the memoery allocation in xlog_sync by some smaller rework of the buffer layer. Reported by the new reclaim context tracing in lockdep. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-12 01:07:38 -05:00
Christoph Hellwig	3f52c2f0a0	xfs: switch to NOFS allocation under i_lock in xfs_dir_cilookup_result xfs_dir_cilookup_result is always called with i_lock held, but i_lock is taken in reclaim context so all allocations under it must avoid recursions into the filesystem. Reported by the new reclaim context tracing in lockdep. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-12 01:07:23 -05:00
Christoph Hellwig	73195ed786	xfs: switch to NOFS allocation under i_lock in xfs_da_buf_make i_lock is taken in the reclaim context so all allocations under it must avoid recursions into the filesystem. Reported by the new reclaim context tracing in lockdep. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-12 01:07:14 -05:00
Christoph Hellwig	f41d7fb9da	xfs: switch to NOFS allocation under i_lock in xfs_da_state_alloc xfs_da_state_alloc is always called with i_lock held, but i_lock is taken in reclaim context so all allocations under it must avoid recursions into the filesystem. Reported by the new reclaim context tracing in lockdep. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-12 01:07:07 -05:00
Christoph Hellwig	ca35dcd6ca	xfs: switch to NOFS allocation under i_lock in xfs_getbmap xfs_getbmap allocates memory with i_lock held, but i_lock is taken in reclaim context so all allocations under it must avoid recursions into the filesystem. Reported by the new reclaim context tracing in lockdep. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-12 01:06:59 -05:00
Christoph Hellwig	0cc6eee130	xfs: avoid memory allocation under m_peraglock in growfs code Allocate the memory for the larger m_perag array before taking the per-AG lock as the per-AG lock can be taken under the i_lock which can be taken from reclaim context. Reported by the new reclaim context tracing in lockdep. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-08-12 01:06:51 -05:00
Christoph Hellwig	b36ec0428a	xfs: fix freeing of inodes not yet added to the inode cache When freeing an inode that lost race getting added to the inode cache we must not call into ->destroy_inode, because that would delete the inode that won the race from the inode cache radix tree. This patch uses splits a new xfs_inode_free helper out of xfs_ireclaim and uses that plus __destroy_inode to make sure we really only free the memory allocted for the inode that lost the race, and not mess with the inode cache state. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reported-by: Alex Samad <alex@samad.com.au> Reported-by: Andrew Randrianasulu <randrik@mail.ru> Reported-by: Stephane <sharnois@max-t.com> Reported-by: Tommy <tommy@news-service.com> Reported-by: Miah Gregory <mace@darksilence.net> Reported-by: Gabriel Barazer <gabriel@oxeva.fr> Reported-by: Leandro Lucarella <llucax@gmail.com> Reported-by: Daniel Burr <dburr@fami.com.au> Reported-by: Nickolay <newmail@spaces.ru> Reported-by: Michael Guntsche <mike@it-loops.com> Reported-by: Dan Carley <dan.carley+linuxkern-bugs@gmail.com> Reported-by: Michael Ole Olsen <gnu@gmx.net> Reported-by: Michael Weissenbacher <mw@dermichi.com> Reported-by: Martin Spott <Martin.Spott@mgras.net> Reported-by: Christian Kujau <lists@nerdbynature.de> Tested-by: Michael Guntsche <mike@it-loops.com> Tested-by: Dan Carley <dan.carley+linuxkern-bugs@gmail.com> Tested-by: Christian Kujau <lists@nerdbynature.de>	2009-08-07 14:38:34 -03:00
Christoph Hellwig	54e346215e	vfs: fix inode_init_always calling convention Currently inode_init_always calls into ->destroy_inode if the additional initialization fails. That's not only counter-intuitive because inode_init_always did not allocate the inode structure, but in case of XFS it's actively harmful as ->destroy_inode might delete the inode from a radix-tree that has never been added. This in turn might end up deleting the inode for the same inum that has been instanciated by another process and cause lots of cause subtile problems. Also in the case of re-initializing a reclaimable inode in XFS it would free an inode we still want to keep alive. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-08-07 14:38:25 -03:00
Linus Torvalds	f5266cbd2f	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: bump up nr_to_write in xfs_vm_writepage xfs: reduce bmv_count in xfs_vn_fiemap	2009-07-31 12:17:37 -07:00
Eric Sandeen	c8a4051c37	xfs: bump up nr_to_write in xfs_vm_writepage VM calculation for nr_to_write seems off. Bump it way up, this gets simple streaming writes zippy again. To be reviewed again after Jens' writeback changes. Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Cc: Chris Mason <chris.mason@oracle.com> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-07-31 00:57:11 -05:00
Eric Sandeen	97db39a1f6	xfs: reduce bmv_count in xfs_vn_fiemap commit `6321e3ed2a` caused the full bmv_count's worth of getbmapx structures to get allocated; telling it to do MAXEXTNUM was a bit insane, resulting in ENOMEM every time. Chop it down to something reasonable, the number of slots in the caller's input buffer. If this is too large the caller may get ENOMEM but the reason should not be a mystery, and they can try again with something smaller. We add 1 to the value because in the normal getbmap world, bmv_count includes the header and xfs_getbmap does: nex = bmv->bmv_count - 1; if (nex <= 0) return XFS_ERROR(EINVAL); Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Olaf Weber <olaf@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-07-31 00:56:58 -05:00
Alexey Dobriyan	405f55712d	headers: smp_lock.h redux * Remove smp_lock.h from files which don't need it (including some headers!) * Add smp_lock.h to files which do need it * Make smp_lock.h include conditional in hardirq.h It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT This will make hardirq.h inclusion cheaper for every PREEMPT=n config (which includes allmodconfig/allyesconfig, BTW) Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-07-12 12:22:34 -07:00
Jens Axboe	8aa7e847d8	Fix congestion_wait() sync/async vs read/write confusion Commit `1faa16d228` accidentally broke the bdi congestion wait queue logic, causing us to wait on congestion for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-10 20:31:53 +02:00
Al Viro	1cbd20d820	switch xfs to generic acl caching helpers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-06-24 08:17:07 -04:00
Bartlomiej Zolnierkiewicz	90c699a9ee	block: rename CONFIG_LBD to CONFIG_LBDAF Follow-up to "block: enable by default support for large devices and files on 32-bit archs". Rename CONFIG_LBD to CONFIG_LBDAF to: - allow update of existing [def]configs for "default y" change - reflect that it is used also for large files support nowadays Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-19 08:08:50 +02:00
Felix Blyakher	fd40261354	Merge branch 'master' of git://oss.sgi.com/xfs/xfs into for-linus	2009-06-12 21:28:59 -05:00
Christoph Hellwig	e83f1eb6bf	xfs: fix small mismerge in xfs_vn_mknod Identation got messed up when merging the current_umask changes with the generic ACL support. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-06-12 21:15:31 -05:00
Christoph Hellwig	493b87e5ed	xfs: fix warnings with CONFIG_XFS_QUOTA disabled Fix warnings about unitialized dquot variables by making sure xfs_qm_vop_dqalloc touches it even when quotas are disabled. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-06-12 21:15:12 -05:00
Felix Blyakher	7747a0b0af	xfs: fix freeing memory in xfs_getbmap() Regression from commit `28e211700a`. Need to free temporary buffer allocated in xfs_getbmap(). Signed-off-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Hedi Berriche <hedi@sgi.com> Reported-by: Justin Piszcz <jpiszcz@lucidpixels.com> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-06-12 10:26:52 -05:00
Christoph Hellwig	f95022161d	xfs: remove ->write_super and stop maintaining ->s_dirt the write_super method is used for (1) writing back the superblock periodically from pdflush (2) called just before ->sync_fs for data integerity syncs We don't need (1) because we have our own peridoc writeout through xfssyncd, and we don't need (2) because xfs_fs_sync_fs performs a proper synchronous superblock writeout after all other data and metadata has been written out. Also remove ->s_dirt tracking as it's only used to decide when too call ->write_super. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-06-11 21:36:10 -04:00
Felix Blyakher	35fd035968	Merge branch 'master' of git://git.kernel.org/pub/scm/fs/xfs/xfs	2009-06-11 16:56:49 -05:00
Linus Torvalds	c9059598ea	Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits) block: add request clone interface (v2) floppy: fix hibernation ramdisk: remove long-deprecated "ramdisk=" boot-time parameter fs/bio.c: add missing __user annotation block: prevent possible io_context->refcount overflow Add serial number support for virtio_blk, V4a block: Add missing bounce_pfn stacking and fix comments Revert "block: Fix bounce limit setting in DM" cciss: decode unit attention in SCSI error handling code cciss: Remove no longer needed sendcmd reject processing code cciss: change SCSI error handling routines to work with interrupts enabled. cciss: separate error processing and command retrying code in sendcmd_withirq_core() cciss: factor out fix target status processing code from sendcmd functions cciss: simplify interface of sendcmd() and sendcmd_withirq() cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code cciss: Use schedule_timeout_uninterruptible in SCSI error handling code block: needs to set the residual length of a bidi request Revert "block: implement blkdev_readpages" block: Fix bounce limit setting in DM Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt ... Manually fix conflicts with tracing updates in: block/blk-sysfs.c drivers/ide/ide-atapi.c drivers/ide/ide-cd.c drivers/ide/ide-floppy.c drivers/ide/ide-tape.c include/trace/events/block.h kernel/trace/blktrace.c	2009-06-11 11:10:35 -07:00
Christoph Hellwig	ef14f0c157	xfs: use generic Posix ACL code This patch rips out the XFS ACL handling code and uses the generic fs/posix_acl.c code instead. The ondisk format is of course left unchanged. This also introduces the same ACL caching all other Linux filesystems do by adding pointers to the acl and default acl in struct xfs_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-10 17:07:47 +02:00
Christoph Hellwig	8b5403a6d7	xfs: remove SYNC_BDFLUSH SYNC_BDFLUSH is a leftover from IRIX and rather misnamed for todays code. Make xfs_sync_fsdata and xfs_dq_sync use the SYNC_TRYLOCK flag for not blocking on logs just as the inode sync code already does. For xfs_sync_fsdata it's a trivial 1:1 replacement, but for xfs_qm_sync I use the opportunity to decouple the non-blocking lock case from the different flushing modes, similar to the inode sync code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:37:16 +02:00
Christoph Hellwig	b0710ccc6d	xfs: remove SYNC_IOWAIT We want to wait for all I/O to finish when we do data integrity syncs. So there is no reason to keep SYNC_WAIT separate from SYNC_IOWAIT. This causes a little change in behaviour for the ENOSPC flushing code which now does a second submission and wait of buffered I/O, but that should finish ASAP as we already did an asynchronous writeout earlier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:37:11 +02:00
Christoph Hellwig	075fe10286	xfs: split xfs_sync_inodes xfs_sync_inodes is used to write back either file data or inode metadata. In general we always do these separately, except for one fishy case in xfs_fs_put_super that does both. So separate xfs_sync_inodes into separate xfs_sync_data and xfs_sync_attr functions. In xfs_fs_put_super we first call the data sync and then the attr sync as that was the previous order. The moved log force in that path doesn't make a difference because we will force the log again as part of the real unmount process. The filesystem readonly checks are not performed by the new function but instead moved into the callers, given that most callers alredy have it further up in the stack. Also add debug checks that we do not pass in incorrect flags in the new xfs_sync_data and xfs_sync_attr function and fix the one place that did pass in a wrong flag. Also remove a comment mentioning xfs_sync_inodes that has been incorrect for a while because we always take either the iolock or ilock in the sync path these days. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:48 +02:00
Christoph Hellwig	fe588ed328	xfs: use generic inode iterator in xfs_qm_dqrele_all_inodes Use xfs_inode_ag_iterator instead of opencoding the inode walk in the quota code. Mark xfs_inode_ag_iterator and xfs_sync_inode_valid non-static to allow using them from the quota code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:27 +02:00
Dave Chinner	75f3cb1393	xfs: introduce a per-ag inode iterator Given that we walk across the per-ag inode lists so often, it makes sense to introduce an iterator for this. Convert the sync and reclaim code to use this new iterator, quota code will follow in the next patch. Also change xfs_reclaim_inode to return -EGAIN instead of 1 for an inode already under reclaim. This simplifies the AG iterator and doesn't matter for the only other caller. [hch: merged the lookup and execute callbacks back into one to get the pag_ici_lock locking correct and simplify the code flow] Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:14 +02:00
Dave Chinner	abc1064742	xfs: remove unused parameter from xfs_reclaim_inodes The noblock parameter of xfs_reclaim_inodes is only ever set to zero. Remove it and all the conditional code that is never executed. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:12 +02:00
Dave Chinner	1da8eecab5	xfs: factor out inode validation for sync Separate the validation of inodes found by the radix tree walk from the radix tree lookup. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:07 +02:00
Christoph Hellwig	845b6d0cbb	xfs: split inode flushing from xfs_sync_inodes_ag In many cases we only want to sync inode metadata. Split out the inode flushing into a separate helper to prepare factoring the inode sync code. Based on a patch from Dave Chinner, but redone to keep the current behaviour exactly and leave changes to the flushing logic to another patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:05 +02:00
Dave Chinner	5a34d5cd09	xfs: split inode data writeback from xfs_sync_inodes_ag In many cases we only want to sync inode data. Start spliting the inode sync into data sync and inode sync by factoring out the inode data flush. [hch: minor cleanups] Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:03 +02:00
Christoph Hellwig	7d095257e3	xfs: kill xfs_qmops Kill the quota ops function vector and replace it with direct calls or stubs in the CONFIG_XFS_QUOTA=n case. Make sure we check XFS_IS_QUOTA_RUNNING in the right spots. We can remove the number of those checks because the XFS_TRANS_DQ_DIRTY flag can't be set otherwise. This brings us back closer to the way this code worked in IRIX and earlier Linux versions, but we keep a lot of the more useful factoring of common code. Eventually we should also kill xfs_qm_bhv.c, but that's left for a later patch. Reduces the size of the source code by about 250 lines and the size of XFS module by about 1.5 kilobytes with quotas enabled: text data bss dec hex filename 615957 2960 3848 622765 980ad fs/xfs/xfs.o 617231 3152 3848 624231 98667 fs/xfs/xfs.o.old Fallout: - xfs_qm_dqattach is split into xfs_qm_dqattach_locked which expects the inode locked and xfs_qm_dqattach which does the locking around it, thus removing XFS_QMOPT_ILOCKED. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:33:32 +02:00
Christoph Hellwig	0c5e1ce89f	xfs: validate quota log items during log recovery Arkadiusz has seen really strange crashes in xfs_qm_dqcheck that I can only explain by a log item being too smal to actually fit the xfs_dqblk_t we're dereferencing all over xfs_qm_dqcheck. So add graceful checks for NULL or too small quota items to the log recovery code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:33:21 +02:00
Christoph Hellwig	e1696834e8	xfs: update max log size Commit a6634fba3dec4a92f0a2c4e30c80b634c0576ad5 in xfsprogs increased the maximum log size supported by mkfs. Merged back the changes to xfs_fs.h so the growfs enforced the same limit and the headers are in sync. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:32:59 +02:00
Linus Torvalds	4157fd85fc	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: prevent deadlock in xfs_qm_shake() xfs: fix overflow in xfs_growfs_data_private xfs: fix double unlock in xfs_swap_extents()	2009-06-02 09:47:21 -07:00
Felix Blyakher	1b17d76646	xfs: prevent deadlock in xfs_qm_shake() It's possible to recurse into filesystem from the memory allocation, which deadlocks in xfs_qm_shake(). Add check for __GFP_FS, and bail out if it is not set. Signed-off-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Hedi Berriche <hedi@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-06-01 22:59:45 -05:00
Eric Sandeen	e6da7c9fed	xfs: fix overflow in xfs_growfs_data_private In the case where growing a filesystem would leave the last AG too small, the fixup code has an overflow in the calculation of the new size with one fewer ag, because "nagcount" is a 32 bit number. If the new filesystem has > 2^32 blocks in it this causes a problem resulting in an EINVAL return from growfs: # xfs_io -f -c "truncate 19998630180864" fsfile # mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile # mount -o loop fsfile /mnt # xfs_growfs /mnt meta-data=/dev/loop0 isize=256 agcount=52, agsize=76288719 blks = sectsz=512 attr=2 data = bsize=4096 blocks=3905982455, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal bsize=4096 blocks=32768, version=2 = sectsz=512 sunit=0 blks, lazy-count=0 realtime =none extsz=4096 blocks=0, rtextents=0 xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument Reported-by: richard.ems@cape-horn-eng.com Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-06-01 22:59:38 -05:00
Felix Blyakher	1f23920dbf	xfs: fix double unlock in xfs_swap_extents() Regreesion from commit `ef8f7fc`, which rearranged the code in xfs_swap_extents() leading to double unlock of xfs inode ilock. That resulted in xfs_fsr deadlocking itself on platforms, which don't handle double unlock of rw_semaphore nicely. It caused the count go negative, which represents the write holder, without really having one. ia64 is one of the platforms where deadlock was easily reproduced and the fix was tested. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-06-01 22:59:29 -05:00
Felix Blyakher	4156e735d3	xfs: prevent deadlock in xfs_qm_shake() It's possible to recurse into filesystem from the memory allocation, which deadlocks in xfs_qm_shake(). Add check for __GFP_FS, and bail out if it is not set. Signed-off-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Hedi Berriche <hedi@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-06-01 13:13:24 -05:00
Eric Sandeen	096324873f	xfs: fix overflow in xfs_growfs_data_private In the case where growing a filesystem would leave the last AG too small, the fixup code has an overflow in the calculation of the new size with one fewer ag, because "nagcount" is a 32 bit number. If the new filesystem has > 2^32 blocks in it this causes a problem resulting in an EINVAL return from growfs: # xfs_io -f -c "truncate 19998630180864" fsfile # mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile # mount -o loop fsfile /mnt # xfs_growfs /mnt meta-data=/dev/loop0 isize=256 agcount=52, agsize=76288719 blks = sectsz=512 attr=2 data = bsize=4096 blocks=3905982455, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal bsize=4096 blocks=32768, version=2 = sectsz=512 sunit=0 blks, lazy-count=0 realtime =none extsz=4096 blocks=0, rtextents=0 xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument Reported-by: richard.ems@cape-horn-eng.com Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-05-26 17:46:37 -05:00
Martin K. Petersen	e1defc4ff0	block: Do away with the notion of hardsect_size Until now we have had a 1:1 mapping between storage device physical block size and the logical block sized used when addressing the device. With SATA 4KB drives coming out that will no longer be the case. The sector size will be 4KB but the logical block size will remain 512-bytes. Hence we need to distinguish between the physical block size and the logical ditto. This patch renames hardsect_size to logical_block_size. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 23:22:54 +02:00
Felix Blyakher	ec91d1335f	xfs: fix double unlock in xfs_swap_extents() Regreesion from commit `ef8f7fc`, which rearranged the code in xfs_swap_extents() leading to double unlock of xfs inode ilock. That resulted in xfs_fsr deadlocking itself on platforms, which don't handle double unlock of rw_semaphore nicely. It caused the count go negative, which represents the write holder, without really having one. ia64 is one of the platforms where deadlock was easily reproduced and the fix was tested. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-05-08 00:29:44 -05:00
Linus Torvalds	b4348f32da	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: fix getbmap vs mmap deadlock xfs: a couple getbmap cleanups xfs: add more checks to superblock validation xfs_file_last_byte() needs to acquire ilock	2009-05-02 16:52:50 -07:00
Christoph Hellwig	28e211700a	xfs: fix getbmap vs mmap deadlock xfs_getbmap (or rather the formatters called by it) copy out the getbmap structures under the ilock, which can deadlock against mmap. This has been reported via bugzilla a while ago (#717) and has recently also shown up via lockdep. So allocate a temporary buffer to format the kernel getbmap structures into and then copy them out after dropping the locks. A little problem with this is that we limit the number of extents we can copy out by the maximum allocation size, but I see no real way around that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-04-30 00:29:02 -05:00
Christoph Hellwig	5f79ed685f	xfs: a couple getbmap cleanups - reshuffle various conditionals for data vs attr fork to make the code more readable - do fine-grainded goto-based error handling - exit early from conditionals instead of keeping a long else branch around - allow kmem_alloc to fail Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-04-30 00:28:31 -05:00
Olaf Weber	b9ec9068d7	xfs: add more checks to superblock validation There had been reports where xfs filesystem was randomly corrupted with fsfuzzer, and xfs failed to handle it gracefully. This patch fixes couple of reported problem by providing additional checks in the superblock validation routine. Signed-off-by: Olaf Weber <olaf@sgi.com> Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-04-30 00:26:14 -05:00
Lachlan McIlroy	def6b3ba56	xfs_file_last_byte() needs to acquire ilock We had some systems crash with this stack: [<a00000010000cb20>] ia64_leave_kernel+0x0/0x280 [<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs] [<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs] [<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs] [<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs] [<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs] [<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs] [<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs] [<a000000100162ea0>] __fput+0x1a0/0x420 [<a000000100163160>] fput+0x40/0x60 The problem here is that xfs_file_last_byte() does not acquire the inode lock and can therefore race with another thread that is modifying the extext list. While xfs_bmap_last_offset() is trying to lookup what was the last extent some extents were merged and the extent list shrunk so the index we lookup is now beyond the end of the extent list and potentially in a freed buffer. Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-04-30 00:25:25 -05:00
Christoph Hellwig	6321e3ed2a	xfs: fix getbmap vs mmap deadlock xfs_getbmap (or rather the formatters called by it) copy out the getbmap structures under the ilock, which can deadlock against mmap. This has been reported via bugzilla a while ago (#717) and has recently also shown up via lockdep. So allocate a temporary buffer to format the kernel getbmap structures into and then copy them out after dropping the locks. A little problem with this is that we limit the number of extents we can copy out by the maximum allocation size, but I see no real way around that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-04-29 13:25:29 -05:00
Christoph Hellwig	4be4a00fb5	xfs: a couple getbmap cleanups - reshuffle various conditionals for data vs attr fork to make the code more readable - do fine-grainded goto-based error handling - exit early from conditionals instead of keeping a long else branch around - allow kmem_alloc to fail Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-04-29 10:00:01 -05:00
Olaf Weber	2ac00af7a6	xfs: add more checks to superblock validation There had been reports where xfs filesystem was randomly corrupted with fsfuzzer, and xfs failed to handle it gracefully. This patch fixes couple of reported problem by providing additional checks in the superblock validation routine. Signed-off-by: Olaf Weber <olaf@sgi.com> Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-04-29 09:24:29 -05:00
Lachlan McIlroy	f25181f598	xfs_file_last_byte() needs to acquire ilock We had some systems crash with this stack: [<a00000010000cb20>] ia64_leave_kernel+0x0/0x280 [<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs] [<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs] [<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs] [<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs] [<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs] [<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs] [<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs] [<a000000100162ea0>] __fput+0x1a0/0x420 [<a000000100163160>] fput+0x40/0x60 The problem here is that xfs_file_last_byte() does not acquire the inode lock and can therefore race with another thread that is modifying the extext list. While xfs_bmap_last_offset() is trying to lookup what was the last extent some extents were merged and the extent list shrunk so the index we lookup is now beyond the end of the extent list and potentially in a freed buffer. Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-04-29 09:14:10 -05:00
Li Zefan	0e639bdeef	xfs: use memdup_user() Remove open-coded memdup_user() Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:02:51 -04:00
Linus Torvalds	3c1795cc4b	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: remove xfs_flush_space xfs: flush delayed allcoation blocks on ENOSPC in create xfs: block callers of xfs_flush_inodes() correctly xfs: make inode flush at ENOSPC synchronous xfs: use xfs_sync_inodes() for device flushing xfs: inform the xfsaild of the push target before sleeping xfs: prevent unwritten extent conversion from blocking I/O completion xfs: fix double free of inode xfs: validate log feature fields correctly	2009-04-13 14:35:13 -07:00
Felix Blyakher	dc2a5536d6	Merge branch 'master' into for-linus	2009-04-09 14:12:07 -05:00
Dave Chinner	8de2bf937a	xfs: remove xfs_flush_space The only thing we need to do now when we get an ENOSPC condition during delayed allocation reservation is flush all the other inodes with delalloc blocks on them and retry without EOF preallocation. Remove the unneeded mess that is xfs_flush_space() and just call xfs_flush_inodes() directly from xfs_iomap_write_delay(). Also, change the location of the retry label to avoid trying to do EOF preallocation because we don't want to do that at ENOSPC. This enables us to remove the BMAPI_SYNC flag as it is no longer used. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:49:12 +02:00
Dave Chinner	153fec43ce	xfs: flush delayed allcoation blocks on ENOSPC in create If we are creating lots of small files, we can fail to get a reservation for inode create earlier than we should due to EOF preallocation done during delayed allocation reservation. Hence on the first reservation ENOSPC failure flush all the delayed allocation blocks out of the system and retry. This fixes the last commonly triggered spurious ENOSPC issue that has been reported. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:48:30 +02:00
Dave Chinner	e43afd72d2	xfs: block callers of xfs_flush_inodes() correctly xfs_flush_inodes() currently uses a magic timeout to wait for some inodes to be flushed before returning. This isn't really reliable but used to be the best that could be done due to deadlock potential of waiting for the entire flush. Now the inode flush is safe to execute while we hold page and inode locks, we can wait for all the inodes to flush synchronously. Convert the wait mechanism to a completion to do this efficiently. This should remove all remaining spurious ENOSPC errors from the delayed allocation reservation path. This is extracted almost line for line from a larger patch from Mikulas Patocka. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:47:27 +02:00
Dave Chinner	5825294edd	xfs: make inode flush at ENOSPC synchronous When we are writing to a single file and hit ENOSPC, we trigger a background flush of the inode and try again. Because we hold page locks and the iolock, the flush won't proceed until after we release these locks. This occurs once we've given up and ENOSPC has been reported. Hence if this one is the only dirty inode in the system, we'll get an ENOSPC prematurely. To fix this, remove the async flush from the allocation routines and move it to the top of the write path where we can do a synchronous flush and retry the write again. Only retry once as a second ENOSPC indicates that we really are ENOSPC. This avoids a page cache deadlock when trying to do this flush synchronously in the allocation layer that was identified by Mikulas Patocka. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:45:44 +02:00
Dave Chinner	a8d770d987	xfs: use xfs_sync_inodes() for device flushing Currently xfs_device_flush calls sync_blockdev() which is a no-op for XFS as all it's metadata is held in a different address to the one sync_blockdev() works on. Call xfs_sync_inodes() instead to flush all the delayed allocation blocks out. To do this as efficiently as possible, do it via two passes - one to do an async flush of all the dirty blocks and a second to wait for all the IO to complete. This requires some modification to the xfs-sync_inodes_ag() flush code to do efficiently. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:44:54 +02:00
Dave Chinner	9d7fef74b2	xfs: inform the xfsaild of the push target before sleeping When trying to reserve log space, we find the amount of space we need, then go to sleep waiting for space. When we are woken, we try to push the tail of the log forward to make sure we have space available. Unfortunately, this means that if there is not space available, and everyone who needs space goes to sleep there is no-one left to push the tail of the log to make space available. Once we have a thread waiting for space to become available, the others queue up behind it in a FIFO, and none of them push the tail of the log. This can result in everyone going to sleep in xlog_grant_log_space() if the first sleeper races with the last I/O that moves the tail of the log forward. With no further I/O tomove the tail of the log, there is nothing to wake the sleepers and hence all transactions just stop. Fix this by making sure the xfsaild will create enough space for the transaction that is about to sleep by moving the push target far enough forwards to ensure that that the curent proceeees will have enough space available when it is woken. That is, we push the AIL before we go to sleep. Because we've inserted the log ticket into the queue before we've pushed and gone to sleep, subsequent transactions will wait behind this one. Hence we are guaranteed to have space available when we are woken. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:42:59 +02:00
Dave Chinner	c626d174cf	xfs: prevent unwritten extent conversion from blocking I/O completion Unwritten extent conversion can recurse back into the filesystem due to memory allocation. Memory reclaim requires I/O completions to be processed to allow the callers to make progress. If the I/O completion workqueue thread is doing the recursion, then we have a deadlock situation. Move unwritten extent completion into it's own workqueue so it doesn't block I/O completions for normal delayed allocation or overwrite data. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:42:11 +02:00
Dave Chinner	705db3fd46	xfs: fix double free of inode If we fail to initialise the VFS inode in inode_init_always(), it will call ->delete_inode internally resulting in the inode being freed. Hence we need to delay the call to inode_init_always() until after the XFS inode is sufficient set up to handle a call to ->delete_inode, and then if that fails do not touch the inode again at all as it has been freed. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:40:17 +02:00
Dave Chinner	a6cb767e24	xfs: validate log feature fields correctly If the large log sector size feature bit is set in the superblock by accident (say disk corruption), the then fields that are now considered valid are not checked on production kernels. The checks are present as ASSERT statements so cause a panic on a debug kernel. Change this so that the fields are validity checked if the feature bit is set and abort the log mount if the fields do not contain valid values. Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:39:27 +02:00
Linus Torvalds	ac7c1a776d	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (61 commits) Revert "xfs: increase the maximum number of supported ACL entries" xfs: cleanup uuid handling xfs: remove m_attroffset xfs: fix various typos xfs: pagecache usage optimization xfs: remove m_litino xfs: kill ino64 mount option xfs: kill mutex_t typedef xfs: increase the maximum number of supported ACL entries xfs: factor out code to find the longest free extent in the AG xfs: kill VN_BAD xfs: kill vn_atime_* helpers. xfs: cleanup xlog_bread xfs: cleanup xlog_recover_do_trans xfs: remove another leftover of the old inode log item format xfs: cleanup log unmount handling Fix xfs debug build breakage by pushing xfs_error.h after xfs: include header files for prototypes xfs: make symbols static xfs: move declaration to header file ...	2009-04-03 09:52:29 -07:00
Linus Torvalds	8fe74cf053	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: Remove two unneeded exports and make two symbols static in fs/mpage.c Cleanup after commit `585d3bc06f` Trim includes of fdtable.h Don't crap into descriptor table in binfmt_som Trim includes in binfmt_elf Don't mess with descriptor table in load_elf_binary() Get rid of indirect include of fs_struct.h New helper - current_umask() check_unsafe_exec() doesn't care about signal handlers sharing New locking/refcounting for fs_struct Take fs_struct handling to new file (fs/fs_struct.c) Get rid of bumping fs_struct refcount in pivot_root(2) Kill unsharing fs_struct in __set_personality()	2009-04-02 21:09:10 -07:00
Felix Blyakher	f36345ff9a	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into for-linus	2009-04-01 16:58:39 -05:00
Nick Piggin	c2ec175c39	mm: page_mkwrite change prototype to match fault Change the page_mkwrite prototype to take a struct vm_fault, and return VM_FAULT_xxx flags. There should be no functional change. This makes it possible to return much more detailed error information to the VM (and also can provide more information eg. virtual_address to the driver, which might be important in some special cases). This is required for a subsequent fix. And will also make it easier to merge page_mkwrite() with fault() in future. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Cc: Felix Blyakher <felixb@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:14 -07:00
Al Viro	ce3b0f8d5c	New helper - current_umask() current->fs->umask is what most of fs_struct users are doing. Put that into a helper function. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:26 -04:00
Felix Blyakher	1aacc064e0	Revert "xfs: increase the maximum number of supported ACL entries" This reverts commit `8b11217173`. Premature unintended commit. Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-31 00:23:37 -05:00
Felix Blyakher	5123bc35d2	Merge branch 'master' of git://git.kernel.org/pub/scm/fs/xfs/xfs	2009-03-30 22:17:44 -05:00
Christoph Hellwig	27174203f5	xfs: cleanup uuid handling The uuid table handling should not be part of a semi-generic uuid library but in the XFS code using it, so move those bits to xfs_mount.c and refactor the whole glob to make it a proper abstraction. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-03-30 10:21:31 +02:00
Christoph Hellwig	1a5902c5d2	xfs: remove m_attroffset With the upcoming v3 inodes the default attroffset needs to be calculated for each specific inode, so we can't cache it in the superblock anymore. Also replace the assert for wrong inode sizes with a proper error check also included in non-debug builds. Note that the ENOSYS return for that might seem odd, but that error is returned by xfs_mount_validate_sb for all theoretically valid but not supported filesystem geometries. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>	2009-03-29 19:26:46 +02:00
Malcolm Parsons	9da096fd13	xfs: fix various typos Signed-off-by: Malcolm Parsons <malcolm.parsons@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-03-29 09:55:42 +02:00
Hisashi Hifumi	bddaafa11a	xfs: pagecache usage optimization Hi. I introduced "is_partially_uptodate" aops for XFS. A page can have multiple buffers and even if a page is not uptodate, some buffers can be uptodate on pagesize != blocksize environment. This aops checks that all buffers which correspond to a part of a file that we want to read are uptodate. If so, we do not have to issue actual read IO to HDD even if a page is not uptodate because the portion we want to read are uptodate. "block_is_partially_uptodate" function is already used by ext2/3/4. With the following patch random read/write mixed workloads or random read after random write workloads can be optimized and we can get performance improvement. I did a performance test using the sysbench. #sysbench --num-threads=4 --max-requests=100000 --test=fileio --file-num=1 \ --file-block-size=8K --file-total-size=1G --file-test-mode=rndrw \ --file-fsync-freq=0 --file-rw-ratio=0.5 run -2.6.29-rc6 Test execution summary: total time: 123.8645s total number of events: 100000 total time taken by event execution: 442.4994 per-request statistics: min: 0.0000s avg: 0.0044s max: 0.3387s approx. 95 percentile: 0.0118s -2.6.29-rc6-patched Test execution summary: total time: 108.0757s total number of events: 100000 total time taken by event execution: 417.7505 per-request statistics: min: 0.0000s avg: 0.0042s max: 0.3217s approx. 95 percentile: 0.0118s arch: ia64 pagesize: 16k blocksize: 4k Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-03-29 09:53:38 +02:00
Christoph Hellwig	6447c36209	xfs: remove m_litino With the upcoming v3 inodes the inode data/attr area size needs to be calculated for each specific inode, so we can't cache it in the superblock anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-03-29 09:51:14 +02:00
Christoph Hellwig	a19d9f887d	xfs: kill ino64 mount option The ino64 mount option adds a fixed offset to 32bit inode numbers to bring them into the 64bit range. There's no need for this kind of debug tool given that it's easy to produce real 64bit inode numbers for testing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-03-29 09:51:08 +02:00
Christoph Hellwig	a0b0b8a5b3	xfs: kill mutex_t typedef People continue to complain about this for weird reasons, but there's really no point in keeping this typedef for a couple of users anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-03-29 09:51:00 +02:00
Felix Blyakher	8b11217173	xfs: increase the maximum number of supported ACL entries With big installation current 25 maximum number of supported ACL entries is not enough any more. Increase the limit to 100. Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-27 17:28:43 -05:00
Dave Chinner	6cc87645e2	xfs: factor out code to find the longest free extent in the AG Signed-off-by: Dave Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2009-03-16 08:29:46 +01:00
Christoph Hellwig	cb4c8cc1e9	xfs: kill VN_BAD Remove this rather pointless wrapper and use is_bad_inode directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-03-16 08:25:25 +01:00
Christoph Hellwig	8fab451e3c	xfs: kill vn_atime_* helpers. Two out of three are unused already, and the third is better done open-coded with a comment describing what's going on here. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-03-16 08:24:46 +01:00
Christoph Hellwig	076e6acb8f	xfs: cleanup xlog_bread Most callers of xlog_bread need to call xlog_align to get the actual offset. Consolidate that call into the main xlog_bread and provide a _xlog_bread for those few that don't want the actual offset. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-03-16 08:24:13 +01:00
Christoph Hellwig	ff0205e032	xfs: cleanup xlog_recover_do_trans Change the big if-elsif-else block handling the different item types into a more natural switch, remove assignments in conditionals and remove an out of place comment from centuries ago on IRIX. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-03-16 08:20:52 +01:00
Christoph Hellwig	dd0bbad81c	xfs: remove another leftover of the old inode log item format There's another little snipplet of code left from the handling of the old inode log item format in xlog_recover_do_inode_trans. Kill it as it can't be reached anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-03-16 08:19:59 +01:00
Christoph Hellwig	21b699c895	xfs: cleanup log unmount handling Kill the current xfs_log_unmount wrapper and opencode the two function calls in the only caller. Rename the current xfs_log_unmount_dealloc to xfs_log_unmount as it undoes xfs_log_mount and the new name makes that more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-03-16 08:19:29 +01:00
Felix Blyakher	da5309cd28	Fix xfs debug build breakage by pushing xfs_error.h after xfs_mount.h, which it depends on. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-15 08:10:25 -05:00
Christoph Hellwig	c141b2928f	xfs: only issues a cache flush on unmount if barriers are enabled Currently we unconditionally issue a flush from xfs_free_buftarg, but since 2.6.29-rc1 this gives a warning in the style of end_request: I/O error, dev vdb, sector 0 Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-06 17:35:12 -06:00
Christoph Hellwig	7d46be4a25	xfs: prevent lockdep false positive in xfs_iget_cache_miss The inode can't be locked by anyone else as we just created it a few lines above and it's not been added to any lookup data structure yet. So use a trylock that must succeed to get around the lockdep warnings. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-06 17:34:59 -06:00
Christoph Hellwig	ff392c497b	xfs: prevent kernel crash due to corrupted inode log format Andras Korn reported an oops on log replay causes by a corrupted xfs_inode_log_format_t passing a 0 size to kmem_zalloc. This patch handles to small or too large numbers of log regions gracefully by rejecting the log replay with a useful error message. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Andras Korn <korn-sgi.com@chardonnay.math.bme.hu> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-06 17:34:45 -06:00
Hannes Eder	7bf446f8b5	xfs: include header files for prototypes Fix this sparse warnings: fs/xfs/linux-2.6/xfs_ioctl.c:72:1: warning: symbol 'xfs_find_handle' was not declared. Should it be static? fs/xfs/linux-2.6/xfs_ioctl.c:249:1: warning: symbol 'xfs_open_by_handle' was not declared. Should it be static? fs/xfs/linux-2.6/xfs_ioctl.c:361:1: warning: symbol 'xfs_readlink_by_handle' was not declared. Should it be static? fs/xfs/linux-2.6/xfs_ioctl.c:496:1: warning: symbol 'xfs_attrmulti_attr_get' was not declared. Should it be static? fs/xfs/linux-2.6/xfs_ioctl.c:525:1: warning: symbol 'xfs_attrmulti_attr_set' was not declared. Should it be static? fs/xfs/linux-2.6/xfs_ioctl.c:555:1: warning: symbol 'xfs_attrmulti_attr_remove' was not declared. Should it be static? fs/xfs/linux-2.6/xfs_ioctl.c:657:1: warning: symbol 'xfs_ioc_space' was not declared. Should it be static? fs/xfs/linux-2.6/xfs_ioctl.c:1340:1: warning: symbol 'xfs_file_ioctl' was not declared. Should it be static? fs/xfs/support/debug.c:65:1: warning: symbol 'xfs_fs_vcmn_err' was not declared. Should it be static? fs/xfs/support/debug.c:112:1: warning: symbol 'xfs_hex_dump' was not declared. Should it be static? Signed-off-by: Hannes Eder <hannes@hanneseder.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-06 17:21:16 -06:00
Hannes Eder	3180e66d77	xfs: make symbols static Instead of the keyword 'static' the macro 'STATIC' is used, so the symbols are still global with CONFIG_XFS_DEBUG. Fix this sparse warnings: fs/xfs/linux-2.6/xfs_super.c:638:1: warning: symbol 'xfs_blkdev_get' was not declared. Should it be static? fs/xfs/linux-2.6/xfs_super.c:655:1: warning: symbol 'xfs_blkdev_put' was not declared. Should it be static? fs/xfs/linux-2.6/xfs_super.c:876:1: warning: symbol 'xfsaild' was not declared. Should it be static? fs/xfs/xfs_bmap.c:6208:1: warning: symbol 'xfs_check_block' was not declared. Should it be static? fs/xfs/xfs_dir2_leaf.c:553:1: warning: symbol 'xfs_dir2_leaf_check' was not declared. Should it be static? Signed-off-by: Hannes Eder <hannes@hanneseder.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-06 17:20:56 -06:00
Hannes Eder	24418492aa	xfs: move declaration to header file Fix this sparse warning: fs/xfs/xfs_da_btree.c:1550:26: warning: symbol 'xfs_default_nameops' was not declared. Should it be static? Signed-off-by: Hannes Eder <hannes@hanneseder.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-06 17:20:15 -06:00
Christoph Hellwig	b79631330a	xfs: only issues a cache flush on unmount if barriers are enabled Currently we unconditionally issue a flush from xfs_free_buftarg, but since 2.6.29-rc1 this gives a warning in the style of end_request: I/O error, dev vdb, sector 0 Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-04 07:31:55 -06:00
Christoph Hellwig	ed93ec3907	xfs: prevent lockdep false positive in xfs_iget_cache_miss The inode can't be locked by anyone else as we just created it a few lines above and it's not been added to any lookup data structure yet. So use a trylock that must succeed to get around the lockdep warnings. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-04 07:31:48 -06:00
Christoph Hellwig	e8fa6b483f	xfs: prevent kernel crash due to corrupted inode log format Andras Korn reported an oops on log replay causes by a corrupted xfs_inode_log_format_t passing a 0 size to kmem_zalloc. This patch handles to small or too large numbers of log regions gracefully by rejecting the log replay with a useful error message. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Andras Korn <korn-sgi.com@chardonnay.math.bme.hu> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-04 07:31:42 -06:00
Felix Blyakher	27e88bf6af	Revert "[XFS] remove old vmap cache" This reverts commit `d2859751cd`. This commit caused regression. We'll try to fix use of new vmap API for next release. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-19 13:15:55 -06:00
Felix Blyakher	7fdf582447	Revert "[XFS] use scalable vmap API" This reverts commit `95f8e302c0`. This commit caused regression. We'll try to fix use of new vmap API for next release. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-19 13:15:44 -06:00
Felix Blyakher	3a011a1719	Revert "[XFS] remove old vmap cache" This reverts commit `d2859751cd`. This commit caused regression. We'll try to fix use of new vmap API for next release. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-18 15:57:51 -06:00
Felix Blyakher	cf7dab8017	Revert "[XFS] use scalable vmap API" This reverts commit `95f8e302c0`. This commit caused regression. We'll try to fix use of new vmap API for next release. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-18 15:41:28 -06:00
Christoph Hellwig	7c8f7af67d	xfs: reject swapext ioctl on swapfiles Swapfiles are magic - I/O is directly initialized by the VM without involving the filesystem. Swapping out extents underneath the VM thus can cause severe problems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-12 19:56:00 +01:00
Christoph Hellwig	264307520b	xfs: fix error handling in xfs_log_mount We can't just call xfs_log_unmount_dealloc on any failure because the ail thread which is torn down by xfs_log_unmount_dealloc might not be initialized yet. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Reported-by: Lachlan McIlroy <lachlan@sgi.com>	2009-02-12 19:55:48 +01:00
Christoph Hellwig	fcafb71b57	xfs: get rid of indirections in the quotaops implementation Currently we call from the nicely abstracted linux quotaops into a ugly multiplexer just to split the calls out at the same boundary again. Rewrite the quota ops handling to remove that obfucation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:47:34 +01:00
Christoph Hellwig	c9a192dcf9	xfs: sanitize qh_lock wrappers Get rid of various obsfucating wrappers for accessing the quota hash lock, we only keep the accessors for accessing the mplist and freelist locks as they encode a multi-level datastructure walk. But make sure all of them are defined in the same way as simple macros. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:47:22 +01:00
Christoph Hellwig	7201813bf5	xfs: use mutex_is_locked in XFS_DQ_IS_LOCKED Now that we have a helper to test if a mutex is held use it instead of our own little hacks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:39:24 +01:00
Christoph Hellwig	e249458220	xfs: remove XFS_QM_LOCK/XFS_QM_UNLOCK/XFS_QM_HOLD/XFS_QM_RELE Remove these macros which only obsfucated the code in rather nast ways. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:38:39 +01:00
Christoph Hellwig	517b5e8c85	xfs: merge xfs_mkdir into xfs_create xfs_create and xfs_mkdir only have minor differences, so merge both of them into a sigle function. While we're at it also make the error handling code more straight-forward. Signed-off-by: Christoph Hellwig <hch@lst.de> Dave Chinner <david@fromorbit.com>	2009-02-09 08:38:02 +01:00
Christoph Hellwig	a568778739	xfs: remove uchar_t/ushort_t/uint_t/ulong_t types Just another set of types obsfucating the code, remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:37:39 +01:00
Christoph Hellwig	0d87e656dd	xfs: remove superflous inobt macros xfs_ialloc_btree.h has a a cuple of macros that only obsfucate the code but don't provide any abstraction benefits. This patches removes those and cleans up the reamaining defintions up a little. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:37:14 +01:00
Christoph Hellwig	7153f8ba2b	xfs: remove iclog calculation special cases Our default has been to always use 8 32KB log buffers for a while now, so remove the special casing for larger block size filesystem to use the same or even lower number of buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:36:46 +01:00
Christoph Hellwig	8e9b6e7fa4	xfs: remove the unused XFS_QMOPT_DQLOCK flag The XFS_QMOPT_DQLOCK flag introduces major complexity in the quota subsystem but isn't actually used anywhere. So remove it and all the hazzles it introduces. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-08 21:51:42 +01:00
Christoph Hellwig	4346cdd464	xfs: cleanup xfs_find_handle Remove the superflous igrab by keeping a reference on the path/file all the time and clean up various bits of surrounding code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-08 21:51:14 +01:00
Josef 'Jeff' Sipek	ef8f7fc549	xfs: cleanup error handling in xfs_swap_extents Use multiple lables for proper error unwinding and get rid of some now superflous variables. Signed-off-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:37:43 +01:00
Christoph Hellwig	d4bb6d0698	xfs: merge xfs_inode_flush into xfs_fs_write_inode Splitting the task for a VFS-induced inode flush into two functions doesn't make any sense, so merge the two functions dealing with it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-04 09:36:19 +01:00
Christoph Hellwig	e1486dea0b	xfs: factor out attr fork reset handling We currently duplicate code to reset the attribute fork after the last attribute has been deleted. Factor this out into a small helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:36:00 +01:00
Christoph Hellwig	c52e9fd8a9	xfs: remove unused XFS_MOUNT_ILOCK/XFS_MOUNT_IUNLOCK These aren't only unused but also reference a lock that doesn't exist anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:34:34 +01:00
Christoph Hellwig	cb3f35bb3b	xfs: tiny cleanup for xfs_link The source and target inodes are guaranteed to never be the same by the VFS, so no need to check for that (and we would get into bad trouble later anyway if that were the case). Also clean up the error handling to use two gotos instead of nested conditions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:34:20 +01:00
Christoph Hellwig	b93b6e434c	xfs: make sure to free the real-time inodes in the mount error path When mount fails after allocating the real-time inodes we currently leak them. Add a new helper to free the real-time inodes which can be used by both the mount and unmount path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:33:58 +01:00
Christoph Hellwig	f9057e3da7	xfs: cleanup error handling in xfs_mountfs: Clean up the error handling in xfs_mountfs. Use readable goto label names, simplify the uuid handling and other error conditions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:31:52 +01:00
Felix Blyakher	43f3f057c5	[XFS] Warn on transaction in flight on read-only remount Till VFS can correctly support read-only remount without racing, use WARN_ON instead of BUG_ON on detecting transaction in flight after quiescing filesystem. Signed-off-by: Felix Blyakher <felixb@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-02-03 11:04:54 -06:00
Dave Chinner	6139a23609	xfs: Check buffer lengths in log recovery Before trying to obtain, read or write a buffer, check that the buffer length is actually valid. If it is not valid, then something read in the recovery process has been corrupted and we should abort recovery. Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Tested-by: Eric Sesterhenn <snakebyte@gmx.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-03 11:01:32 -06:00
Dave Chinner	3228149ceb	xfs: Check buffer lengths in log recovery Before trying to obtain, read or write a buffer, check that the buffer length is actually valid. If it is not valid, then something read in the recovery process has been corrupted and we should abort recovery. Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Tested-by: Eric Sesterhenn <snakebyte@gmx.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-03 10:19:33 -06:00
Eric Sandeen	f0e0059b9c	don't reallocate sxp variable passed into xfs_swapext fixes kernel.org bugzilla 12538, xfs_fsr fails on 2.6.29-rc kernels Regression caused by `743bb4650d` This was an embarrasing mistake, reallocating the sxp pointer passed in from the main ioctl switch. Signed-off-by: Eric Sandeen <sandeen@sandeen.net Reported-by: Paul Martin <pm@debian.org> Tested-by: Paul Martin <pm@debian.org> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-01-27 14:51:39 -06:00

... 30 31 32 33 34 ...

4404 Commits