linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-17 09:31:50 +00:00

Author	SHA1	Message	Date
Glauber Costa	3942c07ccf	fs: bump inode and dentry counters to long This series reworks our current object cache shrinking infrastructure in two main ways: * Noticing that a lot of users copy and paste their own version of LRU lists for objects, we put some effort in providing a generic version. It is modeled after the filesystem users: dentries, inodes, and xfs (for various tasks), but we expect that other users could benefit in the near future with little or no modification. Let us know if you have any issues. * The underlying list_lru being proposed automatically and transparently keeps the elements in per-node lists, and is able to manipulate the node lists individually. Given this infrastructure, we are able to modify the up-to-now hammer called shrink_slab to proceed with node-reclaim instead of always searching memory from all over like it has been doing. Per-node lru lists are also expected to lead to less contention in the lru locks on multi-node scans, since we are now no longer fighting for a global lock. The locks usually disappear from the profilers with this change. Although we have no official benchmarks for this version - be our guest to independently evaluate this - earlier versions of this series were performance tested (details at http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no visible performance regressions while yielding a better qualitative behavior in NUMA machines. With this infrastructure in place, we can use the list_lru entry point to provide memcg isolation and per-memcg targeted reclaim. Historically, those two pieces of work have been posted together. This version presents only the infrastructure work, deferring the memcg work for a later time, so we can focus on getting this part tested. You can see more about the history of such work at http://lwn.net/Articles/552769/ Dave Chinner (18): dcache: convert dentry_stat.nr_unused to per-cpu counters dentry: move to per-sb LRU locks dcache: remove dentries from LRU before putting on dispose list mm: new shrinker API shrinker: convert superblock shrinkers to new API list: add a new LRU list type inode: convert inode lru list to generic lru list code. dcache: convert to use new lru list infrastructure list_lru: per-node list infrastructure shrinker: add node awareness fs: convert inode and dentry shrinking to be node aware xfs: convert buftarg LRU to generic code xfs: rework buffer dispose list tracking xfs: convert dquot cache lru to list_lru fs: convert fs shrinkers to new scan/count API drivers: convert shrinkers to new count/scan API shrinker: convert remaining shrinkers to count/scan API shrinker: Kill old ->shrink API. Glauber Costa (7): fs: bump inode and dentry counters to long super: fix calculation of shrinkable objects for small numbers list_lru: per-node API vmscan: per-node deferred work i915: bail out earlier when shrinker cannot acquire mutex hugepage: convert huge zero page shrinker to new shrinker API list_lru: dynamically adjust node arrays This patch: There are situations in very large machines in which we can have a large quantity of dirty inodes, unused dentries, etc. This is particularly true when umounting a filesystem, where eventually since every live object will eventually be discarded. Dave Chinner reported a problem with this while experimenting with the shrinker revamp patchset. So we believe it is time for a change. This patch just moves int to longs. Machines where it matters should have a big long anyway. Signed-off-by: Glauber Costa <glommer@openvz.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dave Chinner <dchinner@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-10 18:56:29 -04:00
Dave Jones	da5338c749	Add missing unlocks to error paths of mountpoint_last. Signed-off-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-10 18:56:29 -04:00
Al Viro	bcce56d585	... and fold the renamed __vfs_follow_link() into its only caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-10 18:56:29 -04:00
Christoph Hellwig	aac34df117	fs: remove vfs_follow_link For a long time no filesystem has been using vfs_follow_link, and as seen by recent filesystem submissions any new use is accidental as well. Remove vfs_follow_link, document the replacement in Documentation/filesystems/porting and also rename __vfs_follow_link to match its only caller better. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-10 18:56:29 -04:00
Linus Torvalds	b05430fc93	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 3 (of many) from Al Viro: "Waiman's conversion of d_path() and bits related to it, kern_path_mountpoint(), several cleanups and fixes (exportfs one is -stable fodder, IMO). There definitely will be more... ;-/" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: split read_seqretry_or_unlock(), convert d_walk() to resulting primitives dcache: Translating dentry into pathname without taking rename_lock autofs4 - fix device ioctl mount lookup introduce kern_path_mountpoint() rename user_path_umountat() to user_path_mountpoint_at() take unlazy_walk() into umount_lookup_last() Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c prune_super(): sb->s_op is never NULL exportfs: don't assume that ->iterate() won't feed us too long entries afs: get rid of redundant ->d_name.len checks	2013-09-10 12:44:24 -07:00
Linus Torvalds	d0d2727710	vfs: make sure we don't have a stale root path if unlazy_walk() fails When I moved the RCU walk termination into unlazy_walk(), I didn't copy quite all of it: for the successful RCU termination we properly add the necessary reference counts to our temporary copy of the root path, but for the failure case we need to make sure that any temporary root path information is cleared out (since it does _not_ have the proper reference counts from the RCU lookup). We could clean up this mess by just always dropping the temporary root information, but Al points out that that would mean that a single lookup through symlinks could see multiple different root entries if it races with another thread doing chroot. Not that I think we should really care (we had that before too, back before we had a copy of the root path in the nameidata). Al says he has a cunning plan. In the meantime, this is the minimal fix for the problem, even if it's not all that pretty. Reported-by: Mace Moneta <moneta.mace@gmail.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-10 12:17:49 -07:00
Dave Chinner	74ffa796e1	xfs: don't assert fail on bad inode numbers Let the inode verifier do it's work by returning an error when we fail to find correct magic numbers in an inode buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-10 14:07:54 -05:00
Dave Chinner	46f9d2eb37	xfs: aborted buf items can be in the AIL. Saw this on generic/270 after a DQALLOC transaction overrun shutdown: XFS: Assertion failed: !(bip->bli_item.li_flags & XFS_LI_IN_AIL), file: fs/xfs/xfs_buf_item.c, line: 952 ..... xfs_buf_item_relse+0x4f/0xd0 xfs_buf_item_unlock+0x1b4/0x1e0 xfs_trans_free_items+0x7d/0xb0 xfs_trans_cancel+0x13c/0x1b0 xfs_symlink+0x37e/0xa60 .... When a transaction abort occured. If we are aborting a transaction and trigger this code path, then the item may be dirty. If the item is dirty, then it may be in the AIL. Hence if we are aborting, we need to check if the item is in the AIL and remove it before freeing it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-10 13:58:07 -05:00
Dave Chinner	fdd3cceef4	xfs: factor all the kmalloc-or-vmalloc fallback allocations We have quite a few places now where we do: x = kmem_zalloc(large size) if (!x) x = kmem_zalloc_large(large size) and do a similar dance when freeing the memory. kmem_free() already does the correct freeing dance, and kmem_zalloc_large() is only ever called in these constructs, so just factor it all into kmem_zalloc_large() and kmem_free(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-10 13:57:03 -05:00
Dave Chinner	2dc164f296	xfs: fix memory allocation failures with ACLs Ever since increasing the number of supported ACLs from 25 to as many as can fit in an xattr, there have been reports of order 4 memory allocations failing in the ACL code. Fix it in the same way we've fixed all the xattr read/write code that has the same problem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-10 13:56:25 -05:00
Dave Chinner	0a4edc8f0b	xfs: ensure we copy buffer type in da btree root splits When splitting the root of the da btree, we shuffled data between buffers and the structures that track them. At one point, we copy data and state from one buffer to another, including the ops associated with the buffer. When we do this, we also need to copy the buffer type associated with the buf log item so that the buffer is logged correctly. If we don't do that, log recovery won't recognise it and hence it won't recalculate the CRC on the buffer after recovery. This leads to a directory block that can't be read after recovery has run. Found by inspection after finding the same problem with remote symlink buffers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-10 13:34:05 -05:00
Dave Chinner	daf7b799a9	xfs: set remote symlink buffer type for recovery The logging of a remote symlink block does not set the buffer type being logged, and hence on recovery the type of buffer is not recognised and hence CRCs are not calculated after replay. This results in log recoery throwing: XFS (vdc): Unknown buffer type 0 errors, and subsequent reads of the symlink failing CRC verification. Found via fsstress + godown. Reported by: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-10 12:57:09 -05:00
Dave Chinner	638f44163d	xfs: recovery of swap extents operations for CRC filesystems This is the recovery side of the btree block owner change operation performed by swapext on CRC enabled filesystems. We detect that an owner change is needed by the flag that has been placed on the inode log format flag field. Because the inode recovery is being replayed after the buffers that make up the BMBT in the given checkpoint, we can walk all the buffers and directly modify them when we see the flag set on an inode. Because the inode can be relogged and hence present in multiple chekpoints with the "change owner" flag set, we could do multiple passes across the inode to do this change. While this isn't optimal, we can't directly ignore the flag as there may be multiple independent swap extent operations being replayed on the same inode in different checkpoints so we can't ignore them. Further, because the owner change operation uses ordered buffers, we might have buffers that are newer on disk than the current checkpoint and so already have the owner changed in them. Hence we cannot just peek at a buffer in the tree and check that it has the correct owner and assume that the change was completed. So, for the moment just brute force the owner change every time we see an inode with the flag set. Note that we have to be careful here because the owner of the buffers may point to either the old owner or the new owner. Currently the verifier can't verify the owner directly, so there is no failure case here right now. If we verify the owner exactly in future, then we'll have to take this into account. This was tested in terms of normal operation via xfstests - all of the fsr tests now pass without failure. however, we really need to modify xfs/227 to stress v3 inodes correctly to ensure we fully cover this case for v5 filesystems. In terms of recovery testing, I used a hacked version of xfs_fsr that held the temp inode open for a few seconds before exiting so that the filesystem could be shut down with an open owner change recovery flags set on at least the temp inode. fsr leaves the temp inode unlinked and in btree format, so this was necessary for the owner change to be reliably replayed. logprint confirmed the tmp inode in the log had the correct flag set: INO: cnt:3 total:3 a:0x69e9e0 len:56 a:0x69ea20 len:176 a:0x69eae0 len:88 INODE: #regs:3 ino:0x44 flags:0x209 dsize:88 ^^^^^ 0x200 is set, indicating a data fork owner change needed to be replayed on inode 0x44. A printk in the revoery code confirmed that the inode change was recovered: XFS (vdc): Mounting Filesystem XFS (vdc): Starting recovery (logdev: internal) recovering owner change ino 0x44 XFS (vdc): Version 5 superblock detected. This kernel L support enabled! Use of these features in this kernel is at your own risk! XFS (vdc): Ending recovery (logdev: internal) The script used to test this was: $ cat ./recovery-fsr.sh #!/bin/bash dev=/dev/vdc mntpt=/mnt/scratch testfile=$mntpt/testfile umount $mntpt mkfs.xfs -f -m crc=1 $dev mount $dev $mntpt chmod 777 $mntpt for i in `seq 10000 -1 0`; do xfs_io -f -d -c "pwrite $(($i * 4096)) 4096" $testfile > /dev/null 2>&1 done xfs_bmap -vp $testfile \|head -20 xfs_fsr -d -v $testfile & sleep 10 /home/dave/src/xfstests-dev/src/godown -f $mntpt wait umount $mntpt xfs_logprint -t $dev \|tail -20 time mount $dev $mntpt xfs_bmap -vp $testfile umount $mntpt $ Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-10 12:49:57 -05:00
Andy Adamson	9f79fb4825	NFSv4.1 fix decode_free_stateid The operation status is decoded in decode_op_hdr. Stop the print_overflow message that is always hit without this patch: nfs: decode_free_stateid: prematurely hit end of receive buffer. Remaining buffer length is 0 words. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-10 13:04:37 -04:00
Milosz Tanski	9c89d62948	fscache: check consistency does not decrement refcount __fscache_check_consistency() does not decrement the count of operations active after it finishes in the success case. This leads to a hung tasks on cookie de-registration (commonly in inode eviction). INFO: task kworker/1:2:4214 blocked for more than 120 seconds. kworker/1:2 D ffff880443513fc0 0 4214 2 0x00000000 Workqueue: ceph-msgr con_work [libceph] ... Call Trace: [<ffffffff81569fc6>] ? _raw_spin_unlock_irqrestore+0x16/0x20 [<ffffffffa0016570>] ? fscache_wait_bit_interruptible+0x30/0x30 [fscache] [<ffffffff81568d09>] schedule+0x29/0x70 [<ffffffffa001657e>] fscache_wait_atomic_t+0xe/0x20 [fscache] [<ffffffff815665cf>] out_of_line_wait_on_atomic_t+0x9f/0xe0 [<ffffffff81083560>] ? autoremove_wake_function+0x40/0x40 [<ffffffffa0015a9c>] __fscache_relinquish_cookie+0x15c/0x310 [fscache] [<ffffffffa00a4fae>] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph] [<ffffffffa007e373>] ceph_destroy_inode+0x33/0x200 [ceph] [<ffffffff811c13ae>] ? __fsnotify_inode_delete+0xe/0x10 [<ffffffff8119ba1c>] destroy_inode+0x3c/0x70 [<ffffffff8119bb69>] evict+0x119/0x1b0 Signed-off-by: Milosz Tanski <milosz@adfin.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Sage Weil <sage@inktank.com>	2013-09-10 09:04:46 -07:00
Dave Chinner	21b5c9784b	xfs: swap extents operations for CRC filesystems For CRC enabled filesystems, we can't just swap inode forks from one inode to another when defragmenting a file - the blocks in the inode fork bmap btree contain pointers back to the owner inode. Hence if we are to swap the inode forks we have to atomically modify every block in the btree during the transaction. We are doing an entire fork swap here, so we could create a new transaction item type that indicates we are changing the owner of a certain structure from one value to another. If we combine this with ordered buffer logging to modify all the buffers in the tree, then we can change the buffers in the tree without needing log space for the operation. However, this then requires log recovery to perform the modification of the owner information of the objects/structures in question. This does introduce some interesting ordering details into recovery: we have to make sure that the owner change replay occurs after the change that moves the objects is made, not before. Hence we can't use a separate log item for this as we have no guarantee of strict ordering between multiple items in the log due to the relogging action of asynchronous transaction commits. Hence there is no "generic" method we can use for changing the ownership of arbitrary metadata structures. For inode forks, however, there is a simple method of communicating that the fork contents need the owner rewritten - we can pass a inode log format flag for the fork for the transaction that does a fork swap. This flag will then follow the inode fork through relogging actions so when the swap actually gets replayed the ownership can be changed immediately by log recovery. So that gives us a simple method of "whole fork" exchange between two inodes. This is relatively simple to implement, so it makes sense to do this as an initial implementation to support xfs_fsr on CRC enabled filesytems in the same manner as we do on existing filesystems. This commit introduces the swapext driven functionality, the recovery functionality will be in a separate patch. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-10 10:26:47 -05:00
Pavel Shilovsky	42873b0a28	CIFS: Respect epoch value from create lease context v2 that force a client to purge cache pages when a server requests it. Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-09 22:52:18 -05:00
Pavel Shilovsky	f047390a09	CIFS: Add create lease v2 context for SMB3 Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-09 22:52:14 -05:00
Pavel Shilovsky	b5c7cde3fa	CIFS: Move parsing lease buffer to ops struct Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-09 22:52:11 -05:00
Pavel Shilovsky	a41a28bda9	CIFS: Move creating lease buffer to ops struct to make adding new types of lease buffers easier. Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-09 22:52:08 -05:00
Pavel Shilovsky	53ef1016fd	CIFS: Store lease state itself rather than a mapped oplock value and separate smb20_operations struct. Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-09 22:52:05 -05:00
Dave Chinner	0f295a214b	xfs: check magic numbers in dir3 leaf verifier first Calling xfs_dir3_leaf_hdr_from_disk() in a verifier before validating the magic numbers in the buffer results in ASSERT failures due to mismatching magic numbers when a corruption occurs. Seeing as the verifier is supposed to catch the corruption and pass it back to the caller, having the verifier assert fail on error defeats the purpose of detecting the errors in the first place. Check the magic numbers direct from the buffer before decoding the header. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-09 17:43:58 -05:00
Dave Chinner	a30b036797	xfs: fix some minor sparse warnings A couple of simple locking annotations and 0 vs NULL warnings. Nothing that changes any code behaviour, just removes build noise. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-09 17:43:05 -05:00
Dave Chinner	e9fbbad8d8	xfs: fix endian warning in xlog_recover_get_buf_lsn() sparse reports: fs/xfs/xfs_log_recover.c:2017:24: sparse: cast to restricted __be64 Because I used the wrong structure for the on-disk superblock cast in `50d5c8d` ("xfs: check LSN ordering for v5 superblocks during recovery"). Fix it. Reported-by: kbuild test robot Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-09 16:42:48 -05:00
Al Viro	48f5ec21d9	split read_seqretry_or_unlock(), convert d_walk() to resulting primitives Separate "check if we need to retry" from "unlock if we are done and had seq_writelock"; that allows to use these guys in d_walk(), where we need to recheck every time we ascend back to parent, but do not want to unlock until the very end. Lift rcu_read_lock/rcu_read_unlock out into callers. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-09 15:22:25 -04:00
Linus Torvalds	300893b08f	xfs: update for v3.12-rc1 For 3.12-rc1 there are a number of bugfixes in addition to work to ease usage of shared code between libxfs and the kernel, the rest of the work to enable project and group quotas to be used simultaneously, performance optimisations in the log and the CIL, directory entry file type support, fixes for log space reservations, some spelling/grammar cleanups, and the addition of user namespace support. - introduce readahead to log recovery - add directory entry file type support - fix a number of spelling errors in comments - introduce new Q_XGETQSTATV quotactl for project quotas - add USER_NS support - log space reservation rework - CIL optimisations - kernel/userspace libxfs rework -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJSLeikAAoJENaLyazVq6ZOciEP/3tc850sQsPlNwP9aqd1l2Wk S1RJ8i+MUQ2W/PlbswCXvdUCT8DIwXWxL31tGvi8vtaLhh6t8ICSZwqNil+/GCIJ BErVvY4oXhEMHhlbIRRvpxblTfJGiYy3puUEz9VI0yDdUVnC33+DuEeLTQ/0mibo /UUqKFmM3KYpOc8vIQvH5K5i8PkjtMt9yge0k4l9COD30gtY2okkaD4b1voOsKc+ 5YFqulq7zcXBUYti+EFCQeV8aUBTGEPN4PJRdcS12/ylzsTzZivAOO+QREu7qBW8 x+Gj8fOC+yYWCttmJlfa1n8taxge3ndEuzKN97nvvfQgjvvunMvwJ499skryYVdB EcPnBnpDUQuz/y7exKBT9uROK817vZBtfHzSova29ayQSWC+qDpNE4xXeDIqeCtT CPxdHuWMOvIdZg41E4x7je0elaZl8EAZ8hycc2WuRhtukEkIdE1O8aD7IVrMYee8 kg+aVHG5nmYRInO1WuMinbtiCzwvVoBJToWM3y4cbfgW0dILASRyL53HDd+eCr1j kOpPIVgXlBZgiPMmdYahWxyVVWcE7zyex0w4frzWVlJMZ4lP5brppD6qfQg1JwOB z21Y95F5C2GxSyN/Lwps0G6jujHrpe6GVeYK7uKCtnqTD83nSShv5Naln7pQ3AUs qUMsqmJob4+bwt94Xgbx =V4s4 -----END PGP SIGNATURE----- Merge tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs Pull xfs updates from Ben Myers: "For 3.12-rc1 there are a number of bugfixes in addition to work to ease usage of shared code between libxfs and the kernel, the rest of the work to enable project and group quotas to be used simultaneously, performance optimisations in the log and the CIL, directory entry file type support, fixes for log space reservations, some spelling/grammar cleanups, and the addition of user namespace support. - introduce readahead to log recovery - add directory entry file type support - fix a number of spelling errors in comments - introduce new Q_XGETQSTATV quotactl for project quotas - add USER_NS support - log space reservation rework - CIL optimisations - kernel/userspace libxfs rework" * tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs: (112 commits) xfs: XFS_MOUNT_QUOTA_ALL needed by userspace xfs: dtype changed xfs_dir2_sfe_put_ino to xfs_dir3_sfe_put_ino Fix wrong flag ASSERT in xfs_attr_shortform_getvalue xfs: finish removing IOP_* macros. xfs: inode log reservations are too small xfs: check correct status variable for xfs_inobt_get_rec() call xfs: inode buffers may not be valid during recovery readahead xfs: check LSN ordering for v5 superblocks during recovery xfs: btree block LSN escaping to disk uninitialised XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568 xfs: fix bad dquot buffer size in log recovery readahead xfs: don't account buffer cancellation during log recovery readahead xfs: check for underflow in xfs_iformat_fork() xfs: xfs_dir3_sfe_put_ino can be static xfs: introduce object readahead to log recovery xfs: Simplify xfs_ail_min() with list_first_entry_or_null() xfs: Register hotcpu notifier after initialization xfs: add xfs sb v4 support for dirent filetype field xfs: Add write support for dirent filetype field xfs: Add read-only support for dirent filetype field ...	2013-09-09 11:19:09 -07:00
Olof Johansson	45150c43b1	direct-io: Use return from cmpxchg to decide of assignment happened Not using the return value can in the generic case be racy, so it's in general good practice to check the return value instead. This also resolved the warning caused on ARM and other architectures: fs/direct-io.c: In function 'sb_init_dio_done_wq': fs/direct-io.c:557:2: warning: value computed is not used [-Wunused-value] Signed-off-by: Olof Johansson <olof@lixom.net> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: H Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-09 10:47:42 -07:00
Waiman Long	232d2d60aa	dcache: Translating dentry into pathname without taking rename_lock When running the AIM7's short workload, Linus' lockref patch eliminated most of the spinlock contention. However, there were still some left: 8.46% reaim [kernel.kallsyms] [k] _raw_spin_lock \|--42.21%-- d_path \| proc_pid_readlink \| SyS_readlinkat \| SyS_readlink \| system_call \| __GI___readlink \| \|--40.97%-- sys_getcwd \| system_call \| __getcwd The big one here is the rename_lock (seqlock) contention in d_path() and the getcwd system call. This patch will eliminate the need to take the rename_lock while translating dentries into the full pathnames. The need to take the rename_lock is to make sure that no rename operation can be ongoing while the translation is in progress. However, only one thread can take the rename_lock thus blocking all the other threads that need it even though the translation process won't make any change to the dentries. This patch will replace the writer's write_seqlock/write_sequnlock sequence of the rename_lock of the callers of the prepend_path() and __dentry_path() functions with the reader's read_seqbegin/read_seqretry sequence within these 2 functions. As a result, the code will have to retry if one or more rename operations had been performed. In addition, RCU read lock will be taken during the translation process to make sure that no dentries will go away. To prevent live-lock from happening, the code will switch back to take the rename_lock if read_seqretry() fails for three times. To further reduce spinlock contention, this patch does not take the dentry's d_lock when copying the filename from the dentries. Instead, it treats the name pointer and length as unreliable and just copy the string byte-by-byte over until it hits a null byte or the end of string as specified by the length. This should avoid stepping into invalid memory address. The error cases are left to be handled by the sequence number check. The following code re-factoring are also made: 1. Move prepend('/') into prepend_name() to remove one conditional check. 2. Move the global root check in prepend_path() back to the top of the while loop. With this patch, the _raw_spin_lock will now account for only 1.2% of the total CPU cycles for the short workload. This patch also has the effect of reducing the effect of running perf on its profile since the perf command itself can be a heavy user of the d_path() function depending on the complexity of the workload. When taking the perf profile of the high-systime workload, the amount of spinlock contention contributed by running perf without this patch was about 16%. With this patch, the spinlock contention caused by the running of perf will go away and we will have a more accurate perf profile. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-09 13:44:16 -04:00
Artem Savkov	d9b2c8714a	aio: rcu_read_lock protection for new rcu_dereference calls Patch "aio: fix rcu sparse warnings introduced by ioctx table lookup patch" (`77d30b14d2` in linux-next.git) introduced a couple of new rcu_dereference calls which are not protected by rcu_read_lock and result in following warnings during syscall fuzzing(trinity): [ 471.646379] =============================== [ 471.649727] [ INFO: suspicious RCU usage. ] [ 471.653919] 3.11.0-next-20130906+ #496 Not tainted [ 471.657792] ------------------------------- [ 471.661235] fs/aio.c:503 suspicious rcu_dereference_check() usage! [ 471.665968] [ 471.665968] other info that might help us debug this: [ 471.665968] [ 471.672141] [ 471.672141] rcu_scheduler_active = 1, debug_locks = 1 [ 471.677549] 1 lock held by trinity-child0/3774: [ 471.681675] #0: (&(&mm->ioctx_lock)->rlock){+.+...}, at: [<c119ba1a>] SyS_io_setup+0x63a/0xc70 [ 471.688721] [ 471.688721] stack backtrace: [ 471.692488] CPU: 1 PID: 3774 Comm: trinity-child0 Not tainted 3.11.0-next-20130906+ #496 [ 471.698437] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 471.703151] 00000000 00000000 c58bbf30 c18a814b de2234c0 c58bbf58 c10a4ec6 c1b0d824 [ 471.709544] c1b0f60e 00000001 00000001 c1af61b0 00000000 cb670ac0 c3aca000 c58bbfac [ 471.716251] c119bc7c 00000002 00000001 00000000 c119b8dd 00000000 c10cf684 c58bbfb4 [ 471.722902] Call Trace: [ 471.724859] [<c18a814b>] dump_stack+0x4b/0x66 [ 471.728772] [<c10a4ec6>] lockdep_rcu_suspicious+0xc6/0x100 [ 471.733716] [<c119bc7c>] SyS_io_setup+0x89c/0xc70 [ 471.737806] [<c119b8dd>] ? SyS_io_setup+0x4fd/0xc70 [ 471.741689] [<c10cf684>] ? __audit_syscall_entry+0x94/0xe0 [ 471.746080] [<c18b1fcc>] syscall_call+0x7/0xb [ 471.749723] [<c1080000>] ? task_fork_fair+0x240/0x260 Signed-off-by: Artem Savkov <artem.savkov@gmail.com> Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>	2013-09-09 12:29:35 -04:00
Linus Torvalds	bf97293eb8	NFS client updates for Linux 3.12 Highlights include: - Fix NFSv4 recovery so that it doesn't recover lost locks in cases such as lease loss due to a network partition, where doing so may result in data corruption. Add a kernel parameter to control choice of legacy behaviour or not. - Performance improvements when 2 processes are writing to the same file. - Flush data to disk when an RPCSEC_GSS session timeout is imminent. - Implement NFSv4.1 SP4_MACH_CRED state protection to prevent other NFS clients from being able to manipulate our lease and file lockingr state. - Allow sharing of RPCSEC_GSS caches between different rpc clients - Fix the broken NFSv4 security auto-negotiation between client and server - Fix rmdir() to wait for outstanding sillyrename unlinks to complete - Add a tracepoint framework for debugging NFSv4 state recovery issues. - Add tracing to the generic NFS layer. - Add tracing for the SUNRPC socket connection state. - Clean up the rpc_pipefs mount/umount event management. - Merge more patches from Chuck in preparation for NFSv4 migration support. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.14 (GNU/Linux) iQIcBAABAgAGBQJSLelVAAoJEGcL54qWCgDyo2IQAKOfRJyZVnf4ipxi3xLNl1QF w/70DVSIF1S1djWN7G3vgkxj/R8KCvJ8CcvkAD2BEgRDeZJ9TtyKAdM/jYLZ+W05 7k2QKk8fkwZmc1Y2qDqFwKHzP5ZgP5L2nGx7FNhi/99wEAe47yFG3qd3rUWKrcOf mnd863zgGDE2Q10slhoq/bywwMJo6tKZNeaIE8kPjgFbBEh/jslpAWr8dSA4QgvJ nZ8VB5XU8L+XJ0GpHHdjYm9LvQ51DbQ6omOF+0P4fI093azKmf4ZsrjMDWT8+iu3 XkXlnQmKLGTi7yB43hHtn2NiRqwGzCcZ1Amo9PpCFaHUt1RP9cc37UhG1T+x1xWJ STEKDbvCdQ3FU9FvbgrGEwBR0e8fNS4fZY3ToDBflIcfwre0aWs5RCodZMUD0nUI 4wY5J9NsQR/bL+v8KeUR4V4cXK8YrgL0zB4u4WYzH5Npxr5KD0NEKDNqRPhrB9l2 LLF9Haql8j76Ff0ek6UGFIZjDE0h6Fs71wLBpLj+ZWArOJ7vBuLMBSOVqNpld9+9 f2fEG7qoGF4FGTY4myH/eakMPaWnk9Ol4Ls/svSIapJ9+rePD+a93e/qnmdofIMf 4TuEYk6ERib1qXgaeDRQuCsm2YE1Co5skGMaOsRFWgReE1c12QoJQVst2nMtEKp3 uV2w8LgX18aZOZXJVkCM =ZuW+ -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client updates from Trond Myklebust: "Highlights include: - Fix NFSv4 recovery so that it doesn't recover lost locks in cases such as lease loss due to a network partition, where doing so may result in data corruption. Add a kernel parameter to control choice of legacy behaviour or not. - Performance improvements when 2 processes are writing to the same file. - Flush data to disk when an RPCSEC_GSS session timeout is imminent. - Implement NFSv4.1 SP4_MACH_CRED state protection to prevent other NFS clients from being able to manipulate our lease and file locking state. - Allow sharing of RPCSEC_GSS caches between different rpc clients. - Fix the broken NFSv4 security auto-negotiation between client and server. - Fix rmdir() to wait for outstanding sillyrename unlinks to complete - Add a tracepoint framework for debugging NFSv4 state recovery issues. - Add tracing to the generic NFS layer. - Add tracing for the SUNRPC socket connection state. - Clean up the rpc_pipefs mount/umount event management. - Merge more patches from Chuck in preparation for NFSv4 migration support" * tag 'nfs-for-3.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (107 commits) NFSv4: use mach cred for SECINFO_NO_NAME w/ integrity NFS: nfs_compare_super shouldn't check the auth flavour unless 'sec=' was set NFSv4: Allow security autonegotiation for submounts NFSv4: Disallow security negotiation for lookups when 'sec=' is specified NFSv4: Fix security auto-negotiation NFS: Clean up nfs_parse_security_flavors() NFS: Clean up the auth flavour array mess NFSv4.1 Use MDS auth flavor for data server connection NFS: Don't check lock owner compatability unless file is locked (part 2) NFS: Don't check lock owner compatibility in writes unless file is locked nfs4: Map NFS4ERR_WRONG_CRED to EPERM nfs4.1: Add SP4_MACH_CRED write and commit support nfs4.1: Add SP4_MACH_CRED stateid support nfs4.1: Add SP4_MACH_CRED secinfo support nfs4.1: Add SP4_MACH_CRED cleanup support nfs4.1: Add state protection handler nfs4.1: Minimal SP4_MACH_CRED implementation SUNRPC: Replace pointer values with task->tk_pid and rpc_clnt->cl_clid SUNRPC: Add an identifier for struct rpc_clnt SUNRPC: Ensure rpc_task->tk_pid is available for tracepoints ...	2013-09-09 09:19:15 -07:00
Linus Torvalds	16d70e1529	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse bugfixes from Miklos Szeredi: "Just a bunch of bugfixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: use list_for_each_entry() for list traversing fuse: readdir: check for slash in names fuse: hotfix truncate_pagecache() issue fuse: invalidate inode attributes on xattr modification fuse: postpone end_page_writeback() in fuse_writepage_locked()	2013-09-09 09:18:23 -07:00
Linus Torvalds	6c337ad6cc	This is possibly the smallest ever set of GFS2 patches for a merge window. Also, most of them are bug fixes this time. Two of my three patches (moving gfs2_sync_meta and merging the two writepage implementations) are clean ups with the third (taking the glock ref in examine_bucket) being a fix for a difficult to hit race condition. The removal of an unused memory barrier is a clean up from Bob Peterson, and the "spectator" relates to a rarely used mount option. Ben Marzinski's patch fixes a corner case where the incorrect inode flags were being set, resulting in incorrect behaviour on fsync. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJSLYWjAAoJEMrg3m4a/8jSHBUQAIBXyrLypCJNJNpYreRlg4Te YZOqXMxqrfsWD5jkfWvxV/vZyVu6FEJRJRRwKoO8foZhnVIy2HiGdpFvwk4NMzGs Ah5cCaySgDhFIKXNq1CnhVZNpRU8N+Lf+8U2MwkpPUrT+KnZlDdG8COuHbWJ3/+t prqHy0N1oa8hgj3P0oDf3kyQKiB2MQVhORTBVOWaas1mzw8vsKRjvO7k5g0VPcfV VnIYzvQ033V6pW6ymWGcbz6us7hXeFJmXo4grAdLIclK6QDt1zPLVKlvk7X/Me52 PTxO5AP/Nw6AlJABTNy8UZ3uJO3QaiqhcIz+OzIXlYpsICqma30qkHHrWzOh0Q5F HrDqh03A2zuqigamVmsJr2+tbdLWxz72D6N3eFlIBLAw2JxILosCPcGc4UNzG/7g mFr6+LDnZXzFpjT6OBdepjVH06ZqfcV+nr3KhvmKQT+tkzTizArbt8JutOdKjahH 3+pXx/bA0B7LRVYymK5/xrfbjnQp629GG3QnGiI23gtDZ6eEzm663bQz6MTgjR88 H0nigLi8d2KmM8UnrZqq5nI+LirFD/IZ5DO47Vv6w3NCZW/ZpNOodlWkd1igvhHR ugTvKqyw2O9hImM8jPoRWbqL01RGTKTMMn/3cCCO8Uww7DpI4GLb48a39buYh4+2 kijIxCtntfQAOiR6MCvA =rsQv -----END PGP SIGNATURE----- Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw Pull GFS2 updates from Steven Whitehouse: "This is possibly the smallest ever set of GFS2 patches for a merge window. Also, most of them are bug fixes this time. Two of my three patches (moving gfs2_sync_meta and merging the two writepage implementations) are clean ups with the third (taking the glock ref in examine_bucket) being a fix for a difficult to hit race condition. The removal of an unused memory barrier is a clean up from Bob Peterson, and the "spectator" relates to a rarely used mount option. Ben Marzinski's patch fixes a corner case where the incorrect inode flags were being set, resulting in incorrect behaviour on fsync" * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: GFS2: dirty inode correctly in gfs2_write_end GFS2: Don't flag consistency error if first mounter is a spectator GFS2: Remove unnecessary memory barrier GFS2: Merge ordered and writeback writepage GFS2: Take glock reference in examine_bucket() GFS2: Move gfs2_sync_meta to lops.c	2013-09-09 09:16:51 -07:00
Linus Torvalds	6cccc7d301	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph updates from Sage Weil: "This includes both the first pile of Ceph patches (which I sent to torvalds@vger, sigh) and a few new patches that add support for fscache for Ceph. That includes a few fscache core fixes that David Howells asked go through the Ceph tree. (Thanks go to Milosz Tanski for putting this feature together) This first batch of patches (included here) had (has) several important RBD bug fixes, hole punch support, several different cleanups in the page cache interactions, improvements in the truncate code (new truncate mutex to avoid shenanigans with i_mutex), and a series of fixes in the synchronous striping read/write code. On top of that is a random collection of small fixes all across the tree (error code checks and error path cleanup, obsolete wq flags, etc)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (43 commits) ceph: use d_invalidate() to invalidate aliases ceph: remove ceph_lookup_inode() ceph: trivial buildbot warnings fix ceph: Do not do invalidate if the filesystem is mounted nofsc ceph: page still marked private_2 ceph: ceph_readpage_to_fscache didn't check if marked ceph: clean PgPrivate2 on returning from readpages ceph: use fscache as a local presisent cache fscache: Netfs function for cleanup post readpages FS-Cache: Fix heading in documentation CacheFiles: Implement interface to check cache consistency FS-Cache: Add interface to check consistency of a cached object rbd: fix null dereference in dout rbd: fix buffer size for writes to images with snapshots libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc rbd: fix I/O error propagation for reads ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem ceph: allow sync_read/write return partial successed size of read/write. ceph: fix bugs about handling short-read for sync read mode. ceph: remove useless variable revoked_rdcache ...	2013-09-09 09:13:22 -07:00
Linus Torvalds	20e029d791	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: "This pile contains mostly fixes and improvements for issues identified by Richard W M Jones while adding UML as backend to libguestfs" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: Add irq chip um/mask handlers um: prctl: Do not include linux/ptrace.h um: Run UML in it's own session. um: Cleanup SIGTERM handling um: ubd: Introduce submit_request() um: ubd: Add REQ_FLUSH suppport um: Implement probe_kernel_read() um: hostfs: Fix writeback	2013-09-09 09:03:46 -07:00
Benjamin LaHaise	d6c355c7da	aio: fix race in ring buffer page lookup introduced by page migration support Prior to the introduction of page migration support in "fs/aio: Add support to aio ring pages migration" / `36bc08cc01`, mapping of the ring buffer pages was done via get_user_pages() while retaining mmap_sem held for write. This avoided possible races with userland racing an munmap() or mremap(). The page migration patch, however, switched to using mm_populate() to prime the page mapping. mm_populate() cannot be called with mmap_sem held. Instead of dropping the mmap_sem, revert to the old behaviour and simply drop the use of mm_populate() since get_user_pages() will cause the pages to get mapped anyways. Thanks to Al Viro for spotting this issue. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>	2013-09-09 11:57:59 -04:00
Ian Kent	ac83871996	autofs4 - fix device ioctl mount lookup When reconnecting to automounts at startup an autofs ioctl is used to find the device and inode of existing mounts so they can be used to open a file descriptor of possibly covered mounts. At this time the the caller might not yet "own" the mount so it can trigger calling ->d_automount(). This causes automount to hang when trying to reconnect to direct or offset mount types. Consequently kern_path() can't be used but kern_path_mountpoint() can be. Signed-off-by: Ian Kent <raven@themaw.net> Cc: Jeff Layton <jlayton@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-08 22:07:47 -04:00
Linus Torvalds	e5c832d555	vfs: fix dentry RCU to refcounting possibly sleeping dput() This is the fix that the last two commits indirectly led up to - making sure that we don't call dput() in a bad context on the dentries we've looked up in RCU mode after the sequence count validation fails. This basically expands d_rcu_to_refcount() into the callers, and then fixes the callers to delay the dput() in the failure case until _after_ we've dropped all locks and are no longer in an RCU-locked region. The case of 'complete_walk()' was trivial, since its failure case did the unlock_rcu_walk() directly after the call to d_rcu_to_refcount(), and as such that is just a pure expansion of the function with a trivial movement of the resulting dput() to after 'unlock_rcu_walk()'. In contrast, the unlazy_walk() case was much more complicated, because not only does convert two different dentries from RCU to be reference counted, but it used to not call unlock_rcu_walk() at all, and instead just returned an error and let the caller clean everything up in "terminate_walk()". Happily, one of the dentries in question (called "parent" inside unlazy_walk()) is the dentry of "nd->path", which terminate_walk() wants a refcount to anyway for the non-RCU case. So what the new and improved unlazy_walk() does is to first turn that dentry into a refcounted one, and once that is set up, the error cases can continue to use the terminate_walk() helper for cleanup, but for the non-RCU case. Which makes it possible to drop out of RCU mode if we actually hit the sequence number failure case. Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-08 18:13:49 -07:00
Al Viro	2d86465101	introduce kern_path_mountpoint() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-08 20:20:23 -04:00
Al Viro	197df04c74	rename user_path_umountat() to user_path_mountpoint_at() ... and move the extern from linux/namei.h to fs/internal.h, along with that of vfs_path_lookup(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-08 20:20:21 -04:00
Al Viro	35759521ee	take unlazy_walk() into umount_lookup_last() ... and massage it a bit to reduce nesting Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-08 20:20:19 -04:00
Pavel Shilovsky	18cceb6a78	CIFS: Replace clientCanCache* bools with an integer that prepare the code to handle different types of SMB2 leases. Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 17:49:17 -05:00
Linus Torvalds	0d98439ea3	vfs: use lockred "dead" flag to mark unrecoverably dead dentries This simplifies the RCU to refcounting code in particular. I was originally intending to leave this for later, but walking through all the dput() logic (see previous commit), I realized that the dput() "might_sleep()" check was misleadingly weak. And I removed it as misleading, both for performance profiling and for debugging. However, the might_sleep() debugging case is actually true: the final dput() can indeed sleep, if the inode of the dentry that you are releasing ends up sleeping at iput time (see dentry_iput()). So the problem with the might_sleep() in dput() wasn't that it wasn't true, it was that it wasn't actually testing and triggering on the interesting case. In particular, just about any dput() can indeed sleep, if you happen to race with another thread deleting the file in question, and you then lose the race to the be the last dput() for that file. But because it's a very rare race, the debugging code would never trigger it in practice. Why is this problematic? The new d_rcu_to_refcount() (see commit `15570086b5`: "vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock()") does a dput() for the failure case, and it does it under the RCU lock. So potentially sleeping really is a bug. But there's no way I'm going to fix this with the previous complicated "lockref_get_or_lock()" interface. And rather than revert to the old and crufty nested dentry locking code (which did get this right by delaying the reference count updates until they were verified to be safe), let's make forward progress. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-08 13:46:52 -07:00
Linus Torvalds	8aab6a2733	vfs: reorganize dput() memory accesses This is me being a bit OCD after all the dentry optimization work this merge window: profiles end up showing 'dput()' as a rather expensive operation, and there were two unrelated bad reasons for that. The first reason was reading d_lockref.count for debugging purposes, which touches the lockref cacheline (for reads) before really need to. More importantly, the debugging test in question is _wrong_, and has hidden bugs. It's true that we can only sleep when the count goes down to zero, but the test as-is hides the much more subtle bug that happens if we race with somebody else deleting the file. Anyway we _will_ touch that cacheline, but let's do it for a write and in the right routine (ie in "lockref_put_or_lock()") which annotates the costs better. So remove the misleading debug code. The other was an unnecessary access to the cacheline that contains the d_lru list, just to check whether we already were on the LRU list or not. This is exactly what we have d_flags for, so that we can avoid touching extra cache lines for the common case. So just add another bit for "is this dentry on the LRU". Finally, mark the tests properly likely/unlikely, so that the common fast-paths are dense in the instruction stream. This makes the profiles look much saner. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-08 13:26:18 -07:00
Steve French	77993be3f3	[CIFS] quiet sparse compile warning Jeff's patchset introduced trivial sparse warning on new cifs toupper routine Signed-off-by: Steve French <smfrench@gmail.com> CC: Jeff Layton <jlayton@redhat.com>	2013-09-08 14:54:24 -05:00
Shirish Pargaonkar	32811d242f	cifs: Start using per session key for smb2/3 for signature generation Switch smb2 code to use per session session key and smb3 code to use per session signing key instead of per connection key to generate signatures. For that, we need to find a session to fetch the session key to generate signature to match for every request and response packet. We also forgo checking signature for a session setup response from the server. Acked-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:47:50 -05:00
Shirish Pargaonkar	5c234aa5e3	cifs: Add a variable specific to NTLMSSP for key exchange. Add a variable specific to NTLMSSP authentication to determine whether to exchange keys during negotiation and authentication phases. Since session key for smb1 is per smb connection, once a very first sesion is established, there is no need for key exchange during subsequent session setups. As a result, smb1 session setup code sets this variable as false. Since session key for smb2 and smb3 is per smb connection, we need to exchange keys to generate session key for every sesion being established. As a result, smb2/3 session setup code sets this variable as true. Acked-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:47:49 -05:00
Shirish Pargaonkar	d4e63bd6e4	cifs: Process post session setup code in respective dialect functions. Move the post (successful) session setup code to respective dialect routines. For smb1, session key is per smb connection. For smb2/smb3, session key is per smb session. If client and server do not require signing, free session key for smb1/2/3. If client and server require signing smb1 - Copy (kmemdup) session key for the first session to connection. Free session key of that and subsequent sessions on this connection. smb2 - For every session, keep the session key and free it when the session is being shutdown. smb3 - For every session, generate the smb3 signing key using the session key and then free the session key. There are two unrelated line formatting changes as well. Reviewed-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:47:47 -05:00
Wei Yongjun	31f92e9a87	CIFS: convert to use le32_add_cpu() Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu(). Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:47:43 -05:00
Pavel Shilovsky	933d4b3657	CIFS: Fix missing lease break If a server sends a lease break to a connection that doesn't have opens with a lease key specified in the server response, we can't find an open file to send an ack. Fix this by walking through all connections we have. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:41:43 -05:00
Pavel Shilovsky	1a05096de8	CIFS: Fix a memory leak when a lease break comes This happens when we receive a lease break from a server, then find an appropriate lease key in opened files and schedule the oplock_break slow work. lw pointer isn't freed in this case. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:41:40 -05:00
Jeff Layton	ec71e0e159	cifs: convert case-insensitive dentry ops to use new case conversion routines Have the case-insensitive d_compare and d_hash routines convert each character in the filenames to wchar_t's and then use the new cifs_toupper routine to convert those into uppercase. With this scheme we should more closely emulate the case conversion that the servers will do. Reported-and-Tested-by: Jan-Marek Glogowski <glogow@fbihome.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:38:08 -05:00
Jeff Layton	c2ccf53dd0	cifs: add new case-insensitive conversion routines that are based on wchar_t's The existing NLS case conversion routines do not appropriately handle the (now common) case where the local host is using UTF8. This is because nls_utf8 has no support at all for converting a utf8 string between cases and the NLS infrastructure in general cannot handle a multibyte input character. In any case, what we really need for cifs is to emulate how we expect the server to convert the character to upper or lowercase. Thus, even if we had routines that could handle utf8 case conversion, we likely would end up with the wrong result if the name ends up being in the upper planes. This patch adds a new scheme for doing unicode case conversion. The case conversion tables that Microsoft has published for Windows 8 have been converted to a set of lookup tables, and a routine is added to convert a wchar_t from lower to uppercase using those tables. Reported-and-Tested-by: Jan-Marek Glogowski <glogow@fbihome.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:38:05 -05:00
Scott Lovenberg	cdf1246ffb	cifs: Move and expand MAX_SERVER_SIZE definition MAX_SERVER_SIZE has been moved to cifs_mount.h and renamed CIFS_NI_MAXHOST for clarity. It has been expanded to 1024 as the previous value of 16 was very short. Signed-off-by: Scott Lovenberg <scott.lovenberg@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:34:22 -05:00
Scott Lovenberg	8c3a2b4c42	cifs: Move string length definitions to uapi The max string length definitions for user name, domain name, password, and share name have been moved into their own header file in uapi so the mount helper can use autoconf to define them instead of keeping the kernel side and userland side definitions in sync manually. The names have also been standardized with a "CIFS" prefix and "LEN" suffix. Signed-off-by: Scott Lovenberg <scott.lovenberg@gmail.com> Reviewed-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:34:11 -05:00
Pavel Shilovsky	d244bf2dfb	CIFS: Implement follow_link for nounix CIFS mounts by using a query reparse ioctl request. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:27:41 -05:00
Pavel Shilovsky	b42bf88828	CIFS: Implement follow_link for SMB2 that allows to access files through symlink created on a server. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:27:34 -05:00
Jeff Layton	3ae35cde67	cifs: display iocharset= option in /proc/mounts ...but only if it's not the default charset. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:24:30 -05:00
Jeff Layton	30706a5454	cifs: create a new Documentation/ directory and move docfiles into it Currently, we have a number of documentation files that live under fs/cifs/. Generally, these don't get picked up by distro packagers, since they're in a non-standard location. Move them to a new spot under Documentation/ instead. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:24:10 -05:00
Jeff Layton	73e216a8a4	cifs: ensure that srv_mutex is held when dealing with ssocket pointer Oleksii reported that he had seen an oops similar to this: BUG: unable to handle kernel NULL pointer dereference at 0000000000000088 IP: [<ffffffff814dcc13>] sock_sendmsg+0x93/0xd0 PGD 0 Oops: 0000 [#1] PREEMPT SMP Modules linked in: ipt_MASQUERADE xt_REDIRECT xt_tcpudp iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables carl9170 ath usb_storage f2fs nfnetlink_log nfnetlink md4 cifs dns_resolver hid_generic usbhid hid af_packet uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev rfcomm btusb bnep bluetooth qmi_wwan qcserial cdc_wdm usb_wwan usbnet usbserial mii snd_hda_codec_hdmi snd_hda_codec_realtek iwldvm mac80211 coretemp intel_powerclamp kvm_intel kvm iwlwifi snd_hda_intel cfg80211 snd_hda_codec xhci_hcd e1000e ehci_pci snd_hwdep sdhci_pci snd_pcm ehci_hcd microcode psmouse sdhci thinkpad_acpi mmc_core i2c_i801 pcspkr usbcore hwmon snd_timer snd_page_alloc snd ptp rfkill pps_core soundcore evdev usb_common vboxnetflt(O) vboxdrv(O)Oops#2 Part8 loop tun binfmt_misc fuse msr acpi_call(O) ipv6 autofs4 CPU: 0 PID: 21612 Comm: kworker/0:1 Tainted: G W O 3.10.1SIGN #28 Hardware name: LENOVO 2306CTO/2306CTO, BIOS G2ET92WW (2.52 ) 02/22/2013 Workqueue: cifsiod cifs_echo_request [cifs] task: ffff8801e1f416f0 ti: ffff880148744000 task.ti: ffff880148744000 RIP: 0010:[<ffffffff814dcc13>] [<ffffffff814dcc13>] sock_sendmsg+0x93/0xd0 RSP: 0000:ffff880148745b00 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff880148745b78 RCX: 0000000000000048 RDX: ffff880148745c90 RSI: ffff880181864a00 RDI: ffff880148745b78 RBP: ffff880148745c48 R08: 0000000000000048 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880181864a00 R13: ffff880148745c90 R14: 0000000000000048 R15: 0000000000000048 FS: 0000000000000000(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000088 CR3: 000000020c42c000 CR4: 00000000001407b0 Oops#2 Part7 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff880148745b30 ffffffff810c4af9 0000004848745b30 ffff880181864a00 ffffffff81ffbc40 0000000000000000 ffff880148745c90 ffffffff810a5aab ffff880148745bc0 ffffffff81ffbc40 ffff880148745b60 ffffffff815a9fb8 Call Trace: [<ffffffff810c4af9>] ? finish_task_switch+0x49/0xe0 [<ffffffff810a5aab>] ? lock_timer_base.isra.36+0x2b/0x50 [<ffffffff815a9fb8>] ? _raw_spin_unlock_irqrestore+0x18/0x40 [<ffffffff810a673f>] ? try_to_del_timer_sync+0x4f/0x70 [<ffffffff815aa38f>] ? _raw_spin_unlock_bh+0x1f/0x30 [<ffffffff814dcc87>] kernel_sendmsg+0x37/0x50 [<ffffffffa081a0e0>] smb_send_kvec+0xd0/0x1d0 [cifs] [<ffffffffa081a263>] smb_send_rqst+0x83/0x1f0 [cifs] [<ffffffffa081ab6c>] cifs_call_async+0xec/0x1b0 [cifs] [<ffffffffa08245e0>] ? free_rsp_buf+0x40/0x40 [cifs] Oops#2 Part6 [<ffffffffa082606e>] SMB2_echo+0x8e/0xb0 [cifs] [<ffffffffa0808789>] cifs_echo_request+0x79/0xa0 [cifs] [<ffffffff810b45b3>] process_one_work+0x173/0x4a0 [<ffffffff810b52a1>] worker_thread+0x121/0x3a0 [<ffffffff810b5180>] ? manage_workers.isra.27+0x2b0/0x2b0 [<ffffffff810bae00>] kthread+0xc0/0xd0 [<ffffffff810bad40>] ? kthread_create_on_node+0x120/0x120 [<ffffffff815b199c>] ret_from_fork+0x7c/0xb0 [<ffffffff810bad40>] ? kthread_create_on_node+0x120/0x120 Code: 84 24 b8 00 00 00 4c 89 f1 4c 89 ea 4c 89 e6 48 89 df 4c 89 60 18 48 c7 40 28 00 00 00 00 4c 89 68 30 44 89 70 14 49 8b 44 24 28 <ff> 90 88 00 00 00 3d ef fd ff ff 74 10 48 8d 65 e0 5b 41 5c 41 RIP [<ffffffff814dcc13>] sock_sendmsg+0x93/0xd0 RSP <ffff880148745b00> CR2: 0000000000000088 The client was in the middle of trying to send a frame when the server->ssocket pointer got zeroed out. In most places, that we access that pointer, the srv_mutex is held. There's only one spot that I see that the server->ssocket pointer gets set and the srv_mutex isn't held. This patch corrects that. The upstream bug report was here: https://bugzilla.kernel.org/show_bug.cgi?id=60557 Cc: <stable@vger.kernel.org> Reported-by: Oleksii Shevchuk <alxchk@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-08 14:24:07 -05:00
Al Viro	d040790391	prune_super(): sb->s_op is never NULL Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-07 19:54:56 -04:00
Al Viro	dfc59e2c90	exportfs: don't assume that ->iterate() won't feed us too long entries On some filesystems it's impossible even with fs corruption, but we'd better not rely on that, what with memcpy() into on-stack array we are doing there. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-07 19:54:55 -04:00
Al Viro	5d8943b04b	afs: get rid of redundant ->d_name.len checks No dentry can get to directory modification methods without having passed either ->lookup() or ->atomic_open(); if name is rejected by those two (or by ->d_hash()) with an error, it won't be seen by anything else. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-07 19:54:55 -04:00
Weston Andros Adamson	b1b3e13694	NFSv4: use mach cred for SECINFO_NO_NAME w/ integrity Commit `97431204ea` introduced a regression that causes SECINFO_NO_NAME to fail without sending an RPC if: 1) the nfs_client's rpc_client is using krb5i/p (now tried by default) 2) the current user doesn't have valid kerberos credentials This situation is quite common - as of now a sec=sys mount would use krb5i for the nfs_client's rpc_client and a user would hardly be faulted for not having run kinit. The solution is to use the machine cred when trying to use an integrity protected auth flavor for SECINFO_NO_NAME. Older servers may not support using the machine cred or an integrity protected auth flavor for SECINFO_NO_NAME in every circumstance, so we fall back to using the user's cred and the filesystem's auth flavor in this case. We run into another problem when running against linux nfs servers - they return NFS4ERR_WRONGSEC when using integrity auth flavor (unless the mount is also that flavor) even though that is not a valid error for SECINFO*. Even though it's against spec, handle WRONGSEC errors on SECINFO_NO_NAME by falling back to using the user cred and the filesystem's auth flavor. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-07 18:39:25 -04:00
Trond Myklebust	0aea92bf67	NFS: nfs_compare_super shouldn't check the auth flavour unless 'sec=' was set Also don't worry about obsolete mount flags... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-07 18:38:04 -04:00
Trond Myklebust	47040da3c7	NFSv4: Allow security autonegotiation for submounts In cases where the parent super block was not mounted with a 'sec=' line, allow autonegotiation of security for the submounts. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-07 17:52:42 -04:00
Trond Myklebust	41d058c3ba	NFSv4: Disallow security negotiation for lookups when 'sec=' is specified Ensure that nfs4_proc_lookup_common respects the NFS_MOUNT_SECFLAVOUR flag. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-07 17:52:42 -04:00
Linus Torvalds	dc0755cdb1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 2 (of many) from Al Viro: "Mostly Miklos' series this time" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: constify dcache.c inlined helpers where possible fuse: drop dentry on failed revalidate fuse: clean up return in fuse_dentry_revalidate() fuse: use d_materialise_unique() sysfs: use check_submounts_and_drop() nfs: use check_submounts_and_drop() gfs2: use check_submounts_and_drop() afs: use check_submounts_and_drop() vfs: check unlinked ancestors before mount vfs: check submounts and drop atomically vfs: add d_walk() vfs: restructure d_genocide()	2013-09-07 14:36:57 -07:00
Linus Torvalds	c7c4591db6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace changes from Eric Biederman: "This is an assorted mishmash of small cleanups, enhancements and bug fixes. The major theme is user namespace mount restrictions. nsown_capable is killed as it encourages not thinking about details that need to be considered. A very hard to hit pid namespace exiting bug was finally tracked and fixed. A couple of cleanups to the basic namespace infrastructure. Finally there is an enhancement that makes per user namespace capabilities usable as capabilities, and an enhancement that allows the per userns root to nice other processes in the user namespace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Kill nsown_capable it makes the wrong thing easy capabilities: allow nice if we are privileged pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD userns: Allow PR_CAPBSET_DROP in a user namespace. namespaces: Simplify copy_namespaces so it is clear what is going on. pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup sysfs: Restrict mounting sysfs userns: Better restrictions on when proc and sysfs can be mounted vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces kernel/nsproxy.c: Improving a snippet of code. proc: Restrict mounting the proc filesystem vfs: Lock in place mounts from more privileged users	2013-09-07 14:35:32 -07:00
Linus Torvalds	11c7b03d42	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "Nothing major for this kernel, just maintenance updates" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (21 commits) apparmor: add the ability to report a sha1 hash of loaded policy apparmor: export set of capabilities supported by the apparmor module apparmor: add the profile introspection file to interface apparmor: add an optional profile attachment string for profiles apparmor: add interface files for profiles and namespaces apparmor: allow setting any profile into the unconfined state apparmor: make free_profile available outside of policy.c apparmor: rework namespace free path apparmor: update how unconfined is handled apparmor: change how profile replacement update is done apparmor: convert profile lists to RCU based locking apparmor: provide base for multiple profiles to be replaced at once apparmor: add a features/policy dir to interface apparmor: enable users to query whether apparmor is enabled apparmor: remove minimum size check for vmalloc() Smack: parse multiple rules per write to load2, up to PAGE_SIZE-1 bytes Smack: network label match fix security: smack: add a hash table to quicken smk_find_entry() security: smack: fix memleak in smk_write_rules_list() xattr: Constify ->name member of "struct xattr". ...	2013-09-07 14:34:07 -07:00
Trond Myklebust	5e6b19901b	NFSv4: Fix security auto-negotiation NFSv4 security auto-negotiation has been broken since commit `4580a92d44` (NFS: Use server-recommended security flavor by default (NFSv3)) because nfs4_try_mount() will automatically select AUTH_SYS if it sees no auth flavours. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com>	2013-09-07 16:18:30 -04:00
Trond Myklebust	19e7b8d240	NFS: Clean up nfs_parse_security_flavors() Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-07 16:12:45 -04:00
Trond Myklebust	74c9881162	NFS: Clean up the auth flavour array mess What is the point of having a 'auth_flavor_len' field, if it is always set to 1, and can't be used to determine if the user has selected an auth flavour? This cleanup goes back to using auth_flavor_len for its original intended purpose, and gets rid of the ad-hoc replacements. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-07 16:11:24 -04:00
Richard Weinberger	65984ff9d2	um: hostfs: Fix writeback We have to implement ->release() and trigger writeback from it. Otherwise we might lose dirty pages at munmap(). Signed-off-by: Richard Weinberger <richard@nod.at>	2013-09-07 10:38:29 +02:00
Kees Cook	cb69f36ba1	ecryptfs: avoid ctx initialization race It might be possible for two callers to race the mutex lock after the NULL ctx check. Instead, move the lock above the check so there isn't the possibility of leaking a crypto ctx. Additionally, report the full algo name when failing. Signed-off-by: Kees Cook <keescook@chromium.org> [tyhicks: remove out label, which is no longer used] Signed-off-by: Tyler Hicks <tyhicks@canonical.com>	2013-09-06 16:58:18 -07:00
Dan Carpenter	e6cbd6a44d	ecryptfs: remove check for if an array is NULL It doesn't make sense to check if an array is NULL. The compiler just removes the check. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tyler Hicks <tyhicks@canonical.com>	2013-09-06 16:51:56 -07:00
Yan, Zheng	a8d436f015	ceph: use d_invalidate() to invalidate aliases d_invalidate() is the standard VFS method to invalidate dentry. compare to d_delete(), it also try shrinking children dentries. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-09-06 12:55:29 -07:00
Yan, Zheng	ed284c49f6	ceph: remove ceph_lookup_inode() commit `6f60f889` (ceph: fix freeing inode vs removing session caps race) introduced ceph_lookup_inode(). But there is already a ceph_find_inode() which provides similar function. So remove ceph_lookup_inode(), use ceph_find_inode() instead. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <alex.elder@linary.org> Reviewed-by: Sage Weil <sage@inktank.com>	2013-09-06 12:55:09 -07:00
Andy Adamson	0e20162ed1	NFSv4.1 Use MDS auth flavor for data server connection Commit `4edaa308` "NFS: Use "krb5i" to establish NFSv4 state whenever possible" uses the nfs_client cl_rpcclient for all state management operations, and will use krb5i or auth_sys with no regard to the mount command authflavor choice. The MDS, as any NFSv4.1 mount point, uses the nfs_server rpc client for all non-state management operations with a different nfs_server for each fsid encountered traversing the mount point, each with a potentially different auth flavor. pNFS data servers are not mounted in the normal sense as there is no associated nfs_server structure. Data servers can also export multiple fsids, each with a potentially different auth flavor. Data servers need to use the same authflavor as the MDS server rpc client for non-state management operations. Populate a list of rpc clients with the MDS server rpc client auth flavor for the DS to use. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-06 14:49:16 -04:00
Milosz Tanski	971f0bdeaa	ceph: trivial buildbot warnings fix The linux-next build bot found a three of warnings, this addresses all of them. * non-ANSI function declaration of function 'ceph_fscache_register' and 'ceph_fscache_unregister' * symbol 'ceph_cache_netfs' was not declared, now it's extern in the header. * warning: "pr_fmt" redefined Signed-off-by: Milosz Tanski <milosz@adfin.com>	2013-09-06 16:50:12 +00:00
Milosz Tanski	e81568eb18	ceph: Do not do invalidate if the filesystem is mounted nofsc Previously we would always try to enqueue work even if the filesystem is not mounted with fscache enabled (or the file has no cookie). In the case of the filesystem mouned nofsc (but with fscache compiled in) this would lead to a crash. Signed-off-by: Milosz Tanski <milosz@adfin.com>	2013-09-06 16:50:12 +00:00
Milosz Tanski	d4d3aa38d6	ceph: page still marked private_2 Previous patch that allowed us to cleanup most of the issues with pages marked as private_2 when calling ceph_readpages. However, there seams to be a case in the error case clean up in start read that still trigers this from time to time. I've only seen this one a couple times. BUG: Bad page state in process petabucket pfn:335b82 page:ffffea000cd6e080 count:0 mapcount:0 mapping: (null) index:0x0 page flags: 0x200000000001000(private_2) Call Trace: [<ffffffff81563442>] dump_stack+0x46/0x58 [<ffffffff8112c7f7>] bad_page+0xc7/0x120 [<ffffffff8112cd9e>] free_pages_prepare+0x10e/0x120 [<ffffffff8112e580>] free_hot_cold_page+0x40/0x160 [<ffffffff81132427>] __put_single_page+0x27/0x30 [<ffffffff81132d95>] put_page+0x25/0x40 [<ffffffffa02cb409>] ceph_readpages+0x2e9/0x6f0 [ceph] [<ffffffff811313cf>] __do_page_cache_readahead+0x1af/0x260 Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: Sage Weil <sage@inktank.com>	2013-09-06 16:50:12 +00:00
Milosz Tanski	9b8dd1e8a5	ceph: ceph_readpage_to_fscache didn't check if marked Previously ceph_readpage_to_fscache did not call if page was marked as cached before calling fscache_write_page resulting in a BUG inside of fscache. FS-Cache: Assertion failed ------------[ cut here ]------------ kernel BUG at fs/fscache/page.c:874! invalid opcode: 0000 [#1] SMP Call Trace: [<ffffffffa02e6566>] __ceph_readpage_to_fscache+0x66/0x80 [ceph] [<ffffffffa02caf84>] readpage_nounlock+0x124/0x210 [ceph] [<ffffffffa02cb08d>] ceph_readpage+0x1d/0x40 [ceph] [<ffffffff81126db6>] generic_file_aio_read+0x1f6/0x700 [<ffffffffa02c6fcc>] ceph_aio_read+0x5fc/0xab0 [ceph] Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: Sage Weil <sage@inktank.com>	2013-09-06 16:50:12 +00:00
Milosz Tanski	76be778b3a	ceph: clean PgPrivate2 on returning from readpages In some cases the ceph readapages code code bails without filling all the pages already marked by fscache. When we return back to readahead code this causes a BUG. Signed-off-by: Milosz Tanski <milosz@adfin.com>	2013-09-06 16:50:11 +00:00
Milosz Tanski	99ccbd229c	ceph: use fscache as a local presisent cache Adding support for fscache to the Ceph filesystem. This would bring it to on par with some of the other network filesystems in Linux (like NFS, AFS, etc...) In order to mount the filesystem with fscache the 'fsc' mount option must be passed. Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: Sage Weil <sage@inktank.com>	2013-09-06 16:50:11 +00:00
Milosz Tanski	cd0a2df681	Patches for Ceph FS-Cache support -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.14 (GNU/Linux) iQIVAwUAUimQLxOxKuMESys7AQJc+Q/+N3BN+ZWRhfqSKANFyEuXIsUmzueCmmYc ZOgdRGTorlYKefqFHTFOHLvPbbxXaT2/HKUhq38yuA8UuIkoPL3rRQpjNcvWR+TX s/H17MRqPeZOfaDC/p9y2uL5MbuUvzhlCZ/GTi5w2ZwiNuWBo10gxeyCrXQSOFtH YFq0dVuG0AXzWdZuWcM3MtgY0llcMmfnZpIDjF4JDXJidgXY+wtjNUF3ByYZ+33+ +CmzXrnCaY+3N44Ji2Dn+ci8tym8uht4dnbTZFkQ0I6B+k93V2RkZeHWnDQWqW+c THjyG9c+LSf0m8FIh43DNNJSkywbh5dxsBgnqxhQTJMij0dV1ne8wjKptJMgz+0b HFUi4rE6oRQtbLdTtJhdjdFFORBGdFj71gW8foBdFAZTP4Amf/fbiAfHeK+33oDt s5PMJyfA3BDM90eBFoxDWjCEe+o6YcccHC0SVWM1ZJPQ/U0hXL99O6NSHBrr9iNP spBgM+fNTgtUMf6P6MwjJfTQHov5xevBNaLB3boUPAI6/yK8KQt9xoevJ7t902uQ /19bXoNgMmwAti1Gd6T5UnWlAHsOWdnIASUu8LVqEoh2PY42T2I4NTWhgs4NFntu 91MYysF93sTx0sMvGvWdCUSl7zMMeKXCUSUnPvMld+BMqyuK3XzsO3xkMn97t2/U p6kQZXZDlwU= =s35L -----END PGP SIGNATURE----- Merge tag 'fscache-fixes-for-ceph' into wip-fscache Patches for Ceph FS-Cache support	2013-09-06 16:41:20 +00:00
Linus Torvalds	2e515bf096	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "The usual trivial updates all over the tree -- mostly typo fixes and documentation updates" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits) doc: Documentation/cputopology.txt fix typo treewide: Convert retrun typos to return Fix comment typo for init_cma_reserved_pageblock Documentation/trace: Correcting and extending tracepoint documentation mm/hotplug: fix a typo in Documentation/memory-hotplug.txt power: Documentation: Update s2ram link doc: fix a typo in Documentation/00-INDEX Documentation/printk-formats.txt: No casts needed for u64/s64 doc: Fix typo "is is" in Documentations treewide: Fix printks with 0x%# zram: doc fixes Documentation/kmemcheck: update kmemcheck documentation doc: documentation/hwspinlock.txt fix typo PM / Hibernate: add section for resume options doc: filesystems : Fix typo in Documentations/filesystems scsi/megaraid fixed several typos in comments ppc: init_32: Fix error typo "CONFIG_START_KERNEL" treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks page_isolation: Fix a comment typo in test_pages_isolated() doc: fix a typo about irq affinity ...	2013-09-06 09:36:28 -07:00
Linus Torvalds	ec0ad73080	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext3, reiserfs, udf & isofs fixes from Jan Kara: "The contains a bunch of ext3 cleanups and minor improvements, major reiserfs locking changes which should hopefully fix deadlocks introduced by BKL removal, and udf/isofs changes to refuse mounting fs rw instead of mounting it ro automatically which makes eject button work as expected for all media (see the changelog for why userspace should be ok with this change)" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: jbd: use a single printk for jbd_debug() reiserfs: locking, release lock around quota operations reiserfs: locking, handle nested locks properly reiserfs: locking, push write lock out of xattr code jbd: relocate assert after state lock in journal_commit_transaction() udf: Refuse RW mount of the filesystem instead of making it RO udf: Standardize return values in mount sequence isofs: Refuse RW mount of the filesystem instead of making it RO ext3: allow specifying external journal by pathname mount option jbd: remove unneeded semicolon	2013-09-06 09:06:02 -07:00
Linus Torvalds	eb97a784f0	f2fs updates for v3.12 This patch-set includes the following major enhancement patches. o support inline xattrs o add sysfs support to control GCs explicitly o add proc entry to show the current segment usage information o improve the GC/SSR performance The other bug fixes are as follows. o avoid the overflow on status calculation o fix some error handling routines o fix inconsistent xattr states after power-off-recovery o fix incorrect xattr node offset definition o fix deadlock condition in fsync o fix the fdatasync routine for power-off-recovery -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJSKDoaAAoJEEAUqH6CSFDSovoQAJSWnvRfeu4olkKe7LblVXA5 NFYsjtdtnWsmSY1kq2j541SLo8Kw2UibozbrN6BaJ9MOKnTz1+x0R9U0vpewmCO4 FkxlGX/3i3k/4tR0AvD4U56xgqh+IhYi18nBN8kOTwhLqjFtx5JFKAHBnGwjbB4T YpEaitNY6dL8l+DUxs11KnPmNazbck6iNGOYXpvfhTS4DNSJTT0L/fLqugDhFJNI 7e3f6vVORRwC5UdtJk6B6HXxv1pHv4uGeLki0W4jgGp7AdxpawbfeDrDcrECjoc+ 0s/QQTsjoeIKeCfojSEgLGSSl8PZpx2VVCxri+nMPjLzY81QUXbpsAlhB2RW9Uz/ E9ESAPpzL9ykh35THALic7N0ATXGlepnu0EGU6+fjWGUIyHeV+2yoswz599VliRO GunHgwrfNMyXWHw9zw6SPIJvN3caPn3wlDhffei9wOl92YkleBuHA7ojIzfRc2vz YQ7jKmZNZ/CM2qiw350XIfSaa+3iszlxwoWK7DLWQZm3um0MpYme9RmadnPvxsRM gnUYiovPwR+om3zAnURMvq/LNKi6NjflRgu2OAU/0CpJMEX9vaVe/xOKdjCs19je dxinQGuOS5P+J141SkM3jJ1eyZLC4zCyp42xQSZ1+Zg+6BU4PVY3+i4hCQFopHDt fyeav8SM/fk9HrKU3npq =YMk3 -----END PGP SIGNATURE----- Merge tag 'for-f2fs-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "This patch-set includes the following major enhancement patches: - support inline xattrs - add sysfs support to control GCs explicitly - add proc entry to show the current segment usage information - improve the GC/SSR performance The other bug fixes are as follows: - avoid the overflow on status calculation - fix some error handling routines - fix inconsistent xattr states after power-off-recovery - fix incorrect xattr node offset definition - fix deadlock condition in fsync - fix the fdatasync routine for power-off-recovery" * tag 'for-f2fs-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (40 commits) f2fs: optimize gc for better performance f2fs: merge more bios of node block writes f2fs: avoid an overflow during utilization calculation f2fs: trigger GC when there are prefree segments f2fs: use strncasecmp() simplify the string comparison f2fs: fix omitting to update inode page f2fs: support the inline xattrs f2fs: add the truncate_xattr_node function f2fs: introduce __find_xattr for readability f2fs: reserve the xattr space dynamically f2fs: add flags for inline xattrs f2fs: fix error return code in init_f2fs_fs() f2fs: fix wrong BUG_ON condition f2fs: fix memory leak when init f2fs filesystem fail f2fs: fix a compound statement label error f2fs: avoid writing inode redundantly when creating a file f2fs: alloc_page() doesn't return an ERR_PTR f2fs: should cover i_xattr_nid with its xattr node page lock f2fs: check the free space first in new_node_page f2fs: clean up the needless end 'return' of void function ...	2013-09-06 09:04:34 -07:00
Trond Myklebust	4109bb7496	NFS: Don't check lock owner compatability unless file is locked (part 2) When coalescing requests into a single READ or WRITE RPC call, and there is no file locking involved, we don't have to refuse coalescing for requests where the lock owner information doesn't match. Reported-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-06 11:27:41 -04:00
Milosz Tanski	5a6f282a20	fscache: Netfs function for cleanup post readpages Currently the fscache code expect the netfs to call fscache_readpages_or_alloc inside the aops readpages callback. It marks all the pages in the list provided by readahead with PG_private_2. In the cases that the netfs fails to read all the pages (which is legal) it ends up returning to the readahead and triggering a BUG. This happens because the page list still contains marked pages. This patch implements a simple fscache_readpages_cancel function that the netfs should call before returning from readpages. It will revoke the pages from the underlying cache backend and unmark them. The problem was originally worked out in the Ceph devel tree, but it also occurs in CIFS. It appears that NFS, AFS and 9P are okay as read_cache_pages() will clean up the unprocessed pages in the case of an error. This can be used to address the following oops: [12410647.597278] BUG: Bad page state in process petabucket pfn:3d504e [12410647.597292] page:ffffea000f541380 count:0 mapcount:0 mapping: (null) index:0x0 [12410647.597298] page flags: 0x200000000001000(private_2) ... [12410647.597334] Call Trace: [12410647.597345] [<ffffffff815523f2>] dump_stack+0x19/0x1b [12410647.597356] [<ffffffff8111def7>] bad_page+0xc7/0x120 [12410647.597359] [<ffffffff8111e49e>] free_pages_prepare+0x10e/0x120 [12410647.597361] [<ffffffff8111fc80>] free_hot_cold_page+0x40/0x170 [12410647.597363] [<ffffffff81123507>] __put_single_page+0x27/0x30 [12410647.597365] [<ffffffff81123df5>] put_page+0x25/0x40 [12410647.597376] [<ffffffffa02bdcf9>] ceph_readpages+0x2e9/0x6e0 [ceph] [12410647.597379] [<ffffffff81122a8f>] __do_page_cache_readahead+0x1af/0x260 [12410647.597382] [<ffffffff81122ea1>] ra_submit+0x21/0x30 [12410647.597384] [<ffffffff81118f64>] filemap_fault+0x254/0x490 [12410647.597387] [<ffffffff8113a74f>] __do_fault+0x6f/0x4e0 [12410647.597391] [<ffffffff810125bd>] ? __switch_to+0x16d/0x4a0 [12410647.597395] [<ffffffff810865ba>] ? finish_task_switch+0x5a/0xc0 [12410647.597398] [<ffffffff8113d856>] handle_pte_fault+0xf6/0x930 [12410647.597401] [<ffffffff81008c33>] ? pte_mfn_to_pfn+0x93/0x110 [12410647.597403] [<ffffffff81008cce>] ? xen_pmd_val+0xe/0x10 [12410647.597405] [<ffffffff81005469>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e [12410647.597407] [<ffffffff8113f361>] handle_mm_fault+0x251/0x370 [12410647.597411] [<ffffffff812b0ac4>] ? call_rwsem_down_read_failed+0x14/0x30 [12410647.597414] [<ffffffff8155bffa>] __do_page_fault+0x1aa/0x550 [12410647.597418] [<ffffffff8108011d>] ? up_write+0x1d/0x20 [12410647.597422] [<ffffffff8113141c>] ? vm_mmap_pgoff+0xbc/0xe0 [12410647.597425] [<ffffffff81143bb8>] ? SyS_mmap_pgoff+0xd8/0x240 [12410647.597427] [<ffffffff8155c3ae>] do_page_fault+0xe/0x10 [12410647.597431] [<ffffffff81558818>] page_fault+0x28/0x30 Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: David Howells <dhowells@redhat.com>	2013-09-06 09:17:30 +01:00
David Howells	5002d7bef8	CacheFiles: Implement interface to check cache consistency Implement the FS-Cache interface to check the consistency of a cache object in CacheFiles. Original-author: Hongyi Jia <jiayisuse@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Hongyi Jia <jiayisuse@gmail.com> cc: Milosz Tanski <milosz@adfin.com>	2013-09-06 09:17:30 +01:00
David Howells	da9803bc88	FS-Cache: Add interface to check consistency of a cached object Extend the fscache netfs API so that the netfs can ask as to whether a cache object is up to date with respect to its corresponding netfs object: int fscache_check_consistency(struct fscache_cookie cookie) This will call back to the netfs to check whether the auxiliary data associated with a cookie is correct. It returns 0 if it is and -ESTALE if it isn't; it may also return -ENOMEM and -ERESTARTSYS. The backends now have to implement a mandatory operation pointer: int (check_consistency)(struct fscache_object *object) that corresponds to the above API call. FS-Cache takes care of pinning the object and the cookie in memory and managing this call with respect to the object state. Original-author: Hongyi Jia <jiayisuse@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Hongyi Jia <jiayisuse@gmail.com> cc: Milosz Tanski <milosz@adfin.com>	2013-09-06 09:17:30 +01:00
Phillip Lougher	9e01242386	Squashfs: add corruption check for type in squashfs_readdir() We read the type field from disk. This value should be sanity checked for correctness to avoid an out of bounds access when reading the squashfs_filetype_table array. Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>	2013-09-06 04:57:54 +01:00
Phillip Lougher	f960cae535	Squashfs: add corruption check in get_dir_index_using_offset() We read the size (of the name) field from disk. This value should be sanity checked for correctness to avoid blindly reading huge amounts of unnecessary data from disk on corruption. Note, here we're not actually reading the name into a buffer, but skipping it, and so corruption doesn't cause buffer overflow, merely lots of unnecessary amounts of data to be read. Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>	2013-09-06 04:57:53 +01:00
Phillip Lougher	68e7f41237	Squashfs: fix corruption checks in squashfs_readdir() The dir_count and size fields when read from disk are sanity checked for correctness. However, the sanity checks only check the values are not greater than expected. As dir_count and size were incorrectly defined as signed ints, this can lead to corrupted values appearing as negative which are not trapped. Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>	2013-09-06 04:57:53 +01:00
Phillip Lougher	52e9ce1c0f	Squashfs: fix corruption checks in squashfs_lookup() The dir_count and size fields when read from disk are sanity checked for correctness. However, the sanity checks only check the values are not greater than expected. As dir_count and size were incorrectly defined as signed ints, this can lead to corrupted values appearing as negative which are not trapped. Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>	2013-09-06 04:57:53 +01:00
Phillip Lougher	9dbc41d5d3	Squashfs: fix corruption check in get_dir_index_using_name() Patch "Squashfs: sanity check information from disk" from Dan Carpenter adds a missing check for corruption in the "size" field while reading the directory index from disk. It, however, sets err to -EINVAL, this value is not used later, and so setting it is completely redundant. So remove it. Errors in reading the index are deliberately non-fatal. If we get an error in reading the index we just return the part of the index we have managed to read - the index isn't essential, just quicker. Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>	2013-09-06 04:57:52 +01:00
Linus Torvalds	b14662cae0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc Pull sparc changes from David Miller: "Several bug fixes (from Kirill Tkhai, Geery Uytterhoeven, and Alexey Dobriyan) and some support for Fujitsu sparc64x chips (from Allen Pais)" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc64: Export flush_ptrace_access() (needed by lustre) sparc: fix PCI device proc file mmap(2) sparc64: Remove RWSEM export leftovers sparc64: Fix off by one in trampoline TLB mapping installation loop. sparc64: Fix ITLB handler of null page esp_scsi: Fix tag state corruption when autosensing. sparc64: Fix not SRA'ed %o5 in 32-bit traced syscall sparc64: cleanup: Rename ret_from_syscall to ret_from_fork sparc32: Fix exit flag passed from traced sys_sigreturn sparc64: Fix wrong syscall return value passed to trace_sys_exit() support sparc64x chip type in cpumap.c cpu hw caps support for sparc64x	2013-09-05 15:28:17 -07:00
Trond Myklebust	0f1d260550	NFS: Don't check lock owner compatibility in writes unless file is locked If we're doing buffered writes, and there is no file locking involved, then we don't have to worry about whether or not the lock owner information is identical. By relaxing this check, we ensure that fork()ed child processes can write to a page without having to first sync dirty data that was written by the parent to disk. Reported-by: Quentin Barnes <qbarnes@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Quentin Barnes <qbarnes@gmail.com>	2013-09-05 18:11:42 -04:00
Anand Avati	46ea1562da	fuse: drop dentry on failed revalidate Drop a subtree when we find that it has moved or been delated. This can be done as long as there are no submounts under this location. If the directory was moved and we come across the same directory in a future lookup it will be reconnected by d_materialise_unique(). Signed-off-by: Anand Avati <avati@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:23:54 -04:00
Miklos Szeredi	e2a6b95236	fuse: clean up return in fuse_dentry_revalidate() On errors unrelated to the filesystem's state (ENOMEM, ENOTCONN) return the error itself from ->d_revalidate() insted of returning zero (invalid). Also make a common label for invalidating the dentry. This will be used by the next patch. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:23:54 -04:00
Miklos Szeredi	5835f3390e	fuse: use d_materialise_unique() Use d_materialise_unique() instead of d_splice_alias(). This allows dentry subtrees to be moved to a new place if there moved, even if something is referencing a dentry in the subtree (open fd, cwd, etc..). This will also allow us to drop a subtree if it is found to be replaced by something else. In this case the disconnected subtree can later be reconnected to its new location. d_materialise_unique() ensures that a directory entry only ever has one alias. We keep fc->inst_mutex around the calls for d_materialise_unique() on directories to prevent a race with mkdir "stealing" the inode. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:23:53 -04:00
Miklos Szeredi	6497d160f6	sysfs: use check_submounts_and_drop() Do have_submounts(), shrink_dcache_parent() and d_drop() atomically. check_submounts_and_drop() can deal with negative dentries and non-directories as well. Non-directories can also be mounted on. And just like directories we don't want these to disappear with invalidation. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:23:53 -04:00
Miklos Szeredi	13caa9fb5b	nfs: use check_submounts_and_drop() Do have_submounts(), shrink_dcache_parent() and d_drop() atomically. check_submounts_and_drop() can deal with negative dentries and non-directories as well. Non-directories can also be mounted on. And just like directories we don't want these to disappear with invalidation. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:23:52 -04:00
Miklos Szeredi	1191a2bdf0	gfs2: use check_submounts_and_drop() Do have_submounts(), shrink_dcache_parent() and d_drop() atomically. check_submounts_and_drop() can deal with negative dentries and non-directories as well. Non-directories can also be mounted on. And just like directories we don't want these to disappear with invalidation. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:23:51 -04:00
Miklos Szeredi	ba81238076	afs: use check_submounts_and_drop() Do have_submounts(), shrink_dcache_parent() and d_drop() atomically. check_submounts_and_drop() can deal with negative dentries as well. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:23:51 -04:00
Miklos Szeredi	eed8100766	vfs: check unlinked ancestors before mount We check submounts before doing d_drop() on a non-empty directory dentry in NFS (have_submounts()), but we do not exclude a racing mount. Nor do we prevent mounts to be added to the disconnected subtree using relative paths after the d_drop(). This patch fixes these issues by checking for unlinked (unhashed, non-root) ancestors before proceeding with the mount. This is done with rename seqlock taken for write and with ->d_lock grabbed on each ancestor in turn, including our dentry itself. This ensures that the only one of check_submounts_and_drop() or has_unlinked_ancestor() can succeed. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:23:50 -04:00
Miklos Szeredi	848ac114e8	vfs: check submounts and drop atomically We check submounts before doing d_drop() on a non-empty directory dentry in NFS (have_submounts()), but we do not exclude a racing mount. Process A: have_submounts() -> returns false Process B: mount() -> success Process A: d_drop() This patch prepares the ground for the fix by doing the following operations all under the same rename lock: have_submounts() shrink_dcache_parent() d_drop() This is actually an optimization since have_submounts() and shrink_dcache_parent() both traverse the same dentry tree separately. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: David Howells <dhowells@redhat.com> CC: Steven Whitehouse <swhiteho@redhat.com> CC: Trond Myklebust <Trond.Myklebust@netapp.com> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:23:41 -04:00
Miklos Szeredi	db14fc3abc	vfs: add d_walk() This one replaces three instances open coded tree walking (have_submounts, select_parent, d_genocide) with a common helper. In addition to slightly reducing the kernel size, this simplifies the callers and makes them less bug prone. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:22:44 -04:00
Miklos Szeredi	01ddc4ede5	vfs: restructure d_genocide() It shouldn't matter when we decrement the refcount during the walk as long as we do it exactly once. Restructure d_genocide() to do the killing on entering the dentry instead of when leaving it. This helps creating a common helper for tree walking. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:22:43 -04:00
Alexey Dobriyan	c4fe244857	sparc: fix PCI device proc file mmap(2) Commit `786d7e1612` "Fix rmmod/read/write races in /proc entries" must have broken mmapping of PCI device proc files on Sparc. Notice how it adds wrapper around ->mmap but doesn't do it around ->get_unmapped_area. Add wrapper around ->get_unmapped_area. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-09-05 12:12:51 -07:00
Linus Torvalds	45d9a2220f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "Unfortunately, this merge window it'll have a be a lot of small piles - my fault, actually, for not keeping #for-next in anything that would resemble a sane shape ;-/ This pile: assorted fixes (the first 3 are -stable fodder, IMO) and cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last components) + several long-standing patches from various folks. There definitely will be a lot more (starting with Miklos' check_submount_and_drop() series)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) direct-io: Handle O_(D)SYNC AIO direct-io: Implement generic deferred AIO completions add formats for dentry/file pathnames kvm eventfd: switch to fdget powerpc kvm: use fdget switch fchmod() to fdget switch epoll_ctl() to fdget switch copy_module_from_fd() to fdget git simplify nilfs check for busy subtree ibmasmfs: don't bother passing superblock when not needed don't pass superblock to hypfs_{mkdir,create*} don't pass superblock to hypfs_diag_create_files don't pass superblock to hypfs_vm_create_files() oprofile: get rid of pointless forward declarations of struct super_block oprofilefs_create_...() do not need superblock argument oprofilefs_mkdir() doesn't need superblock argument don't bother with passing superblock to oprofile_create_stats_files() oprofile: don't bother with passing superblock to ->create_files() don't bother passing sb to oprofile_create_files() coh901318: don't open-code simple_read_from_buffer() ...	2013-09-05 08:50:26 -07:00
Weston Andros Adamson	8897538e97	nfs4: Map NFS4ERR_WRONG_CRED to EPERM Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-05 10:51:22 -04:00
Weston Andros Adamson	8c21c62c44	nfs4.1: Add SP4_MACH_CRED write and commit support WRITE and COMMIT can use the machine credential. If WRITE is supported and COMMIT is not, make all (mach cred) writes FILE_SYNC4. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-05 10:50:45 -04:00
Weston Andros Adamson	3787d5063c	nfs4.1: Add SP4_MACH_CRED stateid support TEST_STATEID and FREE_STATEID can use the machine credential. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-05 10:49:35 -04:00
Weston Andros Adamson	8b5bee2e1b	nfs4.1: Add SP4_MACH_CRED secinfo support SECINFO and SECINFO_NONAME can use the machine credential. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-05 10:48:30 -04:00
Weston Andros Adamson	fa940720ce	nfs4.1: Add SP4_MACH_CRED cleanup support CLOSE and LOCKU can use the machine credential. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-05 10:44:17 -04:00
Weston Andros Adamson	ab4c236135	nfs4.1: Add state protection handler Add nfs4_state_protect - the function responsible for switching to the machine credential and the correct rpc client when SP4_MACH_CRED is in use. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-05 10:43:33 -04:00
Weston Andros Adamson	2031cd1af1	nfs4.1: Minimal SP4_MACH_CRED implementation This is a minimal client side implementation of SP4_MACH_CRED. It will attempt to negotiate SP4_MACH_CRED iff the EXCHANGE_ID is using krb5i or krb5p auth. SP4_MACH_CRED will be used if the server supports the minimal operations: BIND_CONN_TO_SESSION EXCHANGE_ID CREATE_SESSION DESTROY_SESSION DESTROY_CLIENTID This patch only includes the EXCHANGE_ID negotiation code because the client will already use the machine cred for these operations. If the server doesn't support SP4_MACH_CRED or doesn't support the minimal operations, the exchange id will be resent with SP4_NONE. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-05 10:40:45 -04:00
Benjamin Marzinski	0c9018097f	GFS2: dirty inode correctly in gfs2_write_end GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when it actually increased the file size. If gfs2_fsync was called without I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before returning, so any metadata or journaled data changes were not getting fsynced. This meant that writes to the middle of files were not always getting fsynced properly. This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been updated during a write. It also make gfs2_sync flush the incore log if I_DIRTY_PAGES is set, and the file is using data journalling. This will make sure that all incore logged data gets written to disk before returning from a fsync. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-09-05 09:04:24 +01:00
Bob Peterson	1d12d175ea	GFS2: Don't flag consistency error if first mounter is a spectator This patch checks for the first mounter being a specator. If so, it makes sure all the journals are clean. If there's a dirty journal, the mount fails. Testing results: # insmod gfs2.ko # mount -tgfs2 -o spectator /dev/sasdrives/scratch /mnt/gfs2 mount: permission denied # dmesg \| tail -2 [ 3390.655996] GFS2: fsid=MUSKETEER:home: Now mounting FS... [ 3390.841336] GFS2: fsid=MUSKETEER:home.s: jid=0: Journal is dirty, so the first mounter must not be a spectator. # mount -tgfs2 /dev/sasdrives/scratch /mnt/gfs2 # umount /mnt/gfs2 # mount -tgfs2 -o spectator /dev/sasdrives/scratch /mnt/gfs2 # ls /mnt/gfs2\|wc -l 352 # umount /mnt/gfs2 Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-09-05 09:03:57 +01:00
Jin Xu	a26b7c8a01	f2fs: optimize gc for better performance This patch improves the gc efficiency by optimizing the victim selection policy. With this optimization, the random re-write performance could increase up to 20%. For f2fs, when disk is in shortage of free spaces, gc will selects dirty segments and moves valid blocks around for making more space available. The gc cost of a segment is determined by the valid blocks in the segment. The less the valid blocks, the higher the efficiency. The ideal victim segment is the one that has the most garbage blocks. Currently, it searches up to 20 dirty segments for a victim segment. The selected victim is not likely the best victim for gc when there are much more dirty segments. Why not searching more dirty segments for a better victim? The cost of searching dirty segments is negligible in comparison to moving blocks. In this patch, it enlarges the MAX_VICTIM_SEARCH to 4096 to make the search more aggressively for a possible better victim. Since it also applies to victim selection for SSR, it will likely improve the SSR efficiency as well. The test case is simple. It creates as many files until the disk full. The size for each file is 32KB. Then it writes as many as 100000 records of 4KB size to random offsets of random files in sync mode. The testing was done on a 2GB partition of a SDHC card. Let's see the test result of f2fs without and with the patch. --------------------------------------- 2GB partition, SDHC create 52023 files of size 32768 bytes random re-write 100000 records of 4KB --------------------------------------- \| file creation (s) \| rewrite time (s) \| gc count \| gc garbage blocks \| [no patch] 341 4227 1174 174840 [patched] 324 2958 645 106682 It's obvious that, with the patch, f2fs finishes the test in 20+% less time than without the patch. And internally it does much less gc with higher efficiency than before. Since the performance improvement is related to gc, it might not be so obvious for other tests that do not trigger gc as often as this one ( This is because f2fs selects dirty segments for SSR use most of the time when free space is in shortage). The well-known iozone test tool was not used for benchmarking the patch becuase it seems do not have a test case that performs random re-write on a full disk. This patch is the revised version based on the suggestion from Jaegeuk Kim. Signed-off-by: Jin Xu <jinuxstyle@gmail.com> [Jaegeuk Kim: suggested simpler solution] Reviewed-by: Jaegeuk Kim <jaegeuk.kim@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-09-05 13:50:32 +09:00
Jaegeuk Kim	423e95ccbe	f2fs: merge more bios of node block writes Previously, we experience bio traces as follows when running simple sequential write test. f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500104928, size = 4K f2fs_do_submit_bio: type = NODE, io = no sync, sector = 499922208, size = 368K f2fs_do_submit_bio: type = NODE, io = no sync, sector = 499914752, size = 140K -> total 512K The first one is to write an indirect node block, and the others are to write direct node blocks. The reason why there are two separate bios for direct node blocks is: 0. initial state ------------------ ------------------ \| \| \|xxxxxxxx \| ------------------ ------------------ 1. write 368K ------------------ ------------------ \| \| \|xxxxxxxxWWWWWWWW\| ------------------ ------------------ 2. write 140K ------------------ ------------------ \|WWWWWWW \| \|xxxxxxxxWWWWWWWW\| ------------------ ------------------ This is because f2fs_write_node_pages tries to write just 512K totally, so that we can lose the chance to merge more bios nicely. After this patch is applied, we can get the following bio traces. f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500103168, size = 8K f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500111368, size = 4K f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500107272, size = 512K f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500108296, size = 512K f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500109320, size = 500K And finally, we can improve the sequential write performance, from 458.775 MB/s to 479.945 MB/s on SSD. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-09-05 10:17:19 +09:00
Linus Torvalds	27703bb4a6	PTR_RET() is a weird name, and led to some confusing usage. We ended up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages. This has been sitting in linux-next for a whole cycle. Thanks, Rusty. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJSJo+1AAoJENkgDmzRrbjxIC4QALJK95o8AUXuwUkl+2fmFkUt hh2/PJ1vDYgk4Xt0J6hyoK7XMa0H1RkbBrROuDdsBnorMFpEsGcgdkUZte9ufoAS 97Bg+7N0KPbTB/S8vOwtW1vbERTJIVPN2uf6h1Wqm9Xc2puCh3HbMMr1AWMGu0WQ NqY5+Zz8zecy1UOrMhEP6H1CjeQcL1w1DO6YM5ydeqlKNzAz+JMfDXriLPDwiE7+ XFPDF/O3Vtd2ckA7L70Lio7hfHwxV5U4WwFVfiwls98XB4jcZqDKIoh1r8z4SRgR +0Rae2DN3BaOabGMr//5XdrzQVpwJTh5m2w8BAOHJvCJ9HR7Sq29UIN4u+TowZBy L2xYo4dvFxkympwu5zEd3c7vHYWKIaqmSq5PIjr4gF/uIo2OeOTrpPIK782ZEYb7 e+qUgOEM05V9AmQZCrSZeP9u474Sj8ow3sCtWxfdRtwNfoEIcUXsNNJd/zDHlVtW cEtXqc2xXIpcuUJQWlSaGp8fmRQjVZPzrLKYLM2m39ZcOOJbf5rzQAYS7hHPosIa SK+YVux/+Zzi+Xo/vXq1OlM/SruCr5S7JOgCxLowoQ88vupgXME6uPyC8EO+QQ50 GsrHes5ZNLbk0uVsfcexIyojkUnyvDmmnDpv+1zdC6RgZLJQn8OXp5yNhHhnhrFT BiHX6YFWtDDqRlVv8Q0F =LeaW -----END PGP SIGNATURE----- Merge tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull PTR_RET() removal patches from Rusty Russell: "PTR_RET() is a weird name, and led to some confusing usage. We ended up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages. This has been sitting in linux-next for a whole cycle" [ There are still some PTR_RET users scattered about, with some of them possibly being new, but most of them existing in Rusty's tree too. We have that #define PTR_RET(p) PTR_ERR_OR_ZERO(p) thing in <linux/err.h>, so they continue to work for now - Linus ] * tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR(). staging/zcache: don't use PTR_RET(). remoteproc: don't use PTR_RET(). pinctrl: don't use PTR_RET(). acpi: Replace weird use of PTR_RET. s390: Replace weird use of PTR_RET. PTR_RET is now PTR_ERR_OR_ZERO(): Replace most. PTR_RET is now PTR_ERR_OR_ZERO	2013-09-04 17:31:11 -07:00
Linus Torvalds	6f3bc58d84	dlm for 3.12 This set includes a workqueue cleanup and the removal of incorrect and unneeded signal blocking. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.13 (GNU/Linux) iQIcBAABAgAGBQJSJ4DKAAoJEDgbc8f8gGmqJ3kP/2UwnKt9degadDAPGy+xORUK WDkQ3YHacNkdVaIdSh6o4pAHG0Z2JU8zykJkHOAJd+SZhEp0hxEu05rfUSZKxJH/ b3pG6dpjX8pC1TtSibFXRrpH2BHIzpPRdxiDkmMs71myExwxeMbIQuGxmft8nigB 6yPBbb8biVTCNTzdpR2b0KbVKzrkIBotwWTUEZd3HJedxRO/BlQxOgHDgnicupfH 0FsU3aa+p986wwajZBPXPFrVcBGHMXOXc3Lpm56Uh7OwqCVYe2gE7BMW3Y7rcZ1G sGiB542m5i/LAYnf8sFXm2OLVjaUjWPyrWIyrLe5pPzZtHxctdhfyOHzXBqvKkj3 VRdcQN1rob+/8FGyu3tmYjhlhkEVQqIEX6sweuuGG5sE/zzvUzcUP0qIAT11QQJ/ Bt7nv5VpfT7kxIf7ToL/NtxHulU66TxC387d/QnkXftlDbzpzEY0YcCmcIdZMVEr e6rgvp1iWTvasLh/UFjTBkoRAX4DGpFQasYcS1S+Uj3IZx/wIbN8sIEijWcVvvuz fLeF2idgNqFCseWceBJGoHPsgFPRVDdFk9+7isu4FQYOEH0+VfNqJz3YU/bTehGK +GcfORSKohw2M3EUarcRHN0dV6vy6C/ACgTiUb7CubgbiUUzNWKTVVVuzRGqzBNE +k7w/Zbbz+Eugm4Zkty8 =MaZw -----END PGP SIGNATURE----- Merge tag 'dlm-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This set includes a workqueue cleanup and the removal of incorrect and unneeded signal blocking" * tag 'dlm-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: remove signal blocking dlm: WQ_NON_REENTRANT is meaningless and going away	2013-09-04 17:23:59 -07:00
Linus Torvalds	ae67d9a888	New features for 3.12: * Added aggressive extent caching using the extent status tree. This can actually decrease memory usage in read-mostly workloads since the information is much more compactly stored in the extent status tree than if we had to keep the extent tree metadata blocks in the buffer cache. This also improves Asynchronous I/O since it is it makes much less likely that we need to do metadata I/O to lookup the extent tree information. * Improve the recovery after corrupted allocation bitmaps are found when running in errors=ignore mode. Also fixed some writeback vs. truncate races when using a blocksize less than the page size. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.14 (GNU/Linux) iQIcBAABCAAGBQJSJ1SxAAoJENNvdpvBGATw6xAP/250u0YggRHup5cxmkJ7x+EH sv/Kbe8r1ftUY7aBQP1awHlVYnOZehh+kYUj+eIVPPXKananhu99qcJy99KFm8W9 gWVP5G0+zvKD++S8yHKhyKjqUtzZwhlYJU7oyqptPr903CVlfjsKx1OtGvUlbnde Hh/e+XpbltICPIa/O6gsE3SyRakbPtI0gvC4GbsD6EvAl+Rj3l6l+Ty9IkDqGFs9 YCVA2MUly6ZFYNRS8wkOPRP8T8lLwqIa7CNc75bEJPrGQL1R0iiIez0yaoZ83SOu HMC6wo3XjfgcsuMwJo/mtYsw06rjQy5oNPD5bISRaDtocI5v5Rv8t5EmANnoJFbu gy+psJ0XcKimL1BfsQ4vFCNiAkskkCQaFr2yJbo6VTDtHS8XV39MeMZ6IvcSqO+6 DQafMcKNiltDbdsywncsee+8ecncv/ZEZDiA6pIUm0lbljPopuzf6sBvxWOFGiHM xMBD0eyhns/TzfYHzzI+fTcR+GdBDqAkNOrA9i4medffS6iJDAJ6qC6ZhgQh32oR MCfYosVQwxmCInqtCh51+od29rk7ZIuBrPjp1+uMHjHqG5jDKcANgB7g3VAeQOf0 zuEYTFvGk6cLKfuJtlnaItKXN+eRTtVtfHlLRRq1+wR9UK+dFONV0Jufzs7Y1URI LbsmGkgxTL9xZEskZXgQ =tosu -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "New features for 3.12: - Added aggressive extent caching using the extent status tree. This can actually decrease memory usage in read-mostly workloads since the information is much more compactly stored in the extent status tree than if we had to keep the extent tree metadata blocks in the buffer cache. This also improves Asynchronous I/O since it is it makes much less likely that we need to do metadata I/O to lookup the extent tree information. - Improve the recovery after corrupted allocation bitmaps are found when running in errors=ignore mode. Also fixed some writeback vs truncate races when using a blocksize less than the page size" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits) ext4: allow specifying external journal by pathname mount option ext4: mark group corrupt on group descriptor checksum ext4: mark block group as corrupt on inode bitmap error ext4: mark block group as corrupt on block bitmap error ext4: fix type declaration of ext4_validate_block_bitmap ext4: error out if verifying the block bitmap fails jbd2: Fix endian mixing problems in the checksumming code ext4: isolate ext4_extents.h file ext4: Fix misspellings using 'codespell' tool ext4: convert write_begin methods to stable_page_writes semantics ext4: fix use of potentially uninitialized variables in debugging code ext4: fix lost truncate due to race with writeback ext4: simplify truncation code in ext4_setattr() ext4: fix ext4_writepages() in presence of truncate ext4: move test whether extent to map can be extended to one place ext4: fix warning in ext4_da_update_reserve_space() quota: provide interface for readding allocated space into reserved space ext4: avoid reusing recently deleted inodes in no journal mode ext4: allocate delayed allocation blocks before rename ext4: start handle at least possible moment when renaming files ...	2013-09-04 17:19:27 -07:00
Manish Sharma	e0125262a2	Squashfs: Optimized uncompressed buffer loop Merged the two for loops. We might get a little gain by overlapping wait_on_bh and the memcpy operations. Signed-off-by: Manish Sharma <manishrma@gmail.com> Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>	2013-09-05 00:13:37 +01:00
Trond Myklebust	f6de7a39c1	NFSv4: Document the recover_lost_locks kernel parameter Rename the new 'recover_locks' kernel parameter to 'recover_lost_locks' and change the default to 'false'. Document why in Documentation/kernel-parameters.txt Move the 'recover_lost_locks' kernel parameter to fs/nfs/super.c to make it easy to backport to kernels prior to 3.6.x, which don't have a separate NFSv4 module. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-04 12:26:32 -04:00
NeilBrown	ef1820f9be	NFSv4: Don't try to recover NFSv4 locks when they are lost. When an NFSv4 client loses contact with the server it can lose any locks that it holds. Currently when it reconnects to the server it simply tries to reclaim those locks. This might succeed even though some other client has held and released a lock in the mean time. So the first client might think the file is unchanged, but it isn't. This isn't good. If, when recovery happens, the locks cannot be claimed because some other client still holds the lock, then we get a message in the kernel logs, but the client can still write. So two clients can both think they have a lock and can both write at the same time. This is equally not good. There was a patch a while ago http://comments.gmane.org/gmane.linux.nfs/41917 which tried to address some of this, but it didn't seem to go anywhere. That patch would also send a signal to the process. That might be useful but for now this patch just causes writes to fail. For NFSv4 (unlike v2/v3) there is a strong link between the lock and the write request so we can fairly easily fail any IO of the lock is gone. While some applications might not expect this, it is still safer than allowing the write to succeed. Because this is a fairly big change in behaviour a module parameter, "recover_locks", is introduced which defaults to true (the current behaviour) but can be set to "false" to tell the client not to try to recover things that were lost. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-04 12:26:32 -04:00
Chuck Lever	b6a85258d8	NFS: Fix warning introduced by NFSv4.0 transport blocking patches When CONFIG_NFS_V4_1 is not enabled, gcc emits this warning: linux/fs/nfs/nfs4state.c:255:12: warning: ‘nfs4_begin_drain_session’ defined but not used [-Wunused-function] static int nfs4_begin_drain_session(struct nfs_client *clp) ^ Eventually NFSv4.0 migration recovery will invoke this function, but that has not yet been merged. Hide nfs4_begin_drain_session() behind CONFIG_NFS_V4_1 for now. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-04 12:26:30 -04:00
Chuck Lever	1cec16abf2	When CONFIG_NFS_V4_1 is not enabled, "make C=2" emits this warning: linux/fs/nfs/nfs4session.c:337:6: warning: symbol 'nfs41_set_target_slotid' was not declared. Should it be static? Move nfs41_set_target_slotid() and nfs41_update_target_slotid() back behind CONFIG_NFS_V4_1, since, in the final revision of this work, they are used only in NFSv4.1 and later. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-04 12:26:30 -04:00
Dong Fang	05726acabe	fuse: use list_for_each_entry() for list traversing Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-09-04 17:42:42 +02:00
Bob Peterson	068213f7d3	GFS2: Remove unnecessary memory barrier Function test_and_clear_bit implies a memory barrier, so subsequent memory barriers are unnecessary. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-09-04 15:58:21 +01:00
Christoph Hellwig	02afc27fae	direct-io: Handle O_(D)SYNC AIO Call generic_write_sync() from the deferred I/O completion handler if O_DSYNC is set for a write request. Also make sure various callers don't call generic_write_sync if the direct I/O code returns -EIOCBQUEUED. Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-04 09:23:46 -04:00
Christoph Hellwig	7b7a8665ed	direct-io: Implement generic deferred AIO completions Add support to the core direct-io code to defer AIO completions to user context using a workqueue. This replaces opencoded and less efficient code in XFS and ext4 (we save a memory allocation for each direct IO) and will be needed to properly support O_(D)SYNC for AIO. The communication between the filesystem and the direct I/O code requires a new buffer head flag, which is a bit ugly but not avoidable until the direct I/O code stops abusing the buffer_head structure for communicating with the filesystems. Currently this creates a per-superblock unbound workqueue for these completions, which is taken from an earlier patch by Jan Kara. I'm not really convinced about this use and would prefer a "normal" global workqueue with a high concurrency limit, but this needs further discussion. JK: Fixed ext4 part, dynamic allocation of the workqueue. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-04 09:23:46 -04:00
Linus Torvalds	f83b0a4e4c	Big part of this is the addition of compression to the generic pstore layer so that all backends can use the pitiful amounts of storage they control more effectively. Three other small fixes/cleanups too. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJSJhMJAAoJEKurIx+X31iBcpsP/1DfQDD9IHEt7E+tTObIdKOx 1yygmQu3IL2pvdUIkKRfblDP0PJRuwJmRPdp48Rb5CV+8i4WQuTCueOvNomIMZ4G 1DkAzJPD2W4C+dSBD3rvWk4totXCsqLwCkaWp/FzWM7OW+cJN3Kj3Vdh3k01CVxn PqfCoZB5y8CVVd6EQiUmAU8z/LoGE0psxxO/CL7lBC1IsAlVqSO93hqHwE9cm/f4 7EN069sHbKV5zOT1QsV7agn9TfHAHuj0fr3t8aDs/cC0rgz19VAsxg0sRkMHzKJD MGLqGCTHyQvYmo2xfHZLb1NjhL6EmHsQqhM9KQnkdDyQ3QgYOCSsglcYygSdZcN1 ZjynLbD/OSkLABefyw/ZjW1GGzIqU6jWzj+0fmbkUa0Pe1B1aDuf/OiAcYMRJdmg clWE7RbnisSlrLYl3MI36T4XvyzzHVfumOVET6foWcnVb8/t/gHhC6Yf96FSUWVj jISE8ECfjkk7SQTYHE0ThvBmqv8M/IwUEg6wA9tnsQByLMAO+VqGoXcBMKOxz+hD 7TPoQpa04Sqnplwx7HefbhErKH9TCQWUOgT5kjB7+bzbPpEG1cB3pb7izg3VFd7k l7d6vfXtAukeuYmU8PgY/ea6/Nnw7iMcDZCegzv+A4Ms/kWM9oH2ctvV6V/edYyb 0x7BgSOu69HztW5IveB6 =6H8r -----END PGP SIGNATURE----- Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull pstore changes from Tony Luck: "A big part of this is the addition of compression to the generic pstore layer so that all backends can use the pitiful amounts of storage they control more effectively. Three other small fixes/cleanups too. * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: pstore/ram: (really) fix undefined usage of rounddown_pow_of_two pstore/ram: Read and write to the 'compressed' flag of pstore efi-pstore: Read and write to the 'compressed' flag of pstore erst: Read and write to the 'compressed' flag of pstore powerpc/pseries: Read and write to the 'compressed' flag of pstore pstore: Add file extension to pstore file if compressed pstore: Add decompression support to pstore pstore: Introduce new argument 'compressed' in the read callback pstore: Add compression support to pstore pstore/Kconfig: Select ZLIB_DEFLATE and ZLIB_INFLATE when PSTORE is selected pstore: Add new argument 'compressed' in pstore write callback powerpc/pseries: Remove (de)compression in nvram with pstore enabled pstore: d_alloc_name() doesn't return an ERR_PTR acpi/apei/erst: Add missing iounmap() on error in erst_exec_move_data()	2013-09-03 21:14:06 -07:00
Al Viro	173c84012a	switch fchmod() to fdget Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-03 23:04:45 -04:00
Al Viro	7e3fb5842e	switch epoll_ctl() to fdget Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-03 23:04:44 -04:00
Al Viro	e95c311e17	git simplify nilfs check for busy subtree Reviewed-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-03 22:52:50 -04:00
Al Viro	badcf2b7b8	constify touch_atime() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-03 22:52:45 -04:00
Jeff Layton	8033426e6b	vfs: allow umount to handle mountpoints without revalidating them Christopher reported a regression where he was unable to unmount a NFS filesystem where the root had gone stale. The problem is that d_revalidate handles the root of the filesystem differently from other dentries, but d_weak_revalidate does not. We could simply fix this by making d_weak_revalidate return success on IS_ROOT dentries, but there are cases where we do want to revalidate the root of the fs. A umount is really a special case. We generally aren't interested in anything but the dentry and vfsmount that's attached at that point. If the inode turns out to be stale we just don't care since the intent is to stop using it anyway. Try to handle this situation better by treating umount as a special case in the lookup code. Have it resolve the parent using normal means, and then do a lookup of the final dentry without revalidating it. In most cases, the final lookup will come out of the dcache, but the case where there's a trailing symlink or !LAST_NORM entry on the end complicates things a bit. Cc: Neil Brown <neilb@suse.de> Reported-by: Christopher T Vogan <cvogan@us.ibm.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-03 22:50:29 -04:00
Yan, Zheng	590fb51f1c	vfs: call d_op->d_prune() before unhashing dentry The d_prune dentry operation is used to notify filesystem when VFS about to prune a hashed dentry from the dcache. There are three code paths that prune dentries: shrink_dcache_for_umount_subtree(), prune_dcache_sb() and d_prune_aliases(). For the d_prune_aliases() case, VFS unhashes the dentry first, then call the d_prune dentry operation. This confuses ceph_d_prune() (ceph uses the d_prune dentry operation to maintain a flag indicating whether the complete contents of a directory are in the dcache, pruning unhashed dentry does not affect dir's completeness) This patch fixes the issue by calling the d_prune dentry operation in d_prune_aliases(), before unhashing the dentry. Also make VFS only call the d_prune dentry operation for hashed dentry, to avoid calling the d_prune dentry operation twice when dentry is pruned by d_prune_aliases(). Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-03 22:50:28 -04:00
Al Viro	184cacabe2	only regular files with FMODE_WRITE need to be on s_files Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-03 22:50:28 -04:00
Al Viro	301f0268b6	nfsd: racy access to ->d_name in nsfd4_encode_path() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-03 22:50:28 -04:00
Linus Torvalds	32dad03d16	Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on the cgroup front. Most changes aren't visible to userland at all at this point and are laying foundation for the planned unified hierarchy. - The biggest change is decoupling the lifetime management of css (cgroup_subsys_state) from that of cgroup's. Because controllers (cpu, memory, block and so on) will need to be dynamically enabled and disabled, css which is the association point between a cgroup and a controller may come and go dynamically across the lifetime of a cgroup. Till now, css's were created when the associated cgroup was created and stayed till the cgroup got destroyed. Assumptions around this tight coupling permeated through cgroup core and controllers. These assumptions are gradually removed, which consists bulk of patches, and css destruction path is completely decoupled from cgroup destruction path. Note that decoupling of creation path is relatively easy on top of these changes and the patchset is pending for the next window. - cgroup has its own event mechanism cgroup.event_control, which is only used by memcg. It is overly complex trying to achieve high flexibility whose benefits seem dubious at best. Going forward, new events will simply generate file modified event and the existing mechanism is being made specific to memcg. This pull request contains prepatory patches for such change. - Various fixes and cleanups" Fixed up conflict in kernel/cgroup.c as per Tejun. * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits) cgroup: fix cgroup_css() invocation in css_from_id() cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp() cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup cgroup: implement CFTYPE_NO_PREFIX cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax cgroup: fix cgroup_write_event_control() cgroup: fix subsystem file accesses on the root cgroup cgroup: change cgroup_from_id() to css_from_id() cgroup: use css_get() in cgroup_create() to check CSS_ROOT cpuset: remove an unncessary forward declaration cgroup: RCU protect each cgroup_subsys_state release cgroup: move subsys file removal to kill_css() cgroup: factor out kill_css() cgroup: decouple cgroup_subsys_state destruction from cgroup destruction cgroup: replace cgroup->css_kill_cnt with ->nr_css cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item cgroup: move cgroup->subsys[] assignment to online_css() cgroup: reorganize css init / exit paths cgroup: add __rcu modifier to cgroup->subsys[] ...	2013-09-03 18:25:03 -07:00
Dave Chinner	1d03c6fa88	xfs: XFS_MOUNT_QUOTA_ALL needed by userspace So move it to a header file shared with userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-03 15:00:06 -05:00
Dave Chinner	50fc5f7acc	xfs: dtype changed xfs_dir2_sfe_put_ino to xfs_dir3_sfe_put_ino So fix up the export in xfs_dir2.h that is needed by userspace. <sigh> Now xfs_dir3_sfe_put_ino has been made static. Revert `98f7462` ("xfs: xfs_dir3_sfe_put_ino can be static") to being non static so that the code shared with userspace is identical again. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-03 14:51:16 -05:00
Chuck Lever	2cf8bca8b9	NFS: Update session draining barriers for NFSv4.0 transport blocking Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:38 -04:00
Chuck Lever	be05c860d7	NFS: Add nfs4_sequence calls for OPEN_CONFIRM Ensure OPEN_CONFIRM is not emitted while the transport is plugged. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:37 -04:00
Chuck Lever	fbd4bfd1d9	NFS: Add nfs4_sequence calls for RELEASE_LOCKOWNER Ensure RELEASE_LOCKOWNER is not emitted while the transport is plugged. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:37 -04:00
Chuck Lever	160881e33d	NFS: Enable nfs4_setup_sequence() for DELEGRETURN When CONFIG_NFS_V4_1 is disabled, the calls to nfs4_setup_sequence() and nfs4_sequence_done() are compiled out for the DELEGRETURN operation. To allow NFSv4.0 transport blocking to work for DELEGRETURN, these call sites have to be present all the time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:36 -04:00
Chuck Lever	3bd2384a77	NFS: NFSv4.0 transport blocking Plumb in a mechanism for plugging an NFSv4.0 mount, using the same infrastructure as NFSv4.1 sessions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:35 -04:00
Chuck Lever	abf79bb341	NFS: Add a slot table to struct nfs_client for NFSv4.0 transport blocking Anchor an nfs4_slot_table in the nfs_client for use with NFSv4.0 transport blocking. It is initialized only for NFSv4.0 nfs_client's. Introduce appropriate minor version ops to handle nfs_client initialization and shutdown requirements that differ for each minor version. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:35 -04:00
Chuck Lever	eb2a1cd3c9	NFS: Add global helper for releasing slot table resources The nfs4_destroy_slot_tables() function is renamed to avoid confusion with the new helper. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:34 -04:00
Chuck Lever	744aa52530	NFS: Add global helper to set up a stand-along nfs4_slot_table Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:34 -04:00
Chuck Lever	9d33059c1b	NFS: Enable slot table helpers for NFSv4.0 I'd like to re-use NFSv4.1's slot table machinery for NFSv4.0 transport blocking. Re-organize some of nfs4session.c so the slot table code is built even when NFS_V4_1 is disabled. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:33 -04:00
Chuck Lever	220e09ccd3	NFS: Remove unused call_sync minor version op Clean up. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:33 -04:00
Chuck Lever	9915ea7e0a	NFS: Add RPC callouts to start NFSv4.0 synchronous requests Refactor nfs4_call_sync_sequence() so it is used for NFSv4.0 now. The RPC callouts will house transport blocking logic similar to NFSv4.1 sessions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:32 -04:00
Chuck Lever	a9c92d6b85	NFS: Common versions of sequence helper functions NFSv4.0 will have need for this functionality when I add the ability to block NFSv4.0 traffic before migration recovery. I'm not really clear on why nfs4_set_sequence_privileged() gets a generic name, but nfs41_init_sequence() gets a minor version-specific name. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:32 -04:00
Chuck Lever	5a580e0ae2	NFS: Clean up nfs4_setup_sequence() Clean up: Both the NFSv4.0 and NFSv4.1 version of nfs4_setup_sequence() are used only in fs/nfs/nfs4proc.c. No need to keep global header declarations for either version. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:31 -04:00
Chuck Lever	2a3eb2b97b	NFS: Rename nfs41_call_sync_data as a common data structure Clean up: rename nfs41_call_sync_data for use as a data structure common to all NFSv4 minor versions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:31 -04:00
Chuck Lever	e8d92382dd	NFS: When displaying session slot numbers, use "%u" consistently Clean up, since slot and sequence numbers are all unsigned anyway. Among other things, squelch compiler warnings: linux/fs/nfs/nfs4proc.c: In function ‘nfs4_setup_sequence’: linux/fs/nfs/nfs4proc.c:703:2: warning: signed and unsigned type in conditional expression [-Wsign-compare] and linux/fs/nfs/nfs4session.c: In function ‘nfs4_alloc_slot’: linux/fs/nfs/nfs4session.c:151:31: warning: signed and unsigned type in conditional expression [-Wsign-compare] Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:30 -04:00
Trond Myklebust	ba6c05928d	NFS: Ensure that rmdir() waits for sillyrenames to complete If an NFS client does mkdir("dir"); fd = open("dir/file"); unlink("dir/file"); close(fd); rmdir("dir"); then the asynchronous nature of the sillyrename operation means that we can end up getting EBUSY for the rmdir() in the above test. Fix that by ensuring that we wait for any in-progress sillyrenames before sending the rmdir() to the server. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:26:29 -04:00
Weston Andros Adamson	a5250def7c	NFSv4: use the mach cred for SECINFO w/ integrity Commit `5ec16a8500` introduced a regression that causes SECINFO to fail without actualy sending an RPC if: 1) the nfs_client's rpc_client was using KRB5i/p (now tried by default) 2) the current user doesn't have valid kerberos credentials This situation is quite common - as of now a sec=sys mount would use krb5i for the nfs_client's rpc_client and a user would hardly be faulted for not having run kinit. The solution is to use the machine cred when trying to use an integrity protected auth flavor for SECINFO. Older servers may not support using the machine cred or an integrity protected auth flavor for SECINFO in every circumstance, so we fall back to using the user's cred and the filesystem's auth flavor in this case. We run into another problem when running against linux nfs servers - they return NFS4ERR_WRONGSEC when using integrity auth flavor (unless the mount is also that flavor) even though that is not a valid error for SECINFO*. Even though it's against spec, handle WRONGSEC errors on SECINFO by falling back to using the user cred and the filesystem's auth flavor. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:25:10 -04:00
Andy Adamson	dc24826bfc	NFS avoid expired credential keys for buffered writes We must avoid buffering a WRITE that is using a credential key (e.g. a GSS context key) that is about to expire or has expired. We currently will paint ourselves into a corner by returning success to the applciation for such a buffered WRITE, only to discover that we do not have permission when we attempt to flush the WRITE (and potentially associated COMMIT) to disk. Use the RPC layer credential key timeout and expire routines which use a a watermark, gss_key_expire_timeo. We test the key in nfs_file_write. If a WRITE is using a credential with a key that will expire within watermark seconds, flush the inode in nfs_write_end and send only NFS_FILE_SYNC WRITEs by adding nfs_ctx_key_to_expire to nfs_need_sync_write. Note that this results in single page NFS_FILE_SYNC WRITEs. Signed-off-by: Andy Adamson <andros@netapp.com> [Trond: removed a pr_warn_ratelimited() for now] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-03 15:25:09 -04:00
Linus Torvalds	542a086ac7	Driver core patches for 3.12-rc1 Here's the big driver core pull request for 3.12-rc1. Lots of tiny changes here fixing up the way sysfs attributes are created, to try to make drivers simpler, and fix a whole class race conditions with creations of device attributes after the device was announced to userspace. All the various pieces are acked by the different subsystem maintainers. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.21 (GNU/Linux) iEYEABECAAYFAlIlIPcACgkQMUfUDdst+ynUMwCaAnITsxyDXYQ4DqEsz8EcOtMk 718AoLrgnUZs3B+70AT34DVktg4HSThk =USl9 -----END PGP SIGNATURE----- Merge tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core patches from Greg KH: "Here's the big driver core pull request for 3.12-rc1. Lots of tiny changes here fixing up the way sysfs attributes are created, to try to make drivers simpler, and fix a whole class race conditions with creations of device attributes after the device was announced to userspace. All the various pieces are acked by the different subsystem maintainers" * tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits) firmware loader: fix pending_fw_head list corruption drivers/base/memory.c: introduce help macro to_memory_block dynamic debug: line queries failing due to uninitialized local variable sysfs: sysfs_create_groups returns a value. debugfs: provide debugfs_create_x64() when disabled rbd: convert bus code to use bus_groups firmware: dcdbas: use binary attribute groups sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled driver core: add #include <linux/sysfs.h> to core files. HID: convert bus code to use dev_groups Input: serio: convert bus code to use drv_groups Input: gameport: convert bus code to use drv_groups driver core: firmware: use __ATTR_RW() driver core: core: use DEVICE_ATTR_RO driver core: bus: use DRIVER_ATTR_WO() driver core: create write-only attribute macros for devices and drivers sysfs: create __ATTR_WO() driver-core: platform: convert bus code to use dev_groups workqueue: convert bus code to use dev_groups MEI: convert bus code to use dev_groups ...	2013-09-03 11:37:15 -07:00
Linus Torvalds	fc6d0b0376	Merge branch 'lockref' (locked reference counts) Merge lockref infrastructure code by me and Waiman Long. I already merged some of the preparatory patches that didn't actually do any semantic changes earlier, but this merges the actual _reason_ for those preparatory patches. The "lockref" structure is a combination "spinlock and reference count" that allows optimized reference count accesses. In particular, it guarantees that the reference count will be updated AS IF the spinlock was held, but using atomic accesses that cover both the reference count and the spinlock words, we can often do the update without actually having to take the lock. This allows us to avoid the nastiest cases of spinlock contention on large machines under heavy pathname lookup loads. When updating the dentry reference counts on a large system, we'll still end up with the cache line bouncing around, but that's much less noticeable than actually having to spin waiting for the lock. * lockref: lockref: implement lockless reference count updates using cmpxchg() lockref: uninline lockref helper functions vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock() vfs: use lockref_get_not_zero() for optimistic lockless dget_parent() lockref: add 'lockref_get_or_lock() helper	2013-09-03 08:08:21 -07:00
Miklos Szeredi	efeb9e60d4	fuse: readdir: check for slash in names Userspace can add names containing a slash character to the directory listing. Don't allow this as it could cause all sorts of trouble. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org	2013-09-03 14:28:38 +02:00
Maxim Patlasov	06a7c3c278	fuse: hotfix truncate_pagecache() issue The way how fuse calls truncate_pagecache() from fuse_change_attributes() is completely wrong. Because, w/o i_mutex held, we never sure whether 'oldsize' and 'attr->size' are valid by the time of execution of truncate_pagecache(inode, oldsize, attr->size). In fact, as soon as we released fc->lock in the middle of fuse_change_attributes(), we completely loose control of actions which may happen with given inode until we reach truncate_pagecache. The list of potentially dangerous actions includes mmap-ed reads and writes, ftruncate(2) and write(2) extending file size. The typical outcome of doing truncate_pagecache() with outdated arguments is data corruption from user point of view. This is (in some sense) acceptable in cases when the issue is triggered by a change of the file on the server (i.e. externally wrt fuse operation), but it is absolutely intolerable in scenarios when a single fuse client modifies a file without any external intervention. A real life case I discovered by fsx-linux looked like this: 1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends FUSE_SETATTR to the server synchronously, but before getting fc->lock ... 2. fuse_dentry_revalidate() is asynchronously called. It sends FUSE_LOOKUP to the server synchronously, then calls fuse_change_attributes(). The latter updates i_size, releases fc->lock, but before comparing oldsize vs attr->size.. 3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and updating attributes and i_size, but now oldsize is equal to outarg.attr.size because i_size has just been updated (step 2). Hence, fuse_do_setattr() returns w/o calling truncate_pagecache(). 4. As soon as ftruncate(2) completes, the user extends file size by write(2) making a hole in the middle of file, then reads data from the hole either by read(2) or mmap-ed read. The user expects to get zero data from the hole, but gets stale data because truncate_pagecache() is not executed yet. The scenario above illustrates one side of the problem: not truncating the page cache even though we should. Another side corresponds to truncating page cache too late, when the state of inode changed significantly. Theoretically, the following is possible: 1. As in the previous scenario fuse_dentry_revalidate() discovered that i_size changed (due to our own fuse_do_setattr()) and is going to call truncate_pagecache() for some 'new_size' it believes valid right now. But by the time that particular truncate_pagecache() is called ... 2. fuse_do_setattr() returns (either having called truncate_pagecache() or not -- it doesn't matter). 3. The file is extended either by write(2) or ftruncate(2) or fallocate(2). 4. mmap-ed write makes a page in the extended region dirty. The result will be the lost of data user wrote on the fourth step. The patch is a hotfix resolving the issue in a simplistic way: let's skip dangerous i_size update and truncate_pagecache if an operation changing file size is in progress. This simplistic approach looks correct for the cases w/o external changes. And to handle them properly, more sophisticated and intrusive techniques (e.g. NFS-like one) would be required. I'd like to postpone it until the issue is well discussed on the mailing list(s). Changed in v2: - improved patch description to cover both sides of the issue. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org	2013-09-03 13:41:58 +02:00
Anand Avati	d331a415ae	fuse: invalidate inode attributes on xattr modification Calls like setxattr and removexattr result in updation of ctime. Therefore invalidate inode attributes to force a refresh. Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org	2013-09-03 13:41:58 +02:00
Maxim Patlasov	4a4ac4eba1	fuse: postpone end_page_writeback() in fuse_writepage_locked() The patch fixes a race between ftruncate(2), mmap-ed write and write(2): 1) An user makes a page dirty via mmap-ed write. 2) The user performs shrinking truncate(2) intended to purge the page. 3) Before fuse_do_setattr calls truncate_pagecache, the page goes to writeback. fuse_writepage_locked fills FUSE_WRITE request and releases the original page by end_page_writeback. 4) fuse_do_setattr() completes and successfully returns. Since now, i_mutex is free. 5) Ordinary write(2) extends i_size back to cover the page. Note that fuse_send_write_pages do wait for fuse writeback, but for another page->index. 6) fuse_writepage_locked proceeds by queueing FUSE_WRITE request. fuse_send_writepage is supposed to crop inarg->size of the request, but it doesn't because i_size has already been extended back. Moving end_page_writeback to the end of fuse_writepage_locked fixes the race because now the fact that truncate_pagecache is successfully returned infers that fuse_writepage_locked has already called end_page_writeback. And this, in turn, infers that fuse_flush_writepages has already called fuse_send_writepage, and the latter used valid (shrunk) i_size. write(2) could not extend it because of i_mutex held by ftruncate(2). Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org	2013-09-03 13:41:57 +02:00
Jaegeuk Kim	222cbdc483	f2fs: avoid an overflow during utilization calculation The current f2fs uses all the block counts with 32 bit numbers, which is able to cover about 15TB volume. But in calculation of utilization, f2fs multiplies the count by 100 which can induce overflow. This patch fixes this. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-09-03 13:41:37 +09:00
Jaegeuk Kim	c34e333fd5	f2fs: trigger GC when there are prefree segments Previously, f2fs conducts SSR when free_sections() < overprovision_sections. But, even though there are a lot of prefree segments, it can consider SSR only. So, let's consider the number of prefree segments too for triggering SSR. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-09-03 10:11:20 +09:00
Linus Torvalds	15570086b5	vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock() This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c and re-implements it using the lockref infrastructure instead. It also adds a lot of comments about what is actually going on, because turning a dentry that was looked up using RCU into a long-lived reference counted entry is one of the more subtle parts of the rcu walk. We also used to be _particularly_ subtle in unlazy_walk() where we re-validate both the dentry and its parent using the same sequence count. We used to do it by nesting the locks and then verifying the sequence count just once. That was silly, because nested locking is expensive, but the sequence count check is not. So this just re-validates the dentry and the parent separately, avoiding the nested locking, and making the lockref lookup possible. Acked-by: Waiman Long <waiman.long@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-02 11:38:06 -07:00
Waiman Long	df3d0bbcdb	vfs: use lockref_get_not_zero() for optimistic lockless dget_parent() A valid parent pointer is always going to have a non-zero reference count, but if we look up the parent optimistically without locking, we have to protect against the (very unlikely) race against renaming changing the parent from under us. We do that by using lockref_get_not_zero(), and then re-checking the parent pointer after getting a valid reference. [ This is a re-implementation of a chunk from the original patch by Waiman Long: "dcache: Enable lockless update of dentry's refcount". I've completely rewritten the patch-series and split it up, but I'm attributing this part to Waiman as it's close enough to his earlier patch - Linus ] Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-02 11:29:22 -07:00
Trond Myklebust	2127d82af3	NFSv4: Convert idmapper to use the new framework for pipefs dentries Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-01 11:12:42 -04:00
Filipe David Borba Manana	d7396f0735	Btrfs: optimize key searches in btrfs_search_slot When the binary search returns 0 (exact match), the target key will necessarily be at slot 0 of all nodes below the current one, so in this case the binary search is not needed because it will always return 0, and we waste time doing it, holding node locks for longer than necessary, etc. Below follow histograms with the times spent on the current approach of doing a binary search when the previous binary search returned 0, and times for the new approach, which directly picks the first item/child node in the leaf/node. Current approach: Count: 6682 Range: 35.000 - 8370.000; Mean: 85.837; Median: 75.000; Stddev: 106.429 Percentiles: 90th: 124.000; 95th: 145.000; 99th: 206.000 35.000 - 61.080: 1235 ################ 61.080 - 106.053: 4207 ##################################################### 106.053 - 183.606: 1122 ############## 183.606 - 317.341: 111 # 317.341 - 547.959: 6 \| 547.959 - 8370.000: 1 \| Approach proposed by this patch: Count: 6682 Range: 6.000 - 135.000; Mean: 16.690; Median: 16.000; Stddev: 7.160 Percentiles: 90th: 23.000; 95th: 27.000; 99th: 40.000 6.000 - 8.418: 58 # 8.418 - 11.670: 1149 ######################### 11.670 - 16.046: 2418 ##################################################### 16.046 - 21.934: 2098 ############################################## 21.934 - 29.854: 744 ################ 29.854 - 40.511: 154 ### 40.511 - 54.848: 41 # 54.848 - 74.136: 5 \| 74.136 - 100.087: 9 \| 100.087 - 135.000: 6 \| These samples were captured during a run of the btrfs tests 001, 002 and 004 in the xfstests, with a leaf/node size of 4Kb. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:42 -04:00
Josef Bacik	45d5fd14d2	Btrfs: don't use an async starter for most of our workers We only need an async starter if we can't make a GFP_NOFS allocation in our current path. This is the case for the endio stuff since it happens in IRQ context, but things like the caching thread workers and the delalloc flushers we can easily make this allocation and start threads right away. Also change the worker count for the caching thread pool. Traditionally we limited this to 2 since we took read locks while caching, but nowadays we do this lockless so there's no reason to limit the number of caching threads. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:41 -04:00
Josef Bacik	7f4f6e0a3f	Btrfs: only update disk_i_size as we remove extents This fixes a problem where if we fail a truncate we will leave the i_size set where we wanted to truncate to instead of where we were able to truncate to. Fix this by making btrfs_truncate_inode_items do the disk_i_size update as it removes extents, that way it will always be consistent with where its extents are. Then if the truncate fails at all we can update the in-ram i_size with what we have on disk and delete the orphan item. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:40 -04:00
Filipe David Borba Manana	f45388f387	Btrfs: fix deadlock in uuid scan kthread If there's an ongoing transaction when the uuid scan kthread attempts to create one, the kthread will block, waiting for that transaction to finish while it's keeping locks on the tree root, and in turn the existing transaction is waiting for those locks to be free. The stack trace reported by the kernel follows. [36700.671601] INFO: task btrfs-uuid:15480 blocked for more than 120 seconds. [36700.671602] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [36700.671602] btrfs-uuid D 0000000000000000 0 15480 2 0x00000000 [36700.671604] ffff880710bd5b88 0000000000000046 ffff8803d36ba850 0000000000030000 [36700.671605] ffff8806d76dc530 ffff880710bd5fd8 ffff880710bd5fd8 ffff880710bd5fd8 [36700.671607] ffff8808098ac530 ffff8806d76dc530 ffff880710bd5b98 ffff8805e4508e40 [36700.671608] Call Trace: [36700.671610] [<ffffffff816f36b9>] schedule+0x29/0x70 [36700.671620] [<ffffffffa05a3bdf>] wait_current_trans.isra.33+0xbf/0x120 [btrfs] [36700.671623] [<ffffffff81066760>] ? add_wait_queue+0x60/0x60 [36700.671629] [<ffffffffa05a5b06>] start_transaction+0x3d6/0x530 [btrfs] [36700.671636] [<ffffffffa05bb1f4>] ? btrfs_get_token_32+0x64/0xf0 [btrfs] [36700.671642] [<ffffffffa05a5fbb>] btrfs_start_transaction+0x1b/0x20 [btrfs] [36700.671649] [<ffffffffa05c8a81>] btrfs_uuid_scan_kthread+0x211/0x3d0 [btrfs] [36700.671655] [<ffffffffa05c8870>] ? __btrfs_open_devices+0x2a0/0x2a0 [btrfs] [36700.671657] [<ffffffff81065fa0>] kthread+0xc0/0xd0 [36700.671659] [<ffffffff81065ee0>] ? flush_kthread_worker+0xb0/0xb0 [36700.671661] [<ffffffff816fcd1c>] ret_from_fork+0x7c/0xb0 [36700.671662] [<ffffffff81065ee0>] ? flush_kthread_worker+0xb0/0xb0 [36700.671663] INFO: task btrfs:15481 blocked for more than 120 seconds. [36700.671664] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [36700.671665] btrfs D 0000000000000000 0 15481 15212 0x00000004 [36700.671666] ffff880248cbf4c8 0000000000000086 ffff8803d36ba700 ffff8801dbd5c280 [36700.671668] ffff880807815c40 ffff880248cbffd8 ffff880248cbffd8 ffff880248cbffd8 [36700.671669] ffff8805e86a0000 ffff880807815c40 ffff880248cbf4d8 ffff8801dbd5c280 [36700.671670] Call Trace: [36700.671672] [<ffffffff816f36b9>] schedule+0x29/0x70 [36700.671679] [<ffffffffa05d9b0d>] btrfs_tree_lock+0x6d/0x230 [btrfs] [36700.671680] [<ffffffff81066760>] ? add_wait_queue+0x60/0x60 [36700.671685] [<ffffffffa0582829>] btrfs_search_slot+0x999/0xb00 [btrfs] [36700.671691] [<ffffffffa05bd9de>] ? btrfs_lookup_first_ordered_extent+0x5e/0xb0 [btrfs] [36700.671698] [<ffffffffa05e3e54>] __btrfs_write_out_cache+0x8c4/0xa80 [btrfs] [36700.671704] [<ffffffffa05e4362>] btrfs_write_out_cache+0xb2/0xf0 [btrfs] [36700.671710] [<ffffffffa05c4441>] ? free_extent_buffer+0x61/0xc0 [btrfs] [36700.671716] [<ffffffffa0594c82>] btrfs_write_dirty_block_groups+0x562/0x650 [btrfs] [36700.671723] [<ffffffffa0610092>] commit_cowonly_roots+0x171/0x24b [btrfs] [36700.671729] [<ffffffffa05a4dde>] btrfs_commit_transaction+0x4fe/0xa10 [btrfs] [36700.671735] [<ffffffffa0610af3>] create_subvol+0x5c0/0x636 [btrfs] [36700.671742] [<ffffffffa05d49ff>] btrfs_mksubvol.isra.60+0x33f/0x3f0 [btrfs] [36700.671747] [<ffffffffa05d4bf2>] btrfs_ioctl_snap_create_transid+0x142/0x190 [btrfs] [36700.671752] [<ffffffffa05d4c6c>] ? btrfs_ioctl_snap_create+0x2c/0x80 [btrfs] [36700.671757] [<ffffffffa05d4c9e>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs] [36700.671759] [<ffffffff8113a764>] ? handle_pte_fault+0x84/0x920 [36700.671764] [<ffffffffa05d87eb>] btrfs_ioctl+0xf0b/0x1d00 [btrfs] [36700.671766] [<ffffffff8113c120>] ? handle_mm_fault+0x210/0x310 [36700.671768] [<ffffffff816f83a4>] ? __do_page_fault+0x284/0x4e0 [36700.671770] [<ffffffff81180aa6>] do_vfs_ioctl+0x96/0x550 [36700.671772] [<ffffffff81170fe3>] ? __sb_end_write+0x33/0x70 [36700.671774] [<ffffffff81180ff1>] SyS_ioctl+0x91/0xb0 [36700.671775] [<ffffffff816fcdc2>] system_call_fastpath+0x16/0x1b Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:39 -04:00
Ilya Dryomov	795a332139	Btrfs: stop refusing the relocation of chunk 0 AFAICT chunk 0 is no longer special, and so it should be restriped just like every other chunk. One reason for this change is us refusing the relocation can lead to filesystems that can only be mounted ro, and never rw -- see the bugzilla [1] for details. The other reason is that device removal code is already doing this: it will happily relocate chunk 0 is part of shrinking the device. [1] https://bugzilla.kernel.org/show_bug.cgi?id=60594 Reported-by: Xavier Bassery <xavier@bartica.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:38 -04:00
Filipe David Borba Manana	d8f980391f	Btrfs: fix memory leak of uuid_root in free_fs_info Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:37 -04:00
Andy Shevchenko	ed84885d1e	btrfs: reuse kbasename helper To get name of the file from a pathname let's use kbasename() helper. It allows to simplify code a bit. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:36 -04:00
Anand Jain	e57138b3e9	btrfs: return btrfs error code for dev excl ops err now threads can return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS as defined in btrfs.h for the dev excl operation error in the FS, which means with this kernel would stop logging (almost an user error) into the /var/log/messages v2: accepts Josef' comment Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:35 -04:00
Josef Bacik	77cef2ec54	Btrfs: allow partial ordered extent completion We currently have this problem where you can truncate pages that have not yet been written for an ordered extent. We do this because the truncate will be coming behind to clean us up anyway so what's the harm right? Well if truncate fails for whatever reason we leave an orphan item around for the file to be cleaned up later. But if the user goes and truncates up the file and tries to read from the area that had been discarded previously they will get a csum error because we never actually wrote that data out. This patch fixes this by allowing us to either discard the ordered extent completely, by which I mean we just free up the space we had allocated and not add the file extent, or adjust the length of the file extent we write. We do this by setting the length we truncated down to in the ordered extent, and then we set the file extent length and ram bytes to this length. The total disk space stays unchanged since we may be compressed and we can't just chop off the disk space, but at least this way the file extent only points to the valid data. Then when the file extent is free'd the extent and csums will be freed normally. This patch is needed for the next series which will give us more graceful recovery of failed truncates. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:34 -04:00
Josef Bacik	b12d6869f6	Btrfs: convert all bug_ons in free-space-cache.c All of these are logic checks to make sure we're not breaking anything, so convert them over to ASSERT(). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:33 -04:00
Josef Bacik	2e17c7c65e	Btrfs: add support for asserts One of the complaints we get a lot is how many BUG_ON()'s we have. So to help with this I'm introducing a kconfig option to enable/disable a new ASSERT() mechanism much like what XFS does. This will allow us developers to still get our nice panics but allow users/distros to compile them out. With this we can go through and convert any BUG_ON()'s that we have to catch actual programming mistakes to the new ASSERT() and then fix everybody else to return errors. This will also allow developers to leave sanity checks in their new code to make sure we don't trip over problems while testing stuff and vetting new features. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:32 -04:00
Josef Bacik	726551ebc7	Btrfs: adjust the fs_devices->missing count on unmount I noticed that if I tried to mount a file system with -o degraded after having done it once already we would fail to mount. This is because the fs_devices->missing count was getting bumped everytime we mounted, but not getting reset whenever we unmounted. To fix this we just drop the missing count as we're closing devices to make sure this doesn't happen. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:31 -04:00
Stefan Behrens	23fa76b0ba	Btrf: cleanup: don't check for root_refs == 0 twice btrfs_read_fs_root_no_name() already checks if btrfs_root_refs() is zero and returns ENOENT in this case. There is no need to do it again in three more places. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:29 -04:00
Stefan Behrens	4847547172	Btrfs: fix for patch "cleanup: don't check the same thing twice" Mitch Harder noticed that the patch `3c64a1a` mentioned in the subject line was causing a kernel BUG() on snapshot deletion. The patch was wrong. It did not handle cached roots correctly. The check for root_refs == 0 was removed everywhere where btrfs_read_fs_root_no_name() had been used to retrieve the root, because this check was already dealt with in btrfs_read_fs_root_no_name(). But in the case when the root was found in the cache, there was no such check. This patch adds the missing check in the case where the root is found in the cache. Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:29 -04:00
Stefan Behrens	9d565ba433	Btrfs: get rid of one BUG() in write_all_supers() The second round uses btrfs_error() and return -EIO, the first round can handle write errors the same way. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:28 -04:00
Wang Shilong	b9e9a6cbc6	Btrfs: allocate prelim_ref with a slab allocater struct __prelim_ref is allocated and freed frequently when walking backref tree, using slab allocater can not only speed up allocating but also detect memory leaks. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:27 -04:00
Wang Shilong	742916b885	Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC Currently, only add_delayed_refs have to allocate with GFP_ATOMIC, So just pass arg 'gfp_t' to decide which allocation mode. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:26 -04:00
Filipe David Borba Manana	f717175024	Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl The handler for the ioctl BTRFS_IOC_FS_INFO was reading the number of devices before acquiring the device list mutex. This could lead to inconsistent results because the update of the device list and the number of devices counter (amongst other counters related to the device list) are updated in volumes.c while holding the device list mutex - except for 2 places, one was volumes.c:btrfs_prepare_sprout() and the other was volumes.c:device_list_add(). For example, if we have 2 devices, with IDs 1 and 2 and then add a new device, with ID 3, and while adding the device is in progress an BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of devices of 2 and a max dev id of 3. This would be incorrect. Also, this ioctl handler was reading the fsid while it can be updated concurrently. This can happen when while a new device is being added and the current filesystem is in seeding mode. Example: $ mkfs.btrfs -f /dev/sdb1 $ mkfs.btrfs -f /dev/sdb2 $ btrfstune -S 1 /dev/sdb1 $ mount /dev/sdb1 /mnt/test $ btrfs device add /dev/sdb2 /mnt/test If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it could read an fsid that was never valid (some bits part of the old fsid and others part of the new fsid). Also, it could read a number of devices that doesn't match the number of devices in the list and the max device id, as explained before. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:25 -04:00
Filipe David Borba Manana	d730680184	Btrfs: fix race between removing a dev and writing sbs This change fixes an issue when removing a device and writing all super blocks run simultaneously. Here's the steps necessary for the issue to happen: 1) disk-io.c:write_all_supers() gets a number of N devices from the super_copy, so it will not panic if it fails to write super blocks for N - 1 devices; 2) Then it tries to acquire the device_list_mutex, but blocks because volumes.c:btrfs_rm_device() got it first; 3) btrfs_rm_device() removes the device from the list, then unlocks the mutex and after the unlock it updates the number of devices in super_copy to N - 1. 4) write_all_supers() finally acquires the mutex, iterates over all the devices in the list and gets N - 1 errors, that is, it failed to write super blocks to all the devices; 5) Because write_all_supers() thinks there are a total of N devices, it considers N - 1 errors to be ok, and therefore won't panic. So this change just makes sure that write_all_supers() reads the number of devices from super_copy after it acquires the device_list_mutex. Conversely, it changes btrfs_rm_device() to update the number of devices in super_copy before it releases the device list mutex. The code path to add a new device (volumes.c:btrfs_init_new_device), already has the right behaviour: it updates the number of devices in super_copy while holding the device_list_mutex. The only code path that doesn't lock the device list mutex before updating the number of devices in the super copy is disk-io.c:next_root_backup(), called by open_ctree() during mount time where concurrency issues can't happen. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:24 -04:00
Josef Bacik	b8d0c69b94	Btrfs: remove ourselves from the cluster list under lock A user was reporting weird warnings from btrfs_put_delayed_ref() and I noticed that we were doing this list_del_init() on our head ref outside of delayed_refs->lock. This is a problem if we have people still on the list, we could end up modifying old pointers and such. Fix this by removing us from the list before we do our run_delayed_ref on our head ref. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:23 -04:00
Josef Bacik	e8e7cff667	Btrfs: do not clear our orphan item runtime flag on eexist We were unconditionally clearing our runtime flag on the inode on error when trying to insert an orphan item. This is wrong in the case of -EEXIST since we obviously have an orphan item. This was causing us to not do the correct cleanup of our orphan items which caused issues on cleanup. This happens because currently when truncate fails we just leave the orphan item on there so it can be cleaned up, so if we go to remove the file later we will hit this issue. What we do for truncate isn't right either, but we shouldn't screw this sort of thing up on error either, so fix this and then I'll fix truncate in a different patch. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:22 -04:00
Josef Bacik	57cfd46270	Btrfs: fix send to deal with sparse files properly Send was just sending everything it found, even if the extent was a hole. This is unpleasant for users, so just skip holes when we are sending. This will also skip sending prealloc extents since the send spec doesn't have a prealloc command. Eventually we will add a prealloc command and rev the send version so we can send down the prealloc info. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:21 -04:00
Filipe David Borba Manana	bdab49d760	Btrfs: fix printing of non NULL terminated string The name buffer is not terminated by a '\0' character, therefore it needs to be printed with %.*s and use the length of the buffer. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:20 -04:00
Geert Uytterhoeven	8d78eb1663	Btrfs: Use %z to format size_t Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:19 -04:00
Geert Uytterhoeven	fce29364e5	Btrfs: Do not truncate sector_t on 32-bit with CONFIG_LBDAF=y sector_t may be either "u64" (always 64 bit) or "unsigned long" (32 or 64 bit). Casting it to "unsigned long" will truncate it on 32-bit platforms where CONFIG_LBDAF=y. Cast to "unsigned long long" and format using "ll" instead. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:18 -04:00
Geert Uytterhoeven	778746b53b	Btrfs: PAGE_CACHE_SIZE is already unsigned long PAGE_CACHE_SIZE == PAGE_SIZE is "unsigned long" everywhere, so there's no need to cast it to "unsigned long". Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:17 -04:00
Geert Uytterhoeven	b308bc2f05	Btrfs: Make btrfs_header_chunk_tree_uuid() return unsigned long Internally, btrfs_header_chunk_tree_uuid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:16 -04:00
Geert Uytterhoeven	fba6aa7565	Btrfs: Make btrfs_header_fsid() return unsigned long Internally, btrfs_header_fsid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:15 -04:00
Geert Uytterhoeven	231e88f410	Btrfs: Make btrfs_dev_extent_chunk_tree_uuid() return unsigned long Internally, btrfs_dev_extent_chunk_tree_uuid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:14 -04:00
Geert Uytterhoeven	1473b24ee0	Btrfs: Make btrfs_device_fsid() return unsigned long All callers of btrfs_device_fsid() cast its return type to unsigned long. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:13 -04:00
Geert Uytterhoeven	410ba3a291	Btrfs: Make btrfs_device_uuid() return unsigned long All callers of btrfs_device_uuid() cast its return type to unsigned long. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:12 -04:00
Geert Uytterhoeven	118a0a2514	Btrfs: Format mirror_num as int mirror_num is always "int", hence don't cast it to "unsigned long long" and format it as a 64-bit number. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:11 -04:00
Geert Uytterhoeven	27f9f02357	Btrfs: Format PAGE_SIZE as unsigned long PAGE_SIZE is "unsigned long" everywhere, so there's no need to cast it to "unsigned long long" and format it as a 64-bit number. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:10 -04:00
Geert Uytterhoeven	6e71c47afe	Btrfs: Make BTRFS_DEV_REPLACE_DEVID an unsigned long long constant The internal btrfs device id is a u64, hence make the constant BTRFS_DEV_REPLACE_DEVID "unsigned long long" as well, so we no longer need a cast to print it. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:09 -04:00
Geert Uytterhoeven	c1c9ff7c94	Btrfs: Remove superfluous casts from u64 to unsigned long long u64 is "unsigned long long" on all architectures now, so there's no need to cast it when formatting it using the "ll" length modifier. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:08 -04:00
Filipe David Borba Manana	1cb048f596	Btrfs: fix memory leak of orphan block rsv This issue is simple to reproduce and observe if kmemleak is enabled. Two simple ways to reproduce it: 1 $ mkfs.btrfs -f /dev/loop0 $ mount /dev/loop0 /mnt/btrfs $ btrfs balance start /mnt/btrfs $ umount /mnt/btrfs 2 $ mkfs.btrfs -f /dev/loop0 $ mount /dev/loop0 /mnt/btrfs $ touch /mnt/btrfs/foobar $ rm -f /mnt/btrfs/foobar $ umount /mnt/btrfs After a while, kmemleak reports the leak: $ cat /sys/kernel/debug/kmemleak unreferenced object 0xffff880402b13e00 (size 128): comm "btrfs", pid 19621, jiffies 4341648183 (age 70057.844s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 fc c6 b1 04 88 ff ff 04 00 04 00 ad 4e ad de .............N.. backtrace: [<ffffffff817275a6>] kmemleak_alloc+0x26/0x50 [<ffffffff8117832b>] kmem_cache_alloc_trace+0xeb/0x1d0 [<ffffffffa04db499>] btrfs_alloc_block_rsv+0x39/0x70 [btrfs] [<ffffffffa04f8bad>] btrfs_orphan_add+0x13d/0x1b0 [btrfs] [<ffffffffa04e2b13>] btrfs_remove_block_group+0x143/0x500 [btrfs] [<ffffffffa0518158>] btrfs_relocate_chunk.isra.63+0x618/0x790 [btrfs] [<ffffffffa051bc27>] btrfs_balance+0x8f7/0xe90 [btrfs] [<ffffffffa05240a0>] btrfs_ioctl_balance+0x250/0x550 [btrfs] [<ffffffffa05269ca>] btrfs_ioctl+0xdfa/0x25f0 [btrfs] [<ffffffff8119c936>] do_vfs_ioctl+0x96/0x570 [<ffffffff8119cea1>] SyS_ioctl+0x91/0xb0 [<ffffffff81750242>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff This affects btrfs-next, revision be8e3cd00d7293dd177e3f8a4a1645ce09ca3acb (Btrfs: separate out tests into their own directory). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:07 -04:00
Ilya Dryomov	a1e8780a89	Btrfs: rollback btrfs_device fields on umount It turns out we don't properly rollback in-core btrfs_device state on umount. We zero out ->bdev, ->in_fs_metadata and that's about it. In particular, we don't zero out ->generation, and this can lead to us refusing a mount -- a non-NULL fs_devices->latest_bdev is essential, but btrfs_close_extra_devices will happily assign NULL to ->latest_bdev if the first device on the dev_list happens to be missing and consequently has no bdev attached. This happens because since commit `a6b0d5c8` btrfs_close_extra_devices adjusts ->latest_bdev, and in doing that, relies on the ->generation. Fix this, and possibly other problems, by zeroing out everything except for what device_list_add sets, so that a mount right after insmod and 'btrfs dev scan' is no different from any later mount in this respect. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:06 -04:00
Ilya Dryomov	2208a378f3	Btrfs: add alloc_fs_devices and switch to it In the spirit of btrfs_alloc_device, add a helper for allocating and doing some common initialization of btrfs_fs_devices struct. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:05 -04:00
Ilya Dryomov	12bd2fc0d2	Btrfs: add btrfs_alloc_device and switch to it Currently btrfs_device is allocated ad-hoc in a few different places, and as a result not all fields are initialized properly. In particular, readahead state is only initialized in device_list_add (at scan time), and not in btrfs_init_new_device (when the new device is added with 'btrfs dev add'). Fix this by adding an allocation helper and switch everybody but __btrfs_close_devices to it. (__btrfs_close_devices is dealt with in a later commit.) Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:04 -04:00
Ilya Dryomov	53f10659f9	Btrfs: find_next_devid: root -> fs_info find_next_devid() knows which root to search, so it should take an fs_info instead of an arbitrary root. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:03 -04:00
Stefan Behrens	bbb651e469	Btrfs: don't allow the replace procedure on read only filesystems If you start the replace procedure on a read only filesystem, at the end the procedure fails to write the updated dev_items to the chunk tree. The problem is that this error is not indicated except for a WARN_ON(). If the user now thinks that everything was done as expected and destroys the source device (with mkfs or with a hammer). The next mount fails with "failed to read chunk root" and the filesystem is gone. This commit adds code to fail the attempt to start the replace procedure if the filesystem is mounted read-only. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Cc: <stable@vger.kernel.org> # 3.10+ Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:02 -04:00
Filipe David Borba Manana	633085c79c	Btrfs: reset force_compress on btrfs_file_defrag failure After we set force_compress with a new value (which was not being done while holding the inode mutex), if an error happens and we jump to the label out_ra, the force_compress property of the inode is not set to BTRFS_COMPRESS_NONE (unlike in the case where no errors happen). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:00 -04:00
Stefan Behrens	f420ee1e92	Btrfs: add mount option to force UUID tree checking This should never be needed, but since all functions are there to check and rebuild the UUID tree, a mount option is added that allows to force this check and rebuild procedure. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:59 -04:00
Stefan Behrens	70f8017547	Btrfs: check UUID tree during mount if required If the filesystem was mounted with an old kernel that was not aware of the UUID tree, this is detected by looking at the uuid_tree_generation field of the superblock (similar to how the free space cache is doing it). If a mismatch is detected at mount time, a thread is started that does two things: 1. Iterate through the UUID tree, check each entry, delete those entries that are not valid anymore (i.e., the subvol does not exist anymore or the value changed). 2. Iterate through the root tree, for each found subvolume, add the UUID tree entries for the subvolume (if they are not already there). This mechanism is also used to handle and repair errors that happened during the initial creation and filling of the tree. The update of the uuid_tree_generation field (which indicates that the state of the UUID tree is up to date) is blocked until all create and repair operations are successfully completed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:58 -04:00
Stefan Behrens	26432799c9	Btrfs: introduce uuid-tree-gen field In order to be able to detect the case that a filesystem is mounted with an old kernel, add a uuid-tree-gen field like the free space cache is doing it. It is part of the super block and written with each commit. Old kernels do not know this field and don't update it. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:57 -04:00
Stefan Behrens	803b2f54fb	Btrfs: fill UUID tree initially When the UUID tree is initially created, a task is spawned that walks through the root tree. For each found subvolume root_item, the uuid and received_uuid entries in the UUID tree are added. This is such a quick operation so that in case somebody wants to unmount the filesystem while the task is still running, the unmount is delayed until the UUID tree building task is finished. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:56 -04:00
Stefan Behrens	dd5f9615fc	Btrfs: maintain subvolume items in the UUID tree When a new subvolume or snapshot is created, a new UUID item is added to the UUID tree. Such items are removed when the subvolume is deleted. The ioctl to set the received subvolume UUID is also touched and will now also add this received UUID into the UUID tree together with the ID of the subvolume. The latter is also done when read-only snapshots are created which inherit all the send/receive information from the parent subvolume. User mode programs use the BTRFS_IOC_TREE_SEARCH ioctl to search and read in the UUID tree. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:55 -04:00
Stefan Behrens	f7a81ea4cc	Btrfs: create UUID tree if required This tree is not created by mkfs.btrfs. Therefore when a filesystem is mounted writable and the UUID tree does not exist, this tree is created if required. The tree is also added to the fs_info structure and initialized, but this commit does not yet read or write UUID tree elements. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:54 -04:00
Stefan Behrens	8f8ae8e213	Btrfs: support printing UUID tree elements This commit adds support to print UUID tree elements to print-tree.c. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:53 -04:00
Stefan Behrens	07b30a49da	Btrfs: introduce a tree for items that map UUIDs to something Mapping UUIDs to subvolume IDs is an operation with a high effort today. Today, the algorithm even has quadratic effort (based on the number of existing subvolumes), which means, that it takes minutes to send/receive a single subvolume if 10,000 subvolumes exist. But even linear effort would be too much since it is a waste. And these data structures to allow mapping UUIDs to subvolume IDs are created every time a btrfs send/receive instance is started. It is much more efficient to maintain a searchable persistent data structure in the filesystem, one that is updated whenever a subvolume/snapshot is created and deleted, and when the received subvolume UUID is set by the btrfs-receive tool. Therefore kernel code is added with this commit that is able to maintain data structures in the filesystem that allow to quickly search for a given UUID and to retrieve data that is assigned to this UUID, like which subvolume ID is related to this UUID. This commit adds a new tree to hold UUID-to-data mapping items. The key of the items is the full UUID plus the key type BTRFS_UUID_KEY. Multiple data blocks can be stored for a given UUID, a type/length/ value scheme is used. Now follows the lengthy justification, why a new tree was added instead of using the existing root tree: The first approach was to not create another tree that holds UUID items. Instead, the items should just go into the top root tree. Unfortunately this confused the algorithm to assign the objectid of subvolumes and snapshots. The reason is that btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for the first created subvol or snapshot after mounting a filesystem, and this function simply searches for the largest used objectid in the root tree keys to pick the next objectid to assign. Of course, the UUID keys have always been the ones with the highest offset value, and the next assigned subvol ID was wastefully huge. To use any other existing tree did not look proper. To apply a workaround such as setting the objectid to zero in the UUID item key and to implement collision handling would either add limitations (in case of a btrfs_extend_item() approach to handle the collisions) or a lot of complexity and source code (in case a key would be looked up that is free of collisions). Adding new code that introduces limitations is not good, and adding code that is complex and lengthy for no good reason is also not good. That's the justification why a completely new tree was introduced. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:52 -04:00
Sergei Trofimovich	171170c1c5	btrfs: mark some local function as 'static' Cc: Josef Bacik <jbacik@fusionio.com> Cc: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:51 -04:00
Stefan Behrens	35a3621beb	Btrfs: get rid of sparse warnings make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__ I tried to filter out the warnings for which patches have already been sent to the mailing list, pending for inclusion in btrfs-next. All these changes should be obviously safe. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:50 -04:00
Filipe David Borba Manana	18674c6cc1	Btrfs: don't miss inode ref items in BTRFS_IOC_INO_LOOKUP If the inode ref key was not found and the current leaf slot was 0 (first item in the leaf) the code would always return -ENOENT. This was not correct because the desired inode ref item might be the last item in the previous leaf. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:49 -04:00
Filipe David Borba Manana	a696cf3529	Btrfs: add missing error code to BTRFS_IOC_INO_LOOKUP handler If the path doesn't fit in the input buffer, return ENAMETOOLONG instead of returning with a success code (0) and a partially filled and right justified buffer. Also removed useless buffer pointer check outside the while loop. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:48 -04:00
Wang Shilong	b006b2e4f9	Btrfs: remove reduplicate check when disabling quota We have checked 'quota_root' with qgroup_ioctl_lock held before,So here the check is reduplicate, remove it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:47 -04:00
Wang Shilong	e685da14af	Btrfs: move btrfs_free_qgroup_config() out of spin_lock and fix comments btrfs_free_qgroup_config() is not only called by open/close_ctree(),but also btrfs_disable_quota().And for btrfs_disable_quota(),we have set 'quota_root' to be null before calling btrfs_free_qgroup_config(),so it is safe to cleanup in-memory structures without lock held. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:46 -04:00
Wang Shilong	4082bd3d73	Btrfs: fix oops when writing dirty qgroups to disk When disabling quota, we should clear out list 'dirty_qgroups',otherwise, we will get oops if enabling quota again. Fix this by abstracting similar code from del_qgroup_rb(). Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:45 -04:00
Josef Bacik	ba5e8f2e2d	Btrfs: fix send issues related to inode number reuse If you are sending a snapshot and specifying a parent snapshot we will walk the trees and figure out where they differ and send the differences only. The way we check for differences are if the leaves aren't the same and if the keys are not the same within the leaves. So if neither leaf is the same (ie the leaf has been cow'ed from the parent snapshot) we walk each item in the send root and check it against the parent root. If the items match exactly then we don't do anything. This doesn't quite work for inode refs, since they will just have the name and the parent objectid. If you move the file from a directory and then remove that directory and re-create a directory with the same inode number as the old directory and then move that file back into that directory we will assume that nothing changed and you will get errors when you try to receive. In order to fix this we need to do extra checking to see if the inode ref really is the same or not. So do this by passing down BTRFS_COMPARE_TREE_SAME if the items match. Then if the key type is an inode ref we can do some extra checking, otherwise we just keep processing. The extra checking is to look up the generation of the directory in the parent volume and compare it to the generation of the send volume. If they match then they are the same directory and we are good to go. If they don't we have to add them to the changed refs list. This means we have to track the generation of the ref we're trying to lookup when we iterate all the refs for a particular inode. So in the case of looking for new refs we have to get the generation from the parent volume, and in the case of looking for deleted refs we have to get the generation from the send volume to compare with. There was also the issue of using a ulist to keep track of the directories we needed to check. Because we can get a deleted ref and a new ref for the same inode number the ulist won't work since it indexes based on the value. So instead just dup any directory ref we find and add it to a local list, and then process that list as normal and do away with using a ulist for this altogether. Before we would fail all of the tests in the far-progs that related to moving directories (test group 32). With this patch we now pass these tests, and all of the tests in the far-progs send testing suite. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:44 -04:00
Josef Bacik	dc11dd5d70	Btrfs: separate out tests into their own directory The plan is to have a bunch of unit tests that run when btrfs is loaded when you build with the appropriate config option. My ultimate goal is to have a test for every non-static function we have, but at first I'm going to focus on the things that cause us the most problems. To start out with this just adds a tests/ directory and moves the existing free space cache tests into that directory and sets up all of the infrastructure. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:38 -04:00
Josef Bacik	00361589d2	Btrfs: avoid starting a transaction in the write path I noticed while looking at a deadlock that we are always starting a transaction in cow_file_range(). This isn't really needed since we only need a transaction if we are doing an inline extent, or if the allocator needs to allocate a chunk. So push down all the transaction start stuff to be closer to where we actually need a transaction in all of these cases. This will hopefully reduce our write latency when we are committing often. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:05:05 -04:00
Josef Bacik	9ffba8cda9	Btrfs: fix heavy delalloc related deadlock I added a patch where we started taking the ordered operations mutex when we waited on ordered extents. We need this because we splice the list and process it, so if a flusher came in during this scenario it would think the list was empty and we'd usually get an early ENOSPC. The problem with this is that this lock is used in transaction committing. So we end up with something like this Transaction commit -> wait on writers Delalloc flusher -> run_ordered_operations (holds mutex) ->wait for filemap-flush to do its thing flush task -> cow_file_range ->wait on btrfs_join_transaction because we're commiting some other task -> commit_transaction because we notice trans->transaction->flush is set -> run_ordered_operations (hang on mutex) We need to disentangle the ordered operations flushing from the delalloc flushing, since they are separate things. This solves the deadlock issue I was seeing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:05:04 -04:00
Josef Bacik	4ef31a45a0	Btrfs: fix the error handling wrt orphan items There are several places where we BUG_ON() if we fail to remove the orphan items and such, which is not ok, so remove those and either abort or just carry on. This also fixes a problem where if we couldn't start a transaction we wouldn't actually remove the orphan item reserve for the inode. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:05:03 -04:00
Josef Bacik	175a2b871f	Btrfs: don't allow a subvol to be deleted if it is the default subovl Eric pointed out that btrfs will happily allow you to delete the default subvol. This is a problem obviously since the next time you go to mount the file system it will freak out because it can't find the root. Fix this by adding a check to see if our default subvol points to the subvol we are trying to delete, and if it does not allowing it to happen. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:05:02 -04:00
Josef Bacik	a05254143c	Btrfs: skip subvol entries when checking if we've created a dir already We have logic to see if we've already created a parent directory by check to see if an inode inside of that directory has a lower inode number than the one we are currently processing. The logic is that if there is a lower inode number then we would have had to made sure the directory was created at that previous point. The problem is that subvols inode numbers count from the lowest objectid in the root tree, which may be less than our current progress. So just skip if our dir item key is a root item. This fixes the original test and the xfstest version I made that added an extra subvol create. Thanks, Reported-by: Emil Karlson <jekarlson@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:05:01 -04:00
Mark Fasheh	416161db9b	btrfs: offline dedupe This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to de-duplicate a list of extents across a range of files. Internally, the ioctl re-uses code from the clone ioctl. This avoids rewriting a large chunk of extent handling code. Userspace passes in an array of file, offset pairs along with a length argument. The ioctl will then (for each dedupe) do a byte-by-byte comparison of the user data before deduping the extent. Status and number of bytes deduped are returned for each operation. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:05:00 -04:00
Mark Fasheh	4b384318a7	btrfs: Introduce extent_read_full_page_nolock() We want this for btrfs_extent_same. Basically readpage and friends do their own extent locking but for the purposes of dedupe, we want to have both files locked down across a set of readpage operations (so that we can compare data). Introduce this variant and a flag which can be set for extent_read_full_page() to indicate that we are already locked. Partial credit for this patch goes to Gabriel de Perthuis <g2p.code@gmail.com> as I have included a fix from him to the original patch which avoids a deadlock on compressed extents. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:59 -04:00
Mark Fasheh	32b7c687c5	btrfs_ioctl_clone: Move clone code into it's own function There's some 250+ lines here that are easily encapsulated into their own function. I don't change how anything works here, just create and document the new btrfs_clone() function from btrfs_ioctl_clone() code. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:58 -04:00
Mark Fasheh	77fe20dc62	btrfs: abtract out range locking in clone ioctl() The range locking in btrfs_ioctl_clone is trivially broken out into it's own function. This reduces the complexity of btrfs_ioctl_clone() by a small bit and makes that locking code available to future functions in fs/btrfs/ioctl.c Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:57 -04:00
Wang Shilong	a4fdb61e81	Btrfs: fix possible memory leak in find_parent_nodes() Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:56 -04:00
Filipe David Borba Manana	09fb99a696	Btrfs: return ENOSPC when target space is full In extent-tree.c:do_chunk_alloc(), early on we returned 0 (success) when the target space was full and when chunk allocation is needed. However, later on in that same function we return ENOSPC if btrfs_alloc_chunk() fails (and chunk allocation was needed) and set the space's full flag. This was inconsistent, as -ENOSPC should be returned if the space is full and a chunk allocation needs to performed. If the space is full but no chunk allocation is needed, just return 0 (success). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:55 -04:00
Filipe David Borba Manana	ada9af215c	Btrfs: don't ignore errors from btrfs_run_delayed_items tree-log.c was ignoring the return value from btrfs_run_delayed_items() in several places. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:54 -04:00
Filipe David Borba Manana	2bac325ea8	Btrfs: fix inode leak on kmalloc failure in tree-log.c In tree-log.c:replay_one_name(), if memory allocation for the name fails, ensure we iput the dir inode we got before before we return. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:53 -04:00
Liu Bo	116e0024c4	Btrfs: allow compressed extents to be merged during defragment The rule originally comes from nocow writing, but snapshot-aware defrag is a different case, the extent has been writen and we're not going to change the extent but add a reference on the data. So we're able to allow such compressed extents to be merged into one bigger extent if they're pointing to the same data. Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:52 -04:00
David Sterba	8b87dc17fb	btrfs: add mount option to set commit interval I'ts hardcoded to 30 seconds which is fine for most users. Higher values defer data being synced to permanent storage with obvious consequences when the system crashes. The upper bound is not forced, but a warning is printed if it's more than 300 seconds (5 minutes). Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:51 -04:00

... 3 4 5 6 7 ...

33525 Commits