Commit Graph

36874 Commits

Author SHA1 Message Date
J. Bruce Fields
5513a510fa nfsd4: fix corruption on setting an ACL.
As of 06f9cc12ca "nfsd4: don't create
unnecessary mask acl", any non-trivial ACL will be left with an
unitialized entry, and a trivial ACL may write one entry beyond what's
allocated.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-15 15:36:04 -04:00
Dave Chinner
2d6dcc6d7e Merge branch 'xfs-attr-cleanup' into for-next
Conflicts:
	fs/xfs/xfs_attr.c
2014-05-15 09:39:28 +10:00
Dave Chinner
ff14ee42a0 Merge branch 'xfs-misc-fixes-1-for-3.16' into for-next 2014-05-15 09:38:15 +10:00
Dave Chinner
b76769294b Merge branch 'xfs-free-inode-btree' into for-next 2014-05-15 09:37:44 +10:00
Dave Chinner
232c2f5c65 Merge branch 'xfs-filestreams-lookup' into for-next 2014-05-15 09:36:59 +10:00
Dave Chinner
fdd3a2ae2e Merge branch 'xfs-unused-args-cleanup' into for-next 2014-05-15 09:36:35 +10:00
Dave Chinner
ee4eec478b xfs: list_lru_init returns a negative error
And we don't invert it properly when initialising the dquot lru
list.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:23:24 +10:00
Dave Chinner
bc147822d5 xfs: negate xfs_icsb_init_counters error value
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:23:07 +10:00
Dave Chinner
45687642e4 xfs: negate mount workqueue init error value
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:22:53 +10:00
Dave Chinner
6670232b48 xfs: fix wrong err sign on xfs_set_acl()
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:22:37 +10:00
Dave Chinner
a5a14de22e xfs: fix wrong errno from xfs_initxattrs
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:22:21 +10:00
Dave Chinner
65149e3fab xfs: correct error sign on COLLAPSE_RANGE errors
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:22:07 +10:00
Dave Chinner
b38a134b22 xfs: xfs_commit_metadata returns wrong errno
Invert it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:21:52 +10:00
Dave Chinner
8ff1e6705a xfs: fix incorrect error sign in xfs_file_aio_read
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:21:37 +10:00
Dave Chinner
43ec1460a2 xfs: xfs_dir_fsync() returns positive errno
And it should be negative.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:21:11 +10:00
James Hogan
d71f290b4e metag: Reduce maximum stack size to 256MB
Specify the maximum stack size for arches where the stack grows upward
(parisc and metag) in asm/processor.h rather than hard coding in
fs/exec.c so that metag can specify a smaller value of 256MB rather than
1GB.

This fixes a BUG on metag if the RLIMIT_STACK hard limit is increased
beyond a safe value by root. E.g. when starting a process after running
"ulimit -H -s unlimited" it will then attempt to use a stack size of the
maximum 1GB which is far too big for metag's limited user virtual
address space (stack_top is usually 0x3ffff000):

BUG: failure at fs/exec.c:589/shift_arg_pages()!

Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: linux-parisc@vger.kernel.org
Cc: linux-metag@vger.kernel.org
Cc: John David Anglin <dave.anglin@bell.net>
Cc: stable@vger.kernel.org # only needed for >= v3.9 (arch/metag)
2014-05-15 00:00:35 +01:00
Fabian Frederick
afd54c6881 reiserfs: remove obsolete __constant_cpu_to_le32
__constant_cpu_to_le32 converted to cpu_to_le32

Cc: reiserfs-devel@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-14 23:57:46 +02:00
Benjamin Marzinski
24972557b1 GFS2: remove transaction glock
GFS2 has a transaction glock, which must be grabbed for every
transaction, whose purpose is to deal with freezing the filesystem.
Aside from this involving a large amount of locking, it is very easy to
make the current fsfreeze code hang on unfreezing.

This patch rewrites how gfs2 handles freezing the filesystem. The
transaction glock is removed. In it's place is a freeze glock, which is
cached (but not held) in a shared state by every node in the cluster
when the filesystem is mounted. This lock only needs to be grabbed on
freezing, and actions which need to be safe from freezing, like
recovery.

When a node wants to freeze the filesystem, it grabs this glock
exclusively.  When the freeze glock state changes on the nodes (either
from shared to unlocked, or shared to exclusive), the filesystem does a
special log flush.  gfs2_log_flush() does all the work for flushing out
the and shutting down the incore log, and then it tries to grab the
freeze glock in a shared state again.  Since the filesystem is stuck in
gfs2_log_flush, no new transaction can start, and nothing can be written
to disk. Unfreezing the filesytem simply involes dropping the freeze
glock, allowing gfs2_log_flush() to grab and then release the shared
lock, so it is cached for next time.

However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
shared lock on the filesystem root directory inode to check permissions.
If that glock has already been grabbed exclusively, fsfreeze will be
unable to get the shared lock and unfreeze the filesystem.

In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
on the filesystem root directory during the freeze, and hold it until it
unfreezes the filesystem.  The functions which need to grab a shared
lock in order to allow the unfreeze ioctl to be issued now use the lock
grabbed by the freeze code instead.

The freeze and unfreeze code take care to make sure that this shared
lock will not be dropped while another process is using it.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-05-14 10:04:34 +01:00
Jeff Mahoney
83a3a56936 reiserfs: balance_leaf refactor, split up balance_leaf_when_delete
Splut up balance_leaf_when_delete into:
balance_leaf_when_delete_del
balance_leaf_when_cut
balance_leaf_when_delete_left

Also reformat to adhere to CodingStyle.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-13 16:08:54 +02:00
Jeff Mahoney
441378c2bf reiserfs: balance_leaf refactor, format balance_leaf_finish_node
Split out balance_leaf_finish_node_dirent from balance_leaf_paste_finish_node.

Also reformat to adhere to CodingStyle.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-13 16:08:25 +02:00
Jeff Mahoney
b54b8c9184 reiserfs: balance_leaf refactor, format balance_leaf_new_nodes_paste
Break up balance_leaf_paste_new_nodes into:
balance_leaf_paste_new_nodes_shift
balance_leaf_paste_new_nodes_shift_dirent
balance_leaf_paste_new_nodes_whole

and keep balance_leaf_paste_new_nodes as a handler to select which
is appropriate.

Also reformat to adhere to CodingStyle.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-13 16:06:35 +02:00
Tejun Heo
555724a831 kernfs, sysfs, cgroup: restrict extra perm check on open to sysfs
The kernfs open method - kernfs_fop_open() - inherited extra
permission checks from sysfs.  While the vfs layer allows ignoring the
read/write permissions checks if the issuer has CAP_DAC_OVERRIDE,
sysfs explicitly denied open regardless of the cap if the file doesn't
have any of the UGO perms of the requested access or doesn't implement
the requested operation.  It can be debated whether this was a good
idea or not but the behavior is too subtle and dangerous to change at
this point.

After cgroup got converted to kernfs, this extra perm check also got
applied to cgroup breaking libcgroup which opens write-only files with
O_RDWR as root.  This patch gates the extra open permission check with
a new flag KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK and enables it for sysfs.
For sysfs, nothing changes.  For cgroup, root now can perform any
operation regardless of the permissions as it was before kernfs
conversion.  Note that kernfs still fails unimplemented operations
with -EINVAL.

While at it, add comments explaining KERNFS_ROOT flags.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrey Wagin <avagin@gmail.com>
Tested-by: Andrey Wagin <avagin@gmail.com>
Cc: Li Zefan <lizefan@huawei.com>
References: http://lkml.kernel.org/g/CANaxB-xUm3rJ-Cbp72q-rQJO5mZe1qK6qXsQM=vh0U8upJ44+A@mail.gmail.com
Fixes: 2bd59d48eb ("cgroup: convert to kernfs")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-13 13:21:40 +02:00
hujianyang
0da846f42f UBIFS: Remove unused variables in ubifs_budget_space
I found two variables in ubifs_budget_space declared but not
use. This state remains since the first commit 1e5176. So just
remove them.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-05-13 13:45:16 +03:00
hujianyang
691a7c6f28 UBIFS: fix an mmap and fsync race condition
There is a race condition in UBIFS:

Thread A (mmap)                        Thread B (fsync)

->__do_fault                           ->write_cache_pages
   -> ubifs_vm_page_mkwrite
       -> budget_space
       -> lock_page
       -> release/convert_page_budget
       -> SetPagePrivate
       -> TestSetPageDirty
       -> unlock_page
                                       -> lock_page
                                           -> TestClearPageDirty
                                           -> ubifs_writepage
                                               -> do_writepage
                                                   -> release_budget
                                                   -> ClearPagePrivate
                                                   -> unlock_page
   -> !(ret & VM_FAULT_LOCKED)
   -> lock_page
   -> set_page_dirty
       -> ubifs_set_page_dirty
           -> TestSetPageDirty (set page dirty without budgeting)
   -> unlock_page

This leads to situation where we have a diry page but no budget allocated for
this page, so further write-back may fail with -ENOSPC.

In this fix we return from page_mkwrite without performing unlock_page. We
return VM_FAULT_LOCKED instead. After doing this, the race above will not
happen.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Tested-by: Laurence Withers <lwithers@guralp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-05-13 13:45:15 +03:00
Christoph Hellwig
6c888af0b4 xfs: pass struct da_args to xfs_attr_calc_size
And remove a very confused comment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-13 16:40:19 +10:00
Christoph Hellwig
67fd718f30 xfs: simplify attr name setup
Replace xfs_attr_name_to_xname with a new xfs_attr_args_init helper that
sets up the basic da_args structure without using a temporary xfs_name
structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-13 16:34:43 +10:00
Christoph Hellwig
1bc426a76b xfs: fold xfs_attr_remove_int into xfs_attr_remove
Also remove a useless ilock roundtrip for the first attr fork check, it's
racy anyway and we redo it later under the ilock before we start the removal.

Plus various minor style fixes to the new xfs_attr_remove.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-13 16:34:33 +10:00
Christoph Hellwig
b87d022c27 xfs: fold xfs_attr_get_int into xfs_attr_get
This allows doing an unlocked check if an attr for is present at all and
slightly reduce the lock hold time if we actually do an attr get.

Plus various minor style fixes to the new xfs_attr_get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-13 16:34:24 +10:00
Christoph Hellwig
c5b4ac39a4 xfs: fold xfs_attr_set_int into xfs_attr_set
Plus various minor style fixes to the new xfs_attr_set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-13 16:34:14 +10:00
Linus Torvalds
14186fea0c File locking related changes for v3.15 (pile #4)
- fix for regression in handling of F_GETLK commands
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTcRqbAAoJEAAOaEEZVoIVu4wQALShwSEh77qTKLA7rhJC0AHT
 AF/P5u7MW8/r6VDkweproB2jEllBKWaOsehhHB+GGpSDYXEyaPVHn8YjfMxP9aI7
 ivOCtR0dxJYTAlU+n3WH2fjLnWtyu8/eKFAUpHqFrU0gZnUUtiWSRZwPeKKSggPD
 WXbOCMIZKHSM5ccyJx93wYxLLto1gLRRGU7jKOamQhkdYfLD+W0N26vpz8JfMqL4
 RshdqbHrjHzRKGrerWAHun/jjOcjErb6qBjSEpJSHYad1KuMFm4ViOR0QE9LRJnA
 4tt1HovYmeZMtsv8f/sQzArbujyZYZzsosbod22pUG1ms3OYnrOXf9yp3gaj1ocf
 lRQ9XinnKVkzVi8tdJNchTxx123v8rV1dmU1qC4o/ivR8lhZi+MzdYj66zJi+hlw
 11VYcQf+ZfwJxQtOH1lDXMX060QAMTQxDa5v/Cedk2I1oIdG7iYYXW/suyBzwZQs
 BlBB/eZGxC4iUELu7WN5AWoDei3ubSJTCR6u3KUXrj5CBtQyf2ZUl7nQl6eaqjnb
 LXETL8YjhZSZXRN8O4B/nTwMnIgVSbgXgpmuSL8l0BfuGTD+YHd+Kopqw2BJp6ic
 bABwn8NX0Mhaiajd1Qx2QDl4gt23aQM4EUxXnuNk8FZ1f7z9Nc+byetYgOrWvIkn
 mAACD3ZvWF+w2IwDp9wZ
 =Pr3z
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.15-4' of git://git.samba.org/jlayton/linux

Pull file locking fix from Jeff Layton:
 "Fix for regression in handling of F_GETLK commands"

* tag 'locks-v3.15-4' of git://git.samba.org/jlayton/linux:
  locks: only validate the lock vs. f_mode in F_SETLK codepaths
2014-05-13 11:33:09 +09:00
Lukas Czerner
0baaea6400 ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()
In ext4_es_can_be_merged() when checking whether we can merge two
extents we should use EXT_MAX_BLOCKS instead of defining it manually.
Also if it is really the case we should notify userspace because clearly
there is a bug in extent status tree implementation since this should
never happen.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2014-05-12 22:21:43 -04:00
Linus Torvalds
19726630c6 Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fix from Steve French:
 "Small cifs fix for metadata caching"

* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix actimeo=0 corner case when cifs_i->time == jiffies
2014-05-13 11:19:32 +09:00
liang xie
5d60125530 ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
Make them more consistently

Signed-off-by: xieliang <xieliang@xiaomi.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 22:06:43 -04:00
Lukas Czerner
c8b459f492 ext4: remove unnecessary double parentheses
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 12:55:07 -04:00
Andrey Tsyvarev
029b10c5a8 ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails
Caches from 'ext4_groupinfo_caches' may be in use by other mounts,
which have already existed.  So, it is incorrect to destroy them when
newly requested mount fails.

Found by Linux File System Verification project (linuxtesting.org).

Signed-off-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-05-12 12:34:21 -04:00
Stephen Hemminger
c197855ea1 ext4: make local functions static
I have been running make namespacecheck to look for unneeded globals, and
found these in ext4.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 10:50:23 -04:00
Darrick J. Wong
e674e5cbd0 ext4: fix block bitmap validation when bigalloc, ^flex_bg
On a bigalloc,^flex_bg filesystem, the ext4_valid_block_bitmap
function fails to convert from blocks to clusters when spot-checking
the validity of the bitmap block that we've just read from disk.  This
causes ext4 to think that the bitmap is garbage, which results in the
block group being taken offline when it's not necessary.  Add in the
necessary EXT4_B2C() calls to perform the conversions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 10:17:55 -04:00
Darrick J. Wong
1beeef1b56 ext4: fix block bitmap initialization under sparse_super2
The ext4_bg_has_super() function doesn't know about the new rules for
where backup superblocks go on a sparse_super2 filesystem.  Therefore,
block bitmap initialization doesn't know that it shouldn't reserve
space for backups in groups that are never going to contain backups.
The result of this is e2fsck complaining about the block bitmap being
incorrect (fortunately not in a way that results in cross-linked
files), so fix the whole thing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 10:16:06 -04:00
Darrick J. Wong
bd63f6b0cd ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
On a filesystem with a 1k block size, the group descriptors live in
block 2, not block 1.  If the filesystem has bigalloc,meta_bg set,
however, the calculation of the group descriptor table location does
not take this into account and returns the wrong block number.  Fix
the calculation to return the correct value for this case.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 10:06:27 -04:00
Zhang Zhen
230b8c1a7b ext4: avoid unneeded lookup when xattr name is invalid
In ext4_xattr_set_handle() we have checked the xattr name's length. So
we should also check it in ext4_xattr_get() to avoid unneeded lookup
caused by invalid name.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 09:57:59 -04:00
Namjae Jeon
1c8349a171 ext4: fix data integrity sync in ordered mode
When we perform a data integrity sync we tag all the dirty pages with
PAGECACHE_TAG_TOWRITE at start of ext4_da_writepages.  Later we check
for this tag in write_cache_pages_da and creates a struct
mpage_da_data containing contiguously indexed pages tagged with this
tag and sync these pages with a call to mpage_da_map_and_submit.  This
process is done in while loop until all the PAGECACHE_TAG_TOWRITE
pages are synced. We also do journal start and stop in each iteration.
journal_stop could initiate journal commit which would call
ext4_writepage which in turn will call ext4_bio_write_page even for
delayed OR unwritten buffers. When ext4_bio_write_page is called for
such buffers, even though it does not sync them but it clears the
PAGECACHE_TAG_TOWRITE of the corresponding page and hence these pages
are also not synced by the currently running data integrity sync. We
will end up with dirty pages although sync is completed.

This could cause a potential data loss when the sync call is followed
by a truncate_pagecache call, which is exactly the case in
collapse_range.  (It will cause generic/127 failure in xfstests)

To avoid this issue, we can use set_page_writeback_keepwrite instead of
set_page_writeback, which doesn't clear TOWRITE tag.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-05-12 08:12:25 -04:00
Linus Torvalds
7e338c9991 Merge branch 'for-3.15' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields.

* 'for-3.15' of git://linux-nfs.org/~bfields/linux:
  NFSD: Call ->set_acl with a NULL ACL structure if no entries
  NFSd: call rpc_destroy_wait_queue() from free_client()
  NFSd: Move default initialisers from create_client() to alloc_client()
2014-05-11 18:06:13 +09:00
Jeff Layton
cf01f4eef9 locks: only validate the lock vs. f_mode in F_SETLK codepaths
v2: replace missing break in switch statement (as pointed out by Dave
    Jones)

commit bce7560d49 (locks: consolidate checks for compatible
filp->f_mode values in setlk handlers) introduced a regression in the
F_GETLK handler.

flock64_to_posix_lock is a shared codepath between F_GETLK and F_SETLK,
but the f_mode checks should only be applicable to the F_SETLK codepaths
according to POSIX.

Instead of just reverting the patch, add a new function to do this
checking and have the F_SETLK handlers call it.

Cc: Dave Jones <davej@redhat.com>
Reported-and-Tested-by: Reuben Farrelly <reuben@reub.net>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
2014-05-09 11:41:54 -04:00
Linus Torvalds
afcf0a2d92 Fixes for 3.15-rc5:
- fix a remote attribute size calculation bug that leads to a
   transaction overrun
 - add default ACLs to O_TMPFILE files
 - Remove the EXPERIMENTAL tag from filesystems with metadata CRC
   support
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJTa+0vAAoJEK3oKUf0dfodQGkQAI4IzVYr4K02Rj7UcbZaGlUM
 R9OAATojaQ8/hPDshzrhyLK7/KvF2tJvH9PhP6zj3bko3fmhofJ5/mB6LYIsQtt+
 3rNd/jij9Icmq4+ouZPRDl00nJdnCZjJcfYys6N/tXwLNIvwKP04vjB4QoC1rxVv
 j6L85yUkMpPohA0Wbf+PKrTVJDrtTOe+YpczciYgGKHr0YF27Bdy6iYSU3KvTvd+
 wuqXvGAc9ARZDsrVHt8t6eh9OKRRk1RAV5vdwGwucBrVlnxGaspvia/85JyU3Kv0
 F2EQ3fWcGQs5ydQjpvSZlEIttDqBDn/LiuncNctXIUvHpr+MQ73XMVrNLoNY1m6d
 wQqXFQXT4e/vzJTXyQz/jYgzGl5t9Lvf/1Z5lFHliqhaBm1aNMhdjfCZhEpehoaQ
 09JSVj8ZKLHZt3yRgwkZdOmM0bl4thJmY1Wf5O2EPMrk3NE3nZKiNG+W2U/sSFti
 i12M4uVgInmeHoDIWFNL9kXp3fs+gr6HF5BNQOulm0ywzG3U1ozWGyKsnRmpPFQr
 995voVKZKDP410wzp98UKpjXalmonYuTFLNUDEEjr2UKUWq6fRpvDdSeBSRirGxP
 kdwfpgCZHDJlZEY7d4lv4Pv6L84KgYYHQpmbaFcPEAmJmlMZ4web1KqHl8TDy1hT
 Z+STYvTImpXV9sP5TZYT
 =79c6
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.15-rc5' of git://oss.sgi.com/xfs/xfs

Pull xfs fixes from Dave Chinner:
 "The main fix is adding support for default ACLs on O_TMPFILE opened
  inodes to bring XFS into line with other filesystems.  Metadata CRCs
  are now also considered well enough tested to be fully supported, so
  we're removing the shouty warnings issued at mount time for
  filesystems with that format.  And there's transaction block
  reservation overrun fix.

  Summary:
   - fix a remote attribute size calculation bug that leads to a
     transaction overrun
   - add default ACLs to O_TMPFILE files
   - Remove the EXPERIMENTAL tag from filesystems with metadata CRC
     support"

* tag 'xfs-for-linus-3.15-rc5' of git://oss.sgi.com/xfs/xfs:
  xfs: remote attribute overwrite causes transaction overrun
  xfs: initialize default acls for ->tmpfile()
  xfs: fully support v5 format filesystems
2014-05-08 19:20:45 -07:00
Kinglong Mee
9fa1959e97 NFSD: Get rid of empty function nfs4_state_init
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-08 14:59:52 -04:00
Kinglong Mee
f3e41ec5ef NFSD: Use simple_read_from_buffer for coping data to userspace
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-08 14:59:52 -04:00
J. Bruce Fields
dd15073a26 Merge 3.15 bugfix for 3.16 2014-05-08 14:59:06 -04:00
Christoph Hellwig
5409e46f1b nfsd: clean up fh_auth usage
Use fh_fsid when reffering to the fsid part of the filehandle.  The
variable length auth field envisioned in nfsfh wasn't ever implemented.
Also clean up some lose ends around this and document the file handle
format better.

Btw, why do we even export nfsfh.h to userspace?  The file handle very
much is kernel private, and nothing in nfs-utils include the header
either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-08 12:43:03 -04:00
Kinglong Mee
ecc7455d8e NFSD: cleanup unneeded including linux/export.h
commit 4ac7249ea5 have remove all EXPORT_SYMBOL,
linux/export.h is not needed, just clean it.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-08 12:43:02 -04:00
Kinglong Mee
aa07c713ec NFSD: Call ->set_acl with a NULL ACL structure if no entries
After setting ACL for directory, I got two problems that caused
by the cached zero-length default posix acl.

This patch make sure nfsd4_set_nfs4_acl calls ->set_acl
with a NULL ACL structure if there are no entries.

Thanks for Christoph Hellwig's advice.

First problem:
............ hang ...........

Second problem:
[ 1610.167668] ------------[ cut here ]------------
[ 1610.168320] kernel BUG at /root/nfs/linux/fs/nfsd/nfs4acl.c:239!
[ 1610.168320] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 1610.168320] Modules linked in: nfsv4(OE) nfs(OE) nfsd(OE)
rpcsec_gss_krb5 fscache ip6t_rpfilter ip6t_REJECT cfg80211 xt_conntrack
rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
ip6table_mangle ip6table_security ip6table_raw ip6table_filter
ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
auth_rpcgss nfs_acl snd_intel8x0 ppdev lockd snd_ac97_codec ac97_bus
snd_pcm snd_timer e1000 pcspkr parport_pc snd parport serio_raw joydev
i2c_piix4 sunrpc(OE) microcode soundcore i2c_core ata_generic pata_acpi
[last unloaded: nfsd]
[ 1610.168320] CPU: 0 PID: 27397 Comm: nfsd Tainted: G           OE
3.15.0-rc1+ #15
[ 1610.168320] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
VirtualBox 12/01/2006
[ 1610.168320] task: ffff88005ab653d0 ti: ffff88005a944000 task.ti:
ffff88005a944000
[ 1610.168320] RIP: 0010:[<ffffffffa034d5ed>]  [<ffffffffa034d5ed>]
_posix_to_nfsv4_one+0x3cd/0x3d0 [nfsd]
[ 1610.168320] RSP: 0018:ffff88005a945b00  EFLAGS: 00010293
[ 1610.168320] RAX: 0000000000000001 RBX: ffff88006700bac0 RCX:
0000000000000000
[ 1610.168320] RDX: 0000000000000000 RSI: ffff880067c83f00 RDI:
ffff880068233300
[ 1610.168320] RBP: ffff88005a945b48 R08: ffffffff81c64830 R09:
0000000000000000
[ 1610.168320] R10: ffff88004ea85be0 R11: 000000000000f475 R12:
ffff880068233300
[ 1610.168320] R13: 0000000000000003 R14: 0000000000000002 R15:
ffff880068233300
[ 1610.168320] FS:  0000000000000000(0000) GS:ffff880077800000(0000)
knlGS:0000000000000000
[ 1610.168320] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1610.168320] CR2: 00007f5bcbd3b0b9 CR3: 0000000001c0f000 CR4:
00000000000006f0
[ 1610.168320] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 1610.168320] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 1610.168320] Stack:
[ 1610.168320]  ffffffff00000000 0000000b67c83500 000000076700bac0
0000000000000000
[ 1610.168320]  ffff88006700bac0 ffff880068233300 ffff88005a945c08
0000000000000002
[ 1610.168320]  0000000000000000 ffff88005a945b88 ffffffffa034e2d5
000000065a945b68
[ 1610.168320] Call Trace:
[ 1610.168320]  [<ffffffffa034e2d5>] nfsd4_get_nfs4_acl+0x95/0x150 [nfsd]
[ 1610.168320]  [<ffffffffa03400d6>] nfsd4_encode_fattr+0x646/0x1e70 [nfsd]
[ 1610.168320]  [<ffffffff816a6e6e>] ? kmemleak_alloc+0x4e/0xb0
[ 1610.168320]  [<ffffffffa0327962>] ?
nfsd_setuser_and_check_port+0x52/0x80 [nfsd]
[ 1610.168320]  [<ffffffff812cd4bb>] ? selinux_cred_prepare+0x1b/0x30
[ 1610.168320]  [<ffffffffa0341caa>] nfsd4_encode_getattr+0x5a/0x60 [nfsd]
[ 1610.168320]  [<ffffffffa0341e07>] nfsd4_encode_operation+0x67/0x110
[nfsd]
[ 1610.168320]  [<ffffffffa033844d>] nfsd4_proc_compound+0x21d/0x810 [nfsd]
[ 1610.168320]  [<ffffffffa0324d9b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[ 1610.168320]  [<ffffffffa00850cd>] svc_process_common+0x46d/0x6d0 [sunrpc]
[ 1610.168320]  [<ffffffffa0085433>] svc_process+0x103/0x170 [sunrpc]
[ 1610.168320]  [<ffffffffa032472f>] nfsd+0xbf/0x130 [nfsd]
[ 1610.168320]  [<ffffffffa0324670>] ? nfsd_destroy+0x80/0x80 [nfsd]
[ 1610.168320]  [<ffffffff810a5202>] kthread+0xd2/0xf0
[ 1610.168320]  [<ffffffff810a5130>] ? insert_kthread_work+0x40/0x40
[ 1610.168320]  [<ffffffff816c1ebc>] ret_from_fork+0x7c/0xb0
[ 1610.168320]  [<ffffffff810a5130>] ? insert_kthread_work+0x40/0x40
[ 1610.168320] Code: 78 02 e9 e7 fc ff ff 31 c0 31 d2 31 c9 66 89 45 ce
41 8b 04 24 66 89 55 d0 66 89 4d d2 48 8d 04 80 49 8d 5c 84 04 e9 37 fd
ff ff <0f> 0b 90 0f 1f 44 00 00 55 8b 56 08 c7 07 00 00 00 00 8b 46 0c
[ 1610.168320] RIP  [<ffffffffa034d5ed>] _posix_to_nfsv4_one+0x3cd/0x3d0
[nfsd]
[ 1610.168320]  RSP <ffff88005a945b00>
[ 1610.257313] ---[ end trace 838254e3e352285b ]---

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-08 12:42:21 -04:00
Chao Yu
70ff5dfeb6 f2fs: use inode_init_owner() to simplify codes
This patch uses exported inode_init_owner() to simplify codes in
f2fs_new_inode().

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-08 18:23:21 +09:00
Chao Yu
adf8d90b6a f2fs: avoid to use slab memory in f2fs_issue_flush for efficiency
If we use slab memory in f2fs_issue_flush(), we will face memory pressure and
latency time caused by racing of kmem_cache_{alloc,free}.

Let's alloc memory in stack instead of slab.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-08 18:23:21 +09:00
Jeff Mahoney
2369ecd8e8 reiserfs: balance_leaf refactor, format balance_leaf_paste_right
Break up balance_leaf_paste_right into:
balance_leaf_paste_right_shift
balance_leaf_paste_right_shift_dirent
balance_leaf_paste_right_whole

and keep balance_leaf_paste_right as a handler to select which is appropriate.

Also reformat to adhere to CodingStyle.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 19:23:04 +02:00
Jeff Mahoney
8dbf0d8c9b reiserfs: balance_leaf refactor, format balance_leaf_insert_right
Reformat balance_leaf_insert_right to adhere to CodingStyle.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 19:14:22 +02:00
Jeff Mahoney
3fade9377f reiserfs: balance_leaf refactor, format balance_leaf_paste_left
Break up balance_leaf_paste_left into:
balance_leaf_paste_left_shift
balance_leaf_paste_left_shift_dirent
balance_leaf_paste_left_whole

and keep balance_leaf_paste_left as a handler to select which is appropriate.

Also reformat to adhere to CodingStyle.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 19:13:34 +02:00
Jeff Mahoney
4bf4de6bc4 reiserfs: balance_leaf refactor, format balance_leaf_insert_left
Reformat balance_leaf_insert_left to adhere to CodingStyle.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 19:05:57 +02:00
Jeff Mahoney
0080e9f9d3 reiserfs: balance_leaf refactor, pull out balance_leaf{left, right, new_nodes, finish_node}
Break out the code that splits paste/insert for each phase.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 19:05:28 +02:00
Jeff Mahoney
7f447ba825 reiserfs: balance_leaf refactor, pull out balance_leaf_finish_node_paste
This patch factors out a new balance_leaf_finish_node_paste from the code
in balance_leaf responsible for pasting new content into existing items
held in S[0].

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 19:03:35 +02:00
Jeff Mahoney
8c480ea194 reiserfs: balance_leaf refactor pull out balance_leaf_finish_node_insert
This patch factors out a new balance_leaf_finish_node_insert from the code
in balance_leaf responsible for inserting new items into S[0]

It has not been reformatted yet.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 19:01:39 +02:00
Jeff Mahoney
9d496552b9 reiserfs: balance_leaf refactor, pull out balance_leaf_new_nodes_paste
This patch factors out a new balance_leaf_new_nodes_insert from the code
in balance_leaf responsible for pasting new content into existing items
that may have been shifted into new nodes in the tree.

It has not been reformatted yet.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 19:01:02 +02:00
Jeff Mahoney
65ab18cb73 reiserfs: balance_leaf refactor, pull out balance_leaf_new_nodes_insert
This patch factors out a new balance_leaf_new_nodes_insert from the code
in balance_leaf responsible for inserting new items into new nodes in
the tree.

It has not been reformatted yet.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 18:50:41 +02:00
Jeff Mahoney
3f0eb27655 reiserfs: balance_leaf refactor, pull out balance_leaf_paste_right
This patch factors out a new balance_leaf_paste_right from the code in
balance_leaf responsible for pasting new contents into an existing item
located in the node to the right of S[0] in the tree.

It has not been reformatted yet.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 18:47:53 +02:00
Jeff Mahoney
e80ef3d148 reiserfs: balance_leaf refactor, pull out balance_leaf_insert_right
This patch factors out a new balance_leaf_insert_right from the code in
balance_leaf responsible for inserting new items into the node to
the right of S[0] in the tree.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 18:34:44 +02:00
Jeff Mahoney
cf22df182b reiserfs: balance_leaf refactor, pull out balance_leaf_paste_left
This patch factors out a new balance_leaf_paste_left from the code in
balance_leaf responsible for pasting new content into an existing item
located in the node to the left of S[0] in the tree.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 18:34:13 +02:00
Jeff Mahoney
f1f007c308 reiserfs: balance_leaf refactor, pull out balance_leaf_insert_left
This patch factors out a new balance_leaf_insert_left from the code in
balance_leaf responsible for inserting new items into the node to
the left of S[0] in the tree.

It is not yet formatted correctly.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 18:33:17 +02:00
Jeff Mahoney
b49fb112d4 reiserfs: balance_leaf refactor, move state variables into tree_balance
This patch pushes the rest of the state variables in balance_leaf into
the tree_balance structure so we can use them when we split balance_leaf
into separate functions.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 18:31:06 +02:00
Jeff Mahoney
97fd4b97a9 reiserfs: balance_leaf refactor, reformat balance_leaf comments
The comments in balance_leaf are as bad as the code. This patch shifts
them around to fit in 80 columns and be easier to read.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 17:52:14 +02:00
Jeff Mahoney
c48138c227 reiserfs: cleanup, make hash detection saner
The hash detection code uses long ugly macros multiple times to get the same
value. This patch cleans it up to be easier to read.

[JK: Fixed up path leak in find_hash_out()]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-07 17:49:45 +02:00
Trond Myklebust
14bcab1a39 NFSd: Clean up nfs4_preprocess_stateid_op
Move the state locking and file descriptor reference out from the
callers and into nfs4_preprocess_stateid_op() itself.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-07 11:05:48 -04:00
Ingo Molnar
2fe5de9ce7 Merge branch 'sched/urgent' into sched/core, to avoid conflicts
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 13:15:46 +02:00
Chao Yu
c20e89cde6 f2fs: add a tracepoint for f2fs_read_data_page
This patch adds a tracepoint for f2fs_read_data_page to trace when page is
readed by user.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:59 +09:00
Chao Yu
e574843438 f2fs: add a tracepoint for f2fs_write_{meta,node,data}_pages
This patch adds a tracepoint for f2fs_write_{meta,node,data}_pages to trace when
pages are fsyncing/flushing.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:59 +09:00
Chao Yu
ecda0de343 f2fs: add a tracepoint for f2fs_write_{meta,node,data}_page
This patch adds a tracepoint for f2fs_write_{meta,node,data}_page to trace when
page is writting out.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:59 +09:00
Chao Yu
dfb2bf38bf f2fs: add a tracepoint for f2fs_write_end
This patch adds a tracepoint for f2fs_write_end to trace write op of user.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:59 +09:00
Chao Yu
62aed044ea f2fs: add a tracepoint for f2fs_write_begin
This patch adds a tracepoint for f2fs_write_begin to trace write op of user.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:59 +09:00
Zhang Zhen
8b376249e7 f2fs: fix checkpatch warning
fix the following checkpatch warning:
WARNING: do {} while (0) macros should not be semicolon terminated

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:59 +09:00
Jaegeuk Kim
8198899b94 f2fs: deactivate inode page if the inode is evicted
If the inode page is clean during its inode eviction, it'd better drop the page
to reduce further memory pressure.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:58 +09:00
Jaegeuk Kim
d5f66990bb f2fs: decrease the lock granularity during write_begin
This patch reduces the lock granularity during write_begin.
When the system is under memory pressure, it would be better to reduce
the locking time for the data pages.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:58 +09:00
Jaegeuk Kim
bde446866c f2fs: no need to wait on page writebck to meta pages
This patch removes grab_cache_page_write_begin for meta pages.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:58 +09:00
Jaegeuk Kim
9ac1349ad7 f2fs: avoid grab_cache_page_write_begin for data pages
We don't need to wait on page writeback for these cases.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:58 +09:00
Jaegeuk Kim
54b591dfda f2fs: split grab_cache_page and wait_on_page_writeback for node pages
This patch splits grab_cache_page_write_begin into grab_cache_page and
wait_on_page_writeback for node pages.

This patch intends to enhance the latency to get node pages by alleviating
unnecessary wait_on_page_writeback.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:58 +09:00
Chao Yu
8aa6f1c5bd f2fs: fix to truncate inline data in inode page when setattr
Previous we do not truncate inline data in inode page when setattr, so following
case could still read the inline data which has already truncated:

1.write inline data
2.ftruncate size to 0
3.ftruncate size to max inline data size
4.read from offset 0

This patch introduces truncate_inline_data() to fix this problem.

change log from v1:
 o fix a bug and do not truncate first page data after truncate inline data.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:58 +09:00
Chao Yu
817202d937 f2fs: readahead multi pages of directory for performance
We have no so such readahead mechanism in ->iterate() path as the one in
->read() path, it cause low performance when we read large directory.
This patch add readahead in f2fs_readdir() for better performance.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:57 +09:00
Chao Yu
5c1f9927ec f2fs: set errno when f2fs_iget failed in recover_dentry
We should set the error number correctly when we fail in recover_dentry(), so
the recover flow could stop for the reason as error number shows instead of
continuing.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:57 +09:00
Jaegeuk Kim
7f7670fe9f f2fs: consider fallocated space for SEEK_DATA
If an amount of data are allocated though fallocate and user writes a couple of
data among the space, f2fs should return the data offset made by user when
SEEK_DATA is requested.

For example, (N: NEW_ADDR by fallocate, X: NEW_ADDR by user)
1) fallocate 0 ~ 10MB
f -> N N N N N N N N N N N N ... N

2) write 4KB at 5MB offset
f -> N N N N N X N N N N N N ... N

3) SEEK_DATA from 0 should return 5MB offset

So, this patch adds a routine to search the first dirty page to handle that.
Then, the SEEK_DATA flow skips NEW_ADDR offsets until any dirty page is found.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:57 +09:00
Jaegeuk Kim
fe369bc8ba f2fs: return i_size if the hole is outside of i_size
When SEEK_HOLE is requeted, it should return i_size if the hole position is
found outside of i_size.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:57 +09:00
Chao Yu
267378d4de f2fs: introduce f2fs_seek_block to support SEEK_{DATA, HOLE} in llseek
In This patch we introduce f2fs_seek_block to support SEEK_{DATA,HOLE} of
lseek(2).

change log from v1:
 o fix bug when lseek from middle of page and fix wrong calculation of
PGOFS_OF_NEXT_DNODE macro.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:57 +09:00
Gu Zheng
2163d19815 f2fs: introduce help function {create,destroy}_flush_cmd_control
Introduce help function {create,destroy}_flush_cmd_control to clean up
the create/destory flush merge operation.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:57 +09:00
Gu Zheng
a688b9d9e5 f2fs: introduce struct flush_cmd_control to wrap the flush_merge fields
Split the flush_merge fields from sm_i, and use the new struct flush_cmd_control
to wrap it, so that we can igonre these fileds if flush_merge is disable, and
it alse can the structs more neat.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Chao Yu
6403eb1f64 f2fs: introduce help macro ADDRS_PER_PAGE()
Introduce help macro ADDRS_PER_PAGE() to get the number of address pointers in
direct node or inode.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Jaegeuk Kim
2aea39eca6 f2fs: submit bio at the reclaim path
If f2fs_write_data_page is called through the reclaim path, we should submit
the bio right away.

This patch resolves the following issue that Marc Dietrich reported.
"It took me a while to bisect a problem which causes my ARM (tegra2) netbook to
frequently stall for 5-10 seconds when I enable EXA acceleration (opentegra
experimental ddx)."
And this patch fixes that.

Reported-by: Marc Dietrich <marvin24@gmx.de>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Jaegeuk Kim
916decbf39 f2fs: return errors right after checking them
This patch adds two error conditions early in the setxattr operations.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Jaegeuk Kim
c02745ef68 f2fs: pass flags field to setxattr functions
This patch passes the "flags" field to the low level setxattr functions
to use XATTR_REPLACE in the following patches.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Jaegeuk Kim
e112326805 f2fs: clean up long variable names
This patch includes simple clean-ups to reduce unnecessary long variable names.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Chao Yu
454ae7e519 f2fs: handle inline data independently in f2fs_bmap
We'd better handle inline data case independently in f2fs_bmap().
It can reduce our handling time in f2fs_bmap().

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Jaegeuk Kim
6fb03f3a40 f2fs: adjust free mem size to flush dentry blocks
If so many dirty dentry blocks are cached, not reached to the flush condition,
we should fall into livelock in balance_dirty_pages.
So, let's consider the mem size for the condition.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:55 +09:00
Jaegeuk Kim
e8271fa390 f2fs: avoid BUG_ON when mouting corrupted image having garbage blocks
If the disk has some garbage blocks, F2FS is able to face with BUG_ON when
recovering direct node blocks.
This patch detects the error case and avoids that prior to reaching BUG_ON.

Alexey Khoroshilov addressed the potential security issues as follows.
"An ability to trigger a BUG_ON assert by mounting a crafted image is
usually considered as a local denial of service [1-3]. As far as I
understand, the reason is that some kernel data may become inconsistent
that can lead to further problems.

[1] http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-3353
[2] http://www.openwall.com/lists/oss-security/2011/06/24/4
[3] http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-2928
etc."

Reported-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:55 +09:00
Jaegeuk Kim
7ee0eeabcd f2fs: add available_nids to fix handling max_nid correctly
This patch introduces available_nids for alloc_nids() and fixes max_nid for
build_free_nids() and scan_nat_pages().

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:55 +09:00
Fabian Frederick
b49ad51e6d f2fs: add static to get_max_meta_blks
inline get_max_meta_blks is only used in checkpoint.c
Use standard static inline format.

Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:55 +09:00
Chao Yu
94dac22e72 f2fs: introduce raw_nat_from_node_info() to simplfy codes
This patch introduce raw_nat_from_node_info() to simplfy some codes, and also
use exist function node_info_from_raw_nat() to do the same job.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:55 +09:00
Gu Zheng
876dc59eb1 f2fs: add the flush_merge handle in the remount flow
Add the *remount* handle of flush_merge option, so that the users
can enable flush_merge in the runtime, such as the underlying device
handles the cache_flush command relatively slowly.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:55 +09:00
Zhang Zhen
8abfb36ab3 f2fs: atomically set inode->i_flags in f2fs_set_inode_flags()
Use set_mask_bits() to atomically set i_flags instead of clearing out the
S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
FS_IMMUTABLE_FL, FS_APPEND_FL, etc. flags, since this opens up a race
where an immutable file has the immutable flag cleared for a brief
window of time.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Jingoo Han
b156d54241 f2fs: make recover_inline_xattr() static
Make recover_inline_xattr() static, because this function is
used only in this file.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Jaegeuk Kim
ed57c27f73 f2fs: remove costly dirty_dir_inode operations
This patch removes list opeations in handling dirty dir inodes.
Previously, F2FS traverses whole the list of dirty dir inodes to check whether
there is an existing inode or not, resulting in heavy CPU overheads.

So this patch removes such the traverse operations by adding FI_DIRTY_DIR to
indicate the inode lies on the list or not.
Through this simple flag, we can remove redundant operations gracefully.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Jaegeuk Kim
15c6e3aae6 f2fs: fix to unlock f2fs_lock at the omitted error case
If it occurs an error, we should call f2fs_unlock_op.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Jaegeuk Kim
76f60268e7 f2fs: call redirty_page_for_writepage
This patch replace some general codes with redirty_page_for_writepage, which
can be enabled after consideration on additional procedure like counting dirty
pages appropriately.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Jaegeuk Kim
1e87a78d95 f2fs: avoid to conduct roll-forward due to the remained garbage blocks
The f2fs always scans the next chain of direct node blocks.
But some garbage blocks are able to be remained due to no discard support or
SSR triggers.
This occasionally wreaks recovering wrong inodes that were used or BUG_ONs
due to reallocating node ids as follows.

When mount this f2fs image:
http://linuxtesting.org/downloads/f2fs_fault_image.zip
BUG_ON is triggered in f2fs driver (messages below are generated on
kernel 3.13.2; for other kernels output is similar):

kernel BUG at fs/f2fs/node.c:215!
 Call Trace:
 [<ffffffffa032ebad>] recover_inode_page+0x1fd/0x3e0 [f2fs]
 [<ffffffff811446e7>] ? __lock_page+0x67/0x70
 [<ffffffff81089990>] ? autoremove_wake_function+0x50/0x50
 [<ffffffffa0337788>] recover_fsync_data+0x1398/0x15d0 [f2fs]
 [<ffffffff812b9e5c>] ? selinux_d_instantiate+0x1c/0x20
 [<ffffffff811cb20b>] ? d_instantiate+0x5b/0x80
 [<ffffffffa0321044>] f2fs_fill_super+0xb04/0xbf0 [f2fs]
 [<ffffffff811b861e>] ? mount_bdev+0x7e/0x210
 [<ffffffff811b8769>] mount_bdev+0x1c9/0x210
 [<ffffffffa0320540>] ? validate_superblock+0x210/0x210 [f2fs]
 [<ffffffffa031cf8d>] f2fs_mount+0x1d/0x30 [f2fs]
 [<ffffffff811b9497>] mount_fs+0x47/0x1c0
 [<ffffffff81166e00>] ? __alloc_percpu+0x10/0x20
 [<ffffffff811d4032>] vfs_kern_mount+0x72/0x110
 [<ffffffff811d6763>] do_mount+0x493/0x910
 [<ffffffff811615cb>] ? strndup_user+0x5b/0x80
 [<ffffffff811d6c70>] SyS_mount+0x90/0xe0
 [<ffffffff8166f8d9>] system_call_fastpath+0x16/0x1b

Found by Linux File System Verification project (linuxtesting.org).

Reported-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Gu Zheng
b270ad6f0a f2fs: enable flush_merge only in f2fs is not read-only
Enable flush_merge only in f2fs is not read-only, so does the mount
option show.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Gu Zheng
197d46476c f2fs: use __GFP_ZERO to avoid appending set-NULL
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:53 +09:00
Gu Zheng
a4ed23f2f1 f2fs: put the bio when issue_flush completed
Put the bio when the flush cmd issued, it also can fix the following
kmemleak:
unreferenced object 0xffff8800270c73c0 (size 200):
  comm "f2fs_flush-7:0", pid 27161, jiffies 4312127988 (age 988.503s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 40 07 81 19 01 88 ff ff  ........@.......
    01 00 00 00 00 00 00 f0 11 14 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81559866>] kmemleak_alloc+0x72/0x96
    [<ffffffff81156f7e>] slab_post_alloc_hook+0x28/0x2a
    [<ffffffff811595b1>] kmem_cache_alloc+0xec/0x157
    [<ffffffff8111924d>] mempool_alloc_slab+0x15/0x17
    [<ffffffff81119513>] mempool_alloc+0x71/0x138
    [<ffffffff81193548>] bio_alloc_bioset+0x93/0x18c
    [<ffffffffa040f857>] issue_flush_thread+0x8d/0x145 [f2fs]
    [<ffffffff8107ac16>] kthread+0xba/0xc2
    [<ffffffff81571b2c>] ret_from_fork+0x7c/0xb0
    [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:53 +09:00
Dave Chinner
8cfcc3e565 xfs: fix directory readahead offset off-by-one
Directory readahead can throw loud scary but harmless warnings
when multiblock directories are in use a specific pattern of
discontiguous blocks are found in the directory. That is, if a hole
follows a discontiguous block, it will throw a warning like:

XFS (dm-1): xfs_da_do_buf: bno 637 dir: inode 34363923462
XFS (dm-1): [00] br_startoff 637 br_startblock 1917954575 br_blockcount 1 br_state 0
XFS (dm-1): [01] br_startoff 638 br_startblock -2 br_blockcount 1 br_state 0

And dump a stack trace.

This is because the readahead offset increment loop does a double
increment of the block index - it does an increment for the loop
iteration as well as increase the loop counter by the number of
blocks in the extent. As a result, the readahead offset does not get
incremented correctly for discontiguous blocks and hence can ask for
readahead of a directory block from an offset part way through a
directory block.  If that directory block is followed by a hole, it
will trigger a mapping warning like the above.

The bad readahead will be ignored, though, because the main
directory block read loop uses the correct mapping offsets rather
than the readahead offset and so will ignore the bad readahead
altogether.

Fix the warning by ensuring that the readahead offset is correctly
incremented.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-07 08:05:52 +10:00
Dave Chinner
ac983517ec xfs: don't sleep in xlog_cil_force_lsn on shutdown
Reports of a shutdown hang when fsyncing a directory have surfaced,
such as this:

[ 3663.394472] Call Trace:
[ 3663.397199]  [<ffffffff815f1889>] schedule+0x29/0x70
[ 3663.402743]  [<ffffffffa01feda5>] xlog_cil_force_lsn+0x185/0x1a0 [xfs]
[ 3663.416249]  [<ffffffffa01fd3af>] _xfs_log_force_lsn+0x6f/0x2f0 [xfs]
[ 3663.429271]  [<ffffffffa01a339d>] xfs_dir_fsync+0x7d/0xe0 [xfs]
[ 3663.435873]  [<ffffffff811df8c5>] do_fsync+0x65/0xa0
[ 3663.441408]  [<ffffffff811dfbc0>] SyS_fsync+0x10/0x20
[ 3663.447043]  [<ffffffff815fc7d9>] system_call_fastpath+0x16/0x1b

If we trigger a shutdown in xlog_cil_push() from xlog_write(), we
will never wake waiters on the current push sequence number, so
anything waiting in xlog_cil_force_lsn() for that push sequence
number to come up will not get woken and hence stall the shutdown.

Fix this by ensuring we call wake_up_all(&cil->xc_commit_wait) in
the push abort handling, in the log shutdown code when waking all
waiters, and adding a shutdown check in the sequence completion wait
loops to ensure they abort when a wakeup due to a shutdown occurs.

Reported-by: Boris Ranto <branto@redhat.com>
Reported-by: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-07 08:05:50 +10:00
Dave Chinner
49abc3a8f8 xfs: truncate_setsize should be outside transactions
truncate_setsize() removes pages from the page cache, and hence
requires page locks to be held. It is not valid to lock a page cache
page inside a transaction context as we can hold page locks when we
we reserve space for a transaction. If we do, then we expose an ABBA
deadlock between log space reservation and page locks.

That is, both the write path and writeback lock a page, then start a
transaction for block allocation, which means they can block waiting
for a log reservation with the page lock held. If we hold a log
reservation and then do something that locks a page (e.g.
truncate_setsize in xfs_setattr_size) then that page lock can block
on the page locked and waiting for a log reservation. If the
transaction that is waiting for the page lock is the only active
transaction in the system that can free log space via a commit,
then writeback will never make progress and so log space will never
free up.

This issue with xfs_setattr_size() was introduced back in 2010 by
commit fa9b227 ("xfs: new truncate sequence") which moved the page
cache truncate from outside the transaction context (what was
xfs_itruncate_data()) to inside the transaction context as a call to
truncate_setsize().

The reason truncate_setsize() was located where in this place was
that we can't shouldn't change the file size until after we are in
the transaction context and the operation will either succeed or
shut down the filesystem on failure. However, block_truncate_page()
already modifies the file contents before we enter the transaction
context, so we can't really fulfill this guarantee in any way. Hence
we may as well ensure that on success or failure, the in-memory
inode and data is truncated away and that the application cleans up
the mess appropriately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-07 08:05:45 +10:00
Trond Myklebust
50cc62317d NFSd: Mark nfs4_free_lockowner and nfs4_free_openowner as static functions
They do not need to be used outside fs/nfsd/nfs4state.c

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-06 17:54:57 -04:00
Christoph Hellwig
6f226e2ab1 nfsd: remove <linux/nfsd/debug.h>
There is almost nothing left it in, just merge it into the only file
that includes it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-06 17:54:56 -04:00
Christoph Hellwig
7f94423e8f nfsd: move <linux/nfsd/stats.h> to fs/nfsd
There are no legitimate users outside of fs/nfsd, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-06 17:54:55 -04:00
Christoph Hellwig
d430e8d530 nfsd: move <linux/nfsd/export.h> to fs/nfsd
There are no legitimate users outside of fs/nfsd, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-06 17:54:54 -04:00
Christoph Hellwig
9c69de4c94 nfsd: remove <linux/nfsd/nfsfh.h>
The only real user of this header is fs/nfsd/nfsfh.h, so merge the
two.  Various lockѕ source files used it to indirectly get other
sunrpc or nfs headers, so fix those up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-06 17:54:53 -04:00
Trond Myklebust
4dd86e150f NFSd: Remove 'inline' designation for free_client()
It is large, it is used in more than one place, and it is not performance
critical. Let gcc figure out whether it should be inlined...

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-06 17:54:53 -04:00
Kees Cook
12dd7ecf23 lockd: avoid warning when CONFIG_SYSCTL undefined
When building without CONFIG_SYSCTL, the compiler saw an unused
label. This moves the label into the #ifdef it is used under.

fs/lockd/svc.c: In function ‘init_nlm’:
fs/lockd/svc.c:626:1: warning: label ‘err_sysctl’ defined but not used [-Wunused-label]

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-06 17:54:52 -04:00
Al Viro
62a8067a7f bio_vec-backed iov_iter
New variant of iov_iter - ITER_BVEC in iter->type, backed with
bio_vec array instead of iovec one.  Primitives taught to deal
with such beasts, __swap_write() switched to using that kind
of iov_iter.

Note that bio_vec is just a <page, offset, length> triple - there's
nothing block-specific about it.  I've left the definition where it
was, but took it from under ifdef CONFIG_BLOCK.

Next target: ->splice_write()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:45 -04:00
Al Viro
4908b822b3 ceph: switch to ->write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:43 -04:00
Al Viro
64c3131161 ceph_sync_direct_write: stop poking into iov_iter guts
all needed primitives are there...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:43 -04:00
Al Viro
2b777c9dd9 ceph_sync_read: stop poking into iov_iter guts
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:42 -04:00
Al Viro
f0d1bec9d5 new helper: copy_page_from_iter()
parallel to copy_page_to_iter().  pipe_write() switched to it (and became
->write_iter()).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:42 -04:00
Al Viro
84c3d55cc4 fuse: switch to ->write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:41 -04:00
Al Viro
b30ac0fc41 btrfs: switch to ->write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:41 -04:00
Al Viro
3ef045c3d8 ocfs2: switch to ->write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:40 -04:00
Al Viro
bf97f3bc0c xfs: switch to ->write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:40 -04:00
Al Viro
50b5551d17 afs: switch to ->write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:39 -04:00
Al Viro
da56e45b6e gfs2: switch to ->write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:39 -04:00
Al Viro
edaf436948 nfs: switch to ->write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:38 -04:00
Al Viro
f5674c31ee ubifs: switch to ->write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:38 -04:00
Al Viro
3dae8750c3 cifs: switch to ->write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:37 -04:00
Al Viro
d4637bc18f udf: switch to ->write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:36 -04:00
Al Viro
9b884164d5 convert ext4 to ->write_iter()
unfortunately, Ted's changes to ext4_file_write() are *still* an
incomplete fix - playing with rlimits can let you smuggle an
unaligned request past the checks.  So there almost certainly
will be more merge PITA around that place...

[fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:36 -04:00
Al Viro
a832475488 Merge ext4 changes in ext4_file_write() into for-next
From ext4.git#dev, needed for switch of ext4 to ->write_iter() ;-/
2014-05-06 17:38:41 -04:00
Al Viro
1456c0a87c blkdev_aio_write() - turn into blkdev_write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:38:01 -04:00
Al Viro
8174202b34 write_iter variants of {__,}generic_file_aio_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:38:00 -04:00
Al Viro
3644424dc6 ceph: switch to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:38:00 -04:00
Al Viro
3aa2d199f8 nfs: switch to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:37:59 -04:00
Al Viro
a886038baa fs/block_dev.c: switch to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:37:59 -04:00
Al Viro
fb9096a344 pipe: switch to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:37:58 -04:00
Al Viro
e6a7bcb4c4 cifs: switch to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:37:58 -04:00
Al Viro
37c20f16e7 fuse_file_aio_read(): convert to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:37:57 -04:00
Al Viro
3cd9ad5a30 ocfs2: switch to ->read_iter()
tracepoints are evil, exhibit #6969...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:37:57 -04:00
Al Viro
027978295d ecryptfs: switch to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:37:56 -04:00
Al Viro
b4f5d2c6d1 xfs: switch to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:37:56 -04:00
Al Viro
aad4f8bb42 switch simple generic_file_aio_read() users to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:37:55 -04:00
Al Viro
293bc9822f new methods: ->read_iter() and ->write_iter()
Beginning to introduce those.  Just the callers for now, and it's
clumsier than it'll eventually become; once we finish converting
aio_read and aio_write instances, the things will get nicer.

For now, these guys are in parallel to ->aio_read() and ->aio_write();
they take iocb and iov_iter, with everything in iov_iter already
validated.  File offset is passed in iocb->ki_pos, iov/nr_segs -
in iov_iter.

Main concerns in that series are stack footprint and ability to
split the damn thing cleanly.

[fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:36:00 -04:00
Al Viro
7f7f25e82d replace checking for ->read/->aio_read presence with check in ->f_mode
Since we are about to introduce new methods (read_iter/write_iter), the
tests in a bunch of places would have to grow inconveniently.  Check
once (at open() time) and store results in ->f_mode as FMODE_CAN_READ
and FMODE_CAN_WRITE resp.  It might end up being a temporary measure -
once everything switches from ->aio_{read,write} to ->{read,write}_iter
it might make sense to return to open-coded checks.  We'll see...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:55 -04:00
Al Viro
b318891929 xfs: trim the argument lists of xfs_file_{dio,buffered}_aio_write()
pos is redundant (it's iocb->ki_pos), and iov/nr_segs/count are taken
care of by lifting iov_iter into the caller.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:55 -04:00
Al Viro
3793846354 blkdev_aio_read(): switch to generic_file_read_iter(), get rid of iov_shorten()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:54 -04:00
Al Viro
0c949334a9 iov_iter_truncate()
Now It Can Be Done(tm) - we don't need to do iov_shorten() in
generic_file_direct_write() anymore, now that all ->direct_IO()
instances are converted to proper iov_iter methods and honour
iter->count and iter->iov_offset properly.

Get rid of count/ocount arguments of generic_file_direct_write(),
while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:54 -04:00
Al Viro
28060d5d9b btrfs: switch check_direct_IO() to iov_iter
... and don't open-code iov_iter_alignment() there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:53 -04:00
Al Viro
91f79c43d1 new helper: iov_iter_get_pages_alloc()
same as iov_iter_get_pages(), except that pages array is allocated
(kmalloc if possible, vmalloc if that fails) and left for caller to
free.  Lustre and NFS ->direct_IO() switched to it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:53 -04:00
Al Viro
f67da30c1d new helper: iov_iter_npages()
counts the pages covered by iov_iter, up to given limit.
do_block_direct_io() and fuse_iter_npages() switched to
it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:52 -04:00
Al Viro
5b46f25ddc f2fs: switch to iov_iter_alignment()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:52 -04:00
Al Viro
c9c37e2e63 fuse: switch to iov_iter_get_pages()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:51 -04:00
Al Viro
d22a943f44 fuse: pull iov_iter initializations up
... to fuse_direct_{read,write}().  ->direct_IO() path uses the
iov_iter passed by the caller instead.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:51 -04:00
Al Viro
7b2c99d155 new helper: iov_iter_get_pages()
iov_iter_get_pages(iter, pages, maxsize, &start) grabs references pinning
the pages of up to maxsize of (contiguous) data from iter.  Returns the
amount of memory grabbed or -error.  In case of success, the requested
area begins at offset start in pages[0] and runs through pages[1], etc.
Less than requested amount might be returned - either because the contiguous
area in the beginning of iterator is smaller than requested, or because
the kernel failed to pin that many pages.

direct-io.c switched to using iov_iter_get_pages()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:50 -04:00
Al Viro
3320c60b3a dio: take updating ->result into do_direct_IO()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:50 -04:00
Al Viro
71d8e532b1 start adding the tag to iov_iter
For now, just use the same thing we pass to ->direct_IO() - it's all
iovec-based at the moment.  Pass it explicitly to iov_iter_init() and
account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO()
uses.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:49 -04:00
Al Viro
ed978a811e new helper: generic_file_read_iter()
iov_iter-using variant of generic_file_aio_read().  Some callers
converted.  Note that it's still not quite there for use as ->read_iter() -
we depend on having zero iter->iov_offset in O_DIRECT case.  Fortunately,
that's true for all converted callers (and for generic_file_aio_read() itself).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:49 -04:00
Al Viro
23faa7b8db fuse_file_aio_write(): merge initializations of iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:48 -04:00
Al Viro
05bb2e0bc7 ceph_aio_read(): keep iov_iter across retries
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:48 -04:00
Al Viro
886a391150 new primitive: iov_iter_alignment()
returns the value aligned as badly as the worst remaining segment
in iov_iter is.  Use instead of open-coded equivalents.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:47 -04:00
Al Viro
31b140398c switch {__,}blockdev_direct_IO() to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:46 -04:00
Al Viro
a6cbcd4a4a get rid of pointless iov_length() in ->direct_IO()
all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:45 -04:00
Al Viro
16b1f05d7f ext4: switch the guts of ->direct_IO() to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:45 -04:00
Al Viro
619d30b4b8 convert the guts of nfs_direct_IO() to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:44 -04:00
Al Viro
d8d3d94b80 pass iov_iter to ->direct_IO()
unmodified, for now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:44 -04:00
Al Viro
cb66a7a1f1 kill generic_segment_checks()
all callers of ->aio_read() and ->aio_write() have iov/nr_segs already
checked - generic_segment_checks() done after that is just an odd way
to spell iov_length().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:43 -04:00
Al Viro
0ae5e4d370 __btrfs_direct_write(): switch to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:43 -04:00
Al Viro
f8579f8673 generic_file_direct_write(): switch to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:42 -04:00
Al Viro
e7c24607b5 kill iov_iter_copy_from_user()
all callers can use copy_page_from_iter() and it actually simplifies
them.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:42 -04:00
Al Viro
f6c0a1920e fs/file.c: don't open-code kvfree()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:31:10 -04:00
Jeff Mahoney
a228bf8f0a reiserfs: cleanup, remove unnecessary parens
The reiserfs code is littered with extra parens in places where the authors
may not have been certain about precedence of & vs ->. This patch cleans them
out.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-06 23:18:16 +02:00
Jeff Mahoney
cf776a7a4d reiserfs: cleanup, remove leading whitespace from labels
This patch moves reiserfs closer to adhering to the style rules by
removing leading whitespace from labels.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-06 23:18:09 +02:00
Jeff Mahoney
16da167c16 reiserfs: cleanup, remove unnecessary parens in dirent creation
make_empty_dir_item_v1 and make_empty_dir_item also needed a bit of cleanup
but it's clearer to use separate pointers rather than the array positions
for just two items.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-06 23:14:24 +02:00
Jeff Mahoney
b491dd1769 reiserfs: cleanup, remove blocks arg from journal_join
journal_join is always called with a block count of 1. Let's just get
rid of the argument.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-06 23:11:15 +02:00
Jeff Mahoney
09f1b80ba8 reiserfs: cleanup, remove sb argument from journal_mark_dirty
journal_mark_dirty doesn't need a separate sb argument; It's provided
by the transaction handle.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-06 23:10:37 +02:00
Jeff Mahoney
58d854265c reiserfs: cleanup, remove sb argument from journal_end
journal_end doesn't need a separate sb argument; it's provided by the
transaction handle.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-06 23:08:00 +02:00
Jeff Mahoney
706a532338 reiserfs: cleanup, remove nblocks argument from journal_end
journal_end takes a block count argument but doesn't actually use it
for anything. We can remove it.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-06 23:05:40 +02:00
Jeff Mahoney
098297b27d reiserfs: cleanup, reformat comments to normal kernel style
This patch reformats comments in the reiserfs code to fit in 80 columns and
to follow the style rules.

There is no functional change but it helps make my eyes bleed less.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-06 22:52:19 +02:00
Jeff Mahoney
4cf5f7addf reiserfs: cleanup, rename key and item accessors to more friendly names
This patch does a quick search and replace:
B_N_PITEM_HEAD() -> item_head()
B_N_PDELIM_KEY() -> internal_key()
B_N_PKEY() -> leaf_key()
B_N_PITEM() -> item_body()

And the item_head version:
B_I_PITEM() -> ih_item_body()
I_ENTRY_COUNT() -> ih_entry_count()

And the treepath variants:
get_ih() -> tp_item_head()
PATH_PITEM_HEAD() -> tp_item_head()
get_item() -> tp_item_body()

... which makes the code much easier on the eyes.

I've also removed a few unused macros.

Checkpatch will complain about the 80 character limit for do_balan.c.
I've addressed that in a later patchset to split up balance_leaf().

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-06 22:51:44 +02:00
Jeff Mahoney
797d9016ce reiserfs: use per-fs commit workqueues
The reiserfs write lock hasn't been the BKL for some time. There's no
need to have different file systems queued up on the same workqueue.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-05-06 22:44:45 +02:00
Linus Torvalds
38583f095c Merge branch 'akpm' (incoming from Andrew)
Merge misc fixes from Andrew Morton:
 "13 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  agp: info leak in agpioc_info_wrap()
  fs/affs/super.c: bugfix / double free
  fanotify: fix -EOVERFLOW with large files on 64-bit
  slub: use sysfs'es release mechanism for kmem_cache
  revert "mm: vmscan: do not swap anon pages just because free+file is low"
  autofs: fix lockref lookup
  mm: filemap: update find_get_pages_tag() to deal with shadow entries
  mm/compaction: make isolate_freepages start at pageblock boundary
  MAINTAINERS: zswap/zbud: change maintainer email address
  mm/page-writeback.c: fix divide by zero in pos_ratio_polynom
  hugetlb: ensure hugepage access is denied if hugepages are not supported
  slub: fix memcg_propagate_slab_attrs
  drivers/rtc/rtc-pcf8523.c: fix month definition
2014-05-06 13:07:41 -07:00
Fabian Frederick
d353efd023 fs/affs/super.c: bugfix / double free
Commit 842a859db2 ("affs: use ->kill_sb() to simplify ->put_super()
and failure exits of ->mount()") adds .kill_sb which frees sbi but
doesn't remove sbi free in case of parse_options error causing double
free+random crash.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>	[3.14.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06 13:05:00 -07:00
Will Woods
1e2ee49f7f fanotify: fix -EOVERFLOW with large files on 64-bit
On 64-bit systems, O_LARGEFILE is automatically added to flags inside
the open() syscall (also openat(), blkdev_open(), etc).  Userspace
therefore defines O_LARGEFILE to be 0 - you can use it, but it's a
no-op.  Everything should be O_LARGEFILE by default.

But: when fanotify does create_fd() it uses dentry_open(), which skips
all that.  And userspace can't set O_LARGEFILE in fanotify_init()
because it's defined to 0.  So if fanotify gets an event regarding a
large file, the read() will just fail with -EOVERFLOW.

This patch adds O_LARGEFILE to fanotify_init()'s event_f_flags on 64-bit
systems, using the same test as open()/openat()/etc.

Addresses https://bugzilla.redhat.com/show_bug.cgi?id=696821

Signed-off-by: Will Woods <wwoods@redhat.com>
Acked-by: Eric Paris <eparis@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06 13:04:59 -07:00
Ian Kent
6b6751f7fe autofs: fix lockref lookup
autofs needs to be able to see private data dentry flags for its dentrys
that are being created but not yet hashed and for its dentrys that have
been rmdir()ed but not yet freed.  It needs to do this so it can block
processes in these states until a status has been returned to indicate
the given operation is complete.

It does this by keeping two lists, active and expring, of dentrys in
this state and uses ->d_release() to keep them stable while it checks
the reference count to determine if they should be used.

But with the recent lockref changes dentrys being freed sometimes don't
transition to a reference count of 0 before being freed so autofs can
occassionally use a dentry that is invalid which can lead to a panic.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06 13:04:59 -07:00
Nishanth Aravamudan
457c1b27ed hugetlb: ensure hugepage access is denied if hugepages are not supported
Currently, I am seeing the following when I `mount -t hugetlbfs /none
/dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`.  I think it's
related to the fact that hugetlbfs is properly not correctly setting
itself up in this state?:

  Unable to handle kernel paging request for data at address 0x00000031
  Faulting instruction address: 0xc000000000245710
  Oops: Kernel access of bad area, sig: 11 [#1]
  SMP NR_CPUS=2048 NUMA pSeries
  ....

In KVM guests on Power, in a guest not backed by hugepages, we see the
following:

  AnonHugePages:         0 kB
  HugePages_Total:       0
  HugePages_Free:        0
  HugePages_Rsvd:        0
  HugePages_Surp:        0
  Hugepagesize:         64 kB

HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
are not supported at boot-time, but this is only checked in
hugetlb_init().  Extract the check to a helper function, and use it in a
few relevant places.

This does make hugetlbfs not supported (not registered at all) in this
environment.  I believe this is fine, as there are no valid hugepages
and that won't change at runtime.

[akpm@linux-foundation.org: use pr_info(), per Mel]
[akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06 13:04:58 -07:00
Linus Torvalds
8169d3005e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "dcache fixes + kvfree() (uninlined, exported by mm/util.c) + posix_acl
  bugfix from hch"

The dcache fixes are for a subtle LRU list corruption bug reported by
Miklos Szeredi, where people inside IBM saw list corruptions with the
LTP/host01 test.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  nick kvfree() from apparmor
  posix_acl: handle NULL ACL in posix_acl_equiv_mode
  dcache: don't need rcu in shrink_dentry_list()
  more graceful recovery in umount_collect()
  don't remove from shrink list in select_collect()
  dentry_kill(): don't try to remove from shrink list
  expand the call of dentry_lru_del() in dentry_kill()
  new helper: dentry_free()
  fold try_prune_one_dentry()
  fold d_kill() and d_free()
  fix races between __d_instantiate() and checks of dentry flags
2014-05-06 12:22:20 -07:00
Christoph Hellwig
50c6e282bd posix_acl: handle NULL ACL in posix_acl_equiv_mode
Various filesystems don't bother checking for a NULL ACL in
posix_acl_equiv_mode, and thus can dereference a NULL pointer when it
gets passed one. This usually happens from the NFS server, as the ACL tools
never pass a NULL ACL, but instead of one representing the mode bits.

Instead of adding boilerplat to all filesystems put this check into one place,
which will allow us to remove the check from other filesystems as well later
on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ben Greear <greearb@candelatech.com>
Reported-by: Marco Munderloh <munderl@tnt.uni-hannover.de>,
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 13:58:42 -04:00
Trond Myklebust
4cb57e3032 NFSd: call rpc_destroy_wait_queue() from free_client()
Mainly to ensure that we don't leave any hanging timers.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-06 12:38:49 -04:00
Trond Myklebust
5694c93e6c NFSd: Move default initialisers from create_client() to alloc_client()
Aside from making it clearer what is non-trivial in create_client(), it
also fixes a bug whereby we can call free_client() before idr_init()
has been called.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-05-06 12:38:46 -04:00
Linus Torvalds
256cf4c438 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "This adds ctime update in the new cached writeback mode and also
  fixes/simplifies the mtime update handling.  Support for rename flags
  (aka renameat2) is also added to the userspace API"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: add renameat2 support
  fuse: clear MS_I_VERSION
  fuse: clear FUSE_I_CTIME_DIRTY flag on setattr
  fuse: trust kernel i_ctime only
  fuse: remove .update_time
  fuse: allow ctime flushing to userspace
  fuse: fuse: add time_gran to INIT_OUT
  fuse: add .write_inode
  fuse: clean up fsync
  fuse: fuse: fallocate: use file_update_time()
  fuse: update mtime on open(O_TRUNC) in atomic_o_trunc mode
  fuse: update mtime on truncate(2)
  fuse: do not use uninitialized i_mode
  fuse: fix mtime update error in fsync
  fuse: check fallocate mode
  fuse: add __exit to fuse_ctl_cleanup
2014-05-06 09:09:35 -07:00
Linus Torvalds
5575eeb7b9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
 "First, there is a critical fix for the new primary-affinity function
  that went into -rc1.

  The second batch of patches from Zheng fix a range of problems with
  directory fragmentation, readdir, and a few odds and ends for cephfs"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: reserve caps for file layout/lock MDS requests
  ceph: avoid releasing caps that are being used
  ceph: clear directory's completeness when creating file
  libceph: fix non-default values check in apply_primary_affinity()
  ceph: use fpos_cmp() to compare dentry positions
  ceph: check directory's completeness before emitting directory entry
2014-05-05 15:17:02 -07:00
Dave Chinner
8275cdd0e7 xfs: remote attribute overwrite causes transaction overrun
Commit e461fcb ("xfs: remote attribute lookups require the value
length") passes the remote attribute length in the xfs_da_args
structure on lookup so that CRC calculations and validity checking
can be performed correctly by related code. This, unfortunately has
the side effect of changing the args->valuelen parameter in cases
where it shouldn't.

That is, when we replace a remote attribute, the incoming
replacement stores the value and length in args->value and
args->valuelen, but then the lookup which finds the existing remote
attribute overwrites args->valuelen with the length of the remote
attribute being replaced. Hence when we go to create the new
attribute, we create it of the size of the existing remote
attribute, not the size it is supposed to be. When the new attribute
is much smaller than the old attribute, this results in a
transaction overrun and an ASSERT() failure on a debug kernel:

XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: fs/xfs/xfs_trans.c, line: 331

Fix this by keeping the remote attribute value length separate to
the attribute value length in the xfs_da_args structure. The enables
us to pass the length of the remote attribute to be removed without
overwriting the new attribute's length.

Also, ensure that when we save remote block contexts for a later
rename we zero the original state variables so that we don't confuse
the state of the attribute to be removes with the state of the new
attribute that we just added. [Spotted by Brain Foster.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-06 07:37:31 +10:00
Brian Foster
d540e43b0a xfs: initialize default acls for ->tmpfile()
The current tmpfile handler does not initialize default ACLs. Doing so
within xfs_vn_tmpfile() makes it roughly equivalent to xfs_vn_mknod(),
which is already used as a common create handler.

xfs_vn_mknod() does not currently have a mechanism to determine whether
to link the file into the namespace. Therefore, further abstract
xfs_vn_mknod() into a new xfs_generic_create() handler with a tmpfile
parameter. This new handler calls xfs_create_tmpfile() and d_tmpfile()
on the dentry when called via ->tmpfile().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-06 07:34:28 +10:00
From: Tuomas Tynkkynen
b28fd7b5fe xfs: Fix wrong error codes being returned
xfs_{compat_,}attrmulti_by_handle could return an errno with incorrect
sign in some cases. While at it, make sure ENOMEM is returned instead of
E2BIG if kmalloc fails.

Signed-off-by: Tuomas Tynkkynen <tuomas.tynkkynen@iki.fi>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-05 17:30:20 +10:00
Dave Chinner
3c35337576 xfs: remove dquot hints
group and project quota hints are currently stored on the user
dquot. If we are attaching quotas to the inode, then the group and
project dquots are stored as hints on the user dquot to save having
to look them up again later.

The thing is, the hints are not used for that inode for the rest of
the life of the inode - the dquots are attached directly to the
inode itself - so the only time the hints are used is when an inode
first has dquots attached.

When the hints on the user dquot don't match the dquots being
attache dto the inode, they are then removed and replaced with the
new hints. If a user is concurrently modifying files in different
group and/or project contexts, then this leads to thrashing of the
hints attached to user dquot.

If user quotas are not enabled, then hints are never even used.

So, if the hints are used to avoid the cost of the lookup, is the
cost of the lookup significant enough to justify the hint
infrstructure? Maybe it was once, when there was a global quota
manager shared between all XFS filesystems and was hash table based.

However, lookups are now much simpler, requiring only a single lock and
radix tree lookup local to the filesystem and no hash or LRU
manipulations to be made. Hence the cost of lookup is much lower
than when hints were implemented. Turns out that benchmarks show
that, too, with thir being no differnce in performance when doing
file creation workloads as a single user with user, group and
project quotas enabled - the hints do not make the code go any
faster. In fact, removing the hints shows a 2-3% reduction in the
time it takes to create 50 million inodes....

So, let's just get rid of the hints and the complexity around them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-05 17:30:15 +10:00
Eric Sandeen
f58522c5a4 xfs: bulletfproof xfs_qm_scall_trunc_qfiles()
Coverity noticed that if we sent junk into
xfs_qm_scall_trunc_qfiles(), we could get back an
uninitialized error value.  So sanitize the flags we
will accept, and initialize error anyway for good measure.

(This bug may have been introduced via c61a9e39).

Should resolve Coverity CID 1163872.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-05 17:27:06 +10:00
Eric Sandeen
9da93f9b7c xfs: fix Q_XQUOTARM ioctl
The Q_XQUOTARM quotactl was not working properly, because
we weren't passing around proper flags.  The xfs_fs_set_xstate()
ioctl handler used the same flags for Q_XQUOTAON/OFF as
well as for Q_XQUOTARM, but Q_XQUOTAON/OFF look for
XFS_UQUOTA_ACCT, XFS_UQUOTA_ENFD, XFS_GQUOTA_ACCT etc,
i.e. quota type + state, while Q_XQUOTARM looks only for
the type of quota, i.e. XFS_DQ_USER, XFS_DQ_GROUP etc.

Unfortunately these flag spaces overlap a bit, so we
got semi-random results for Q_XQUOTARM; i.e. the value
for XFS_DQ_USER == XFS_UQUOTA_ACCT, etc.  yeargh.

Add a new quotactl op vector specifically for the QUOTARM
operation, since it operates with a different flag space.

This has been broken more or less forever, AFAICT.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-05 17:25:50 +10:00
Artem Bityutskiy
fcdd57c890 UBIFS: fix remount error path
Dan's "smatch" checker found out that there was a bug in the error path of the
'ubifs_remount_rw()' function. Instead of jumping to the "out" label which
cleans-things up, we just returned.

This patch fixes the problem.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-05-05 09:31:33 +03:00
Dave Chinner
c99d609a16 xfs: fully support v5 format filesystems
We have had this code in the kernel for over a year now and have
shaken all the known issues out of the code over the past few
releases. It's now time to remove the experimental warnings during
mount and fully support the new filesystem format in production
systems.

Remove the experimental warning, and add a version number to the
initial "mounting filesystem" message to tell use what type of
filesystem is being mounted. Also, remove the temporary inode
cluster size output at mount time now we know that this code works
fine.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-05 16:18:37 +10:00
Miklos Szeredi
60942f2f23 dcache: don't need rcu in shrink_dentry_list()
Since now the shrink list is private and nobody can free the dentry while
it is on the shrink list, we can remove RCU protection from this.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-03 16:46:16 -04:00
Al Viro
9c8c10e262 more graceful recovery in umount_collect()
Start with shrink_dcache_parent(), then scan what remains.

First of all, BUG() is very much an overkill here; we are holding
->s_umount, and hitting BUG() means that a lot of interesting stuff
will be hanging after that point (sync(2), for example).  Moreover,
in cases when there had been more than one leak, we'll be better
off reporting all of them.  And more than just the last component
of pathname - %pd is there for just such uses...

That was the last user of dentry_lru_del(), so kill it off...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-03 16:46:13 -04:00
Al Viro
fe91522a7b don't remove from shrink list in select_collect()
If we find something already on a shrink list, just increment
data->found and do nothing else.  Loops in shrink_dcache_parent() and
check_submounts_and_drop() will do the right thing - everything we
did put into our list will be evicted and if there had been nothing,
but data->found got non-zero, well, we have somebody else shrinking
those guys; just try again.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-03 16:45:06 -04:00
Linus Torvalds
98794f9321 Merge git://git.kvack.org/~bcrl/aio-fixes
Pull aio fixes from Ben LaHaise:
 "The first change from Anatol fixes a regression where io_destroy() no
  longer waits for outstanding aios to complete.  The second corrects a
  memory leak in an error path for vectored aio operations.

  Both of these bug fixes should be queued up for stable as well"

* git://git.kvack.org/~bcrl/aio-fixes:
  aio: fix potential leak in aio_run_iocb().
  aio: block io_destroy() until all context requests are completed
2014-05-01 08:54:03 -07:00
Al Viro
41edf278fc dentry_kill(): don't try to remove from shrink list
If the victim in on the shrink list, don't remove it from there.
If shrink_dentry_list() manages to remove it from the list before
we are done - fine, we'll just free it as usual.  If not - mark
it with new flag (DCACHE_MAY_FREE) and leave it there.

Eventually, shrink_dentry_list() will get to it, remove the sucker
from shrink list and call dentry_kill(dentry, 0).  Which is where
we'll deal with freeing.

Since now dentry_kill(dentry, 0) may happen after or during
dentry_kill(dentry, 1), we need to recognize that (by seeing
DCACHE_DENTRY_KILLED already set), unlock everything
and either free the sucker (in case DCACHE_MAY_FREE has been
set) or leave it for ongoing dentry_kill(dentry, 1) to deal with.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-01 10:30:00 -04:00
Leon Yu
754320d6e1 aio: fix potential leak in aio_run_iocb().
iovec should be reclaimed whenever caller of rw_copy_check_uvector() returns,
but it doesn't hold when failure happens right after aio_setup_vectored_rw().

Fix that in a such way to avoid hairy goto.

Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
2014-05-01 08:37:43 -04:00
Al Viro
01b6035190 expand the call of dentry_lru_del() in dentry_kill()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-30 18:02:52 -04:00
Al Viro
b4f0354e96 new helper: dentry_free()
The part of old d_free() that dealt with actual freeing of dentry.
Taken out of dentry_kill() into a separate function.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-30 18:02:52 -04:00
Al Viro
5c47e6d0ad fold try_prune_one_dentry()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-30 18:02:51 -04:00
Al Viro
03b3b889e7 fold d_kill() and d_free()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-30 18:02:51 -04:00
Benjamin LaHaise
fa88b6f880 aio: cleanup: flatten kill_ioctx()
There is no need to have most of the code in kill_ioctx() indented.  Flatten
it.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-04-29 12:55:48 -04:00
Benjamin LaHaise
fb2d448383 aio: report error from io_destroy() when threads race in io_destroy()
As reported by Anatol Pomozov, io_destroy() fails to report an error when
it loses the race to destroy a given ioctx.  Since there is a difference in
behaviour between the thread that wins the race (which blocks on outstanding
io requests) versus lthe thread that loses (which returns immediately), wire
up a return code from kill_ioctx() to the io_destroy() syscall.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Anatol Pomozov <anatol.pomozov@gmail.com>
2014-04-29 12:45:17 -04:00
Yan, Zheng
3bd58143ba ceph: reserve caps for file layout/lock MDS requests
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-28 12:55:41 -07:00
Yan, Zheng
fd7b95cd1b ceph: avoid releasing caps that are being used
To avoid releasing caps that are being used, encode_inode_release()
should send implemented caps to MDS.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-28 12:55:01 -07:00
Yan, Zheng
0a8a70f96f ceph: clear directory's completeness when creating file
When creating a file, ceph_set_dentry_offset() puts the new dentry
at the end of directory's d_subdirs, then set the dentry's offset
based on directory's max offset. The offset does not reflect the
real postion of the dentry in directory. Later readdir reply from
MDS may change the dentry's position/offset. This inconsistency
can cause missing/duplicate entries in readdir result if readdir
is partly satisfied by dcache_readdir().

The fix is clear directory's completeness after creating/renaming
file. It prevents later readdir from using dcache_readdir().

Fixes: http://tracker.ceph.com/issues/8025
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-28 12:54:44 -07:00
Yan, Zheng
6da5246dd4 ceph: use fpos_cmp() to compare dentry positions
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-28 12:53:52 -07:00
Yan, Zheng
0081bd83c0 ceph: check directory's completeness before emitting directory entry
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-28 12:53:43 -07:00
Miklos Szeredi
1560c974dc fuse: add renameat2 support
Support RENAME_EXCHANGE and RENAME_NOREPLACE flags on the userspace ABI.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 16:43:44 +02:00
Miklos Szeredi
4ace1f85a7 fuse: clear MS_I_VERSION
Fuse doesn't support i_version (yet).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:25 +02:00
Maxim Patlasov
3ad22c62dd fuse: clear FUSE_I_CTIME_DIRTY flag on setattr
The patch addresses two use-cases when the flag may be safely cleared:

1. fuse_do_setattr() is called with ATTR_CTIME flag set in attr->ia_valid.
In this case attr->ia_ctime bears actual value. In-kernel fuse must send it
to the userspace server and then assign the value to inode->i_ctime.

2. fuse_do_setattr() is called with ATTR_SIZE flag set in attr->ia_valid,
whereas ATTR_CTIME is not set (truncate(2)).
In this case in-kernel fuse must sent "now" to the userspace server and then
assign the value to inode->i_ctime.

In both cases we could clear I_DIRTY_SYNC, but that needs more thought.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:25 +02:00
Maxim Patlasov
31f3267b4b fuse: trust kernel i_ctime only
Let the kernel maintain i_ctime locally: update i_ctime explicitly on
truncate, fallocate, open(O_TRUNC), setxattr, removexattr, link, rename,
unlink.

The inode flag I_DIRTY_SYNC serves as indication that local i_ctime should
be flushed to the server eventually.  The patch sets the flag and updates
i_ctime in course of operations listed above.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:24 +02:00
Miklos Szeredi
8b47e73e91 fuse: remove .update_time
This implements updating ctime as well as mtime on file_update_time().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:24 +02:00
Maxim Patlasov
ab9e13f7c7 fuse: allow ctime flushing to userspace
The patch extends fuse_setattr_in, and extends the flush procedure
(fuse_flush_times()) called on ->write_inode() to send the ctime as well as
mtime.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:24 +02:00
Miklos Szeredi
e27c9d3877 fuse: fuse: add time_gran to INIT_OUT
Allow userspace fs to specify time granularity.

This is needed because with writeback_cache mode the kernel is responsible
for generating mtime and ctime, but if the underlying filesystem doesn't
support nanosecond granularity then the cache will contain a different
value from the one stored on the filesystem resulting in a change of times
after a cache flush.

Make the default granularity 1s.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:23 +02:00
Miklos Szeredi
1e18bda86e fuse: add .write_inode
...and flush mtime from this.  This allows us to use the kernel
infrastructure for writing out dirty metadata (mtime at this point, but
ctime in the next patches and also maybe atime).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:23 +02:00
Miklos Szeredi
22401e7b7a fuse: clean up fsync
Don't need to start I/O twice (once without i_mutex and one within).

Also make sure that even if the userspace filesystem doesn't support FSYNC
we do all the steps other than sending the message.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:23 +02:00
Miklos Szeredi
93d2269d2f fuse: fuse: fallocate: use file_update_time()
in preparation for getting rid of FUSE_I_MTIME_DIRTY.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:22 +02:00
Maxim Patlasov
75caeecdf9 fuse: update mtime on open(O_TRUNC) in atomic_o_trunc mode
In case of fc->atomic_o_trunc is set, fuse does nothing in
fuse_do_setattr() while handling open(O_TRUNC). Hence, i_mtime must be
updated explicitly in fuse_finish_open(). The patch also adds extra locking
encompassing open(O_TRUNC) operation to avoid races between the truncation
and updating i_mtime.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:22 +02:00
Maxim Patlasov
009dd694e8 fuse: update mtime on truncate(2)
Handling truncate(2), VFS doesn't set ATTR_MTIME bit in iattr structure;
only ATTR_SIZE bit is set. In-kernel fuse must handle the case by setting
mtime fields of struct fuse_setattr_in to "now" and set FATTR_MTIME bit
even though ATTR_MTIME was not set.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:22 +02:00
Maxim Patlasov
d31433c8b0 fuse: do not use uninitialized i_mode
When inode is in I_NEW state, inode->i_mode is not initialized yet. Do not
use it before fuse_init_inode() is called.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:21 +02:00
Miklos Szeredi
aeb4eb6b55 fuse: fix mtime update error in fsync
Bad case of shadowing.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:21 +02:00
Miklos Szeredi
4adb83029d fuse: check fallocate mode
Don't allow new fallocate modes until we figure out what (if anything) that
takes.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:21 +02:00
Fabian Frederick
7736e8cc51 fuse: add __exit to fuse_ctl_cleanup
fuse_ctl_cleanup is only called by __exit fuse_exit

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28 14:19:21 +02:00
Fabian Frederick
5a7c6690c2 GFS2: lops.c: replace 0 by NULL for pointers
Sparse warning: fs/gfs2/lops.c:78:29:
"warning: Using plain integer as NULL pointer"

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-04-28 09:41:55 +01:00
Greg Kroah-Hartman
d35cc56ddf Merge 3.15-rc3 into staging-next 2014-04-27 21:36:39 -07:00
Linus Torvalds
33c0022f0e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: limit the path size in send to PATH_MAX
  Btrfs: correctly set profile flags on seqlock retry
  Btrfs: use correct key when repeating search for extent item
  Btrfs: fix inode caching vs tree log
  Btrfs: fix possible memory leaks in open_ctree()
  Btrfs: avoid triggering bug_on() when we fail to start inode caching task
  Btrfs: move btrfs_{set,clear}_and_info() to ctree.h
  btrfs: replace error code from btrfs_drop_extents
  btrfs: Change the hole range to a more accurate value.
  btrfs: fix use-after-free in mount_subvol()
2014-04-27 13:26:28 -07:00
Linus Torvalds
005fbcd034 Driver core fixes for 3.15-rc3
Here are some kernfs fixes for 3.15-rc3 that resolve some reported
 problems.  Nothing huge, but all needed.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iEYEABECAAYFAlNcTwEACgkQMUfUDdst+yktKQCeOZdKHq6J2od49bnwsPIlne1J
 h2kAoKs1LpEBHI/2KH/6etP5Qjks5iuB
 =5BPH
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are some kernfs fixes for 3.15-rc3 that resolve some reported
  problems.  Nothing huge, but all needed"

* tag 'driver-core-3.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  s390/ccwgroup: Fix memory corruption
  kernfs: add back missing error check in kernfs_fop_mmap()
  kernfs: fix a subdir count leak
2014-04-27 10:28:34 -07:00
Chris Mason
cfd4a535b6 Btrfs: limit the path size in send to PATH_MAX
fs_path_ensure_buf is used to make sure our path buffers for
send are big enough for the path names as we construct them.
The buffer size is limited to 32K by the length field in
the struct.

But bugs in the path construction can end up trying to build
a huge buffer, and we'll do invalid memmmoves when the
buffer length field wraps.

This patch is step one, preventing the overflows.

Signed-off-by: Chris Mason <clm@fb.com>
2014-04-26 05:02:03 -07:00
Linus Torvalds
625bba662c File locking related bugfixes for v3.15 (pile #2)
- fix for a long-standing bug in __break_lease that can cause soft lockups
 - renaming of file-private locks to "open file description" locks, and the
   command macros to more visually distinct names.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTWB/iAAoJEAAOaEEZVoIV2tUP/A1c9YUmgt+LdOJIA2k3Uh9C
 nNdZss2hj8s91qCRe1Mb7L9UjzTEEiYILYqmXMRW9yUpPI7Oxr5sjqZEqlK5lTso
 447QEow93wSE/WIKwwzdbKS+CMRNvIba6EjzQ7h0kU3ExMnFMXwD2QK7eGT2pEko
 kaQMq5BbxxIaTYmp/tKioacBPbpO3TQpS6ZWv2kZDCk4l1wCdsBNL7h3eqM63L/L
 A05zA88e3//wxVSPLA5JpQJ5fYkZrz7sqZYd+H80VXn34YQY7/7Kq16fiCprhntq
 tZb9LWOIJmSruN7r39KJgf43++fpSrv5XPfqsL4TDdwGcYwBAznhItrfOUC0Ja1+
 ZY227gHbxBwSeN9jj3zc4peOpzNPdIMnw0CEZVGn/AgssFFh/Ja8PrIQCxjI5djP
 eLqiiBBznt9HaZWPslWxaKqhdINFyuMp9LbEJ71nXwLQVYY32rOS828FAna982F3
 i0A48tPbrGpA1elGnVcsiAmJtAbZA9X6Y5M+gQGU2vWgX5GxiLeXOmEd+kVOaTmu
 2WVlwvEc3jTlxg9naGAKsfXwaOKqEIPJDoahWTpSRtNOntNwiPjg0cW80abq+Ybx
 WaPFhDLyd7290QyOASjyC4TwXMA2XvtQMQ8P+SMWkc2ZscjtuMBfEK9TBalg8tZV
 vHNrZpqnftIX7u6Y/fuT
 =rrtj
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.15-2' of git://git.samba.org/jlayton/linux

Pull file locking fixes from Jeff Layton:
 "File locking related bugfixes for v3.15 (pile #2)

   - fix for a long-standing bug in __break_lease that can cause soft
     lockups
   - renaming of file-private locks to "open file description" locks,
     and the command macros to more visually distinct names

  The fix for __break_lease is also in the pile of patches for which
  Bruce sent a pull request, but I assume that your merge procedure will
  handle that correctly.

  For the other patches, I don't like the fact that we need to rename
  this stuff at this late stage, but it should be settled now
  (hopefully)"

* tag 'locks-v3.15-2' of git://git.samba.org/jlayton/linux:
  locks: rename FL_FILE_PVT and IS_FILE_PVT to use "*_OFDLCK" instead
  locks: rename file-private locks to "open file description locks"
  locks: allow __break_lease to sleep even when break_time is 0
2014-04-25 12:40:32 -07:00
Linus Torvalds
b8e6dece37 Merge branch 'for-3.15' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
 "Three small nfsd bugfixes (including one locks.c fix for a bug
  triggered only from nfsd).

  Jeff's patches are for long-existing problems that became easier to
  trigger since the addition of vfs delegation support"

* 'for-3.15' of git://linux-nfs.org/~bfields/linux:
  Revert "nfsd4: fix nfs4err_resource in 4.1 case"
  nfsd: set timeparms.to_maxval in setup_callback_client
  locks: allow __break_lease to sleep even when break_time is 0
2014-04-25 12:39:05 -07:00
Tejun Heo
b44b214026 kernfs: add back missing error check in kernfs_fop_mmap()
While updating how mmap enabled kernfs files are handled by lockdep,
9b2db6e189 ("sysfs: bail early from kernfs_file_mmap() to avoid
spurious lockdep warning") inadvertently dropped error return check
from kernfs_file_mmap().  The intention was just dropping "if
(ops->mmap)" check as the control won't reach the point if the mmap
callback isn't implemented, but I mistakenly removed the error return
check together with it.

This led to Xorg crash on i810 which was reported and bisected to the
commit and then to the specific change by Tobias.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-bisected-by: Tobias Powalowski <tobias.powalowski@googlemail.com>
Tested-by: Tobias Powalowski <tobias.powalowski@googlemail.com>
References: http://lkml.kernel.org/g/533D01BD.1010200@googlemail.com
Cc: stable <stable@vger.kernel.org> # 3.14
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-25 12:25:13 -07:00
Jianyu Zhan
c1befb8859 kernfs: fix a subdir count leak
Currently kernfs_link_sibling() increates parent->dir.subdirs before
adding the node into parent's chidren rb tree.

Because it is possible that kernfs_link_sibling() couldn't find
a suitable slot and bail out, this leads to a mismatch between
elevated subdir count with actual children node numbers.

This patches fix this problem, by moving the subdir accouting
after the actual addtion happening.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-25 12:25:13 -07:00
Tejun Heo
d911d98748 kernfs: make kernfs_notify() trigger inotify events too
kernfs_notify() is used to indicate either new data is available or
the content of a file has changed.  It currently only triggers poll
which may not be the most convenient to monitor especially when there
are a lot to monitor.  Let's hook it up to fsnotify too so that the
events can be monitored via inotify too.

fsnotify_modify() requires file * but kernfs_notify() doesn't have any
specific file associated; however, we can walk all super_blocks
associated with a kernfs_root and as kernfs always associate one ino
with inode and one dentry with an inode, it's trivial to look up the
dentry associated with a given kernfs_node.  As any active monitor
would pin dentry, just looking up existing dentry is enough.  This
patch looks up the dentry associated with the specified kernfs_node
and generates events equivalent to fsnotify_modify().

Note that as fsnotify doesn't provide fsnotify_modify() equivalent
which can be called with dentry, kernfs_notify() directly calls
fsnotify_parent() and fsnotify().  It might be better to add a wrapper
in fsnotify.h instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-25 11:43:31 -07:00
Tejun Heo
7d568a8383 kernfs: implement kernfs_root->supers list
Currently, there's no way to find out which super_blocks are
associated with a given kernfs_root.  Let's implement it - the planned
inotify extension to kernfs_notify() needs it.

Make kernfs_super_info point back to the super_block and chain it at
kernfs_root->supers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-25 11:43:31 -07:00
Jeff Layton
a87c9ad956 cifs: fix actimeo=0 corner case when cifs_i->time == jiffies
actimeo=0 is supposed to be a special case that ensures that inode
attributes are always refetched from the server instead of trusting the
cache. The cifs code however uses time_in_range() to determine whether
the attributes have timed out. In the case where cifs_i->time equals
jiffies, this leads to the cifs code not refetching the inode attributes
when it should.

Fix this by explicitly testing for actimeo=0, and handling it as a
special case.

Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-04-24 22:37:03 -05:00
Filipe Manana
f8213bdc89 Btrfs: correctly set profile flags on seqlock retry
If we had to retry on the profiles seqlock (due to a concurrent write), we
would set bits on the input flags that corresponded both to the current
profile and to previous values of the profile.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:33 -07:00
Filipe Manana
9ce49a0b4f Btrfs: use correct key when repeating search for extent item
If skinny metadata is enabled and our first tree search fails to find a
skinny extent item, we may repeat a tree search for a "fat" extent item
(if the previous item in the leaf is not the "fat" extent we're looking
for). However we were not setting the new key's objectid to the right
value, as we previously used the same key variable to peek at the previous
item in the leaf, which has a different objectid. So just set the right
objectid to avoid modifying/deleting a wrong item if we repeat the tree
search.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:33 -07:00
Miao Xie
1c70d8fb4d Btrfs: fix inode caching vs tree log
Currently, with inode cache enabled, we will reuse its inode id immediately
after unlinking file, we may hit something like following:

|->iput inode
|->return inode id into inode cache
|->create dir,fsync
|->power off

An easy way to reproduce this problem is:

mkfs.btrfs -f /dev/sdb
mount /dev/sdb /mnt -o inode_cache,commit=100
dd if=/dev/zero of=/mnt/data bs=1M count=10 oflag=sync
inode_id=`ls -i /mnt/data | awk '{print $1}'`
rm -f /mnt/data

i=1
while [ 1 ]
do
        mkdir /mnt/dir_$i
        test1=`stat /mnt/dir_$i | grep Inode: | awk '{print $4}'`
        if [ $test1 -eq $inode_id ]
        then
		dd if=/dev/zero of=/mnt/dir_$i/data bs=1M count=1 oflag=sync
		echo b > /proc/sysrq-trigger
	fi
	sleep 1
        i=$(($i+1))
done

mount /dev/sdb /mnt
umount /dev/sdb
btrfs check /dev/sdb

We fix this problem by adding unlinked inode's id into pinned tree,
and we can not reuse them until committing transaction.

Cc: stable@vger.kernel.org
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:33 -07:00
Wang Shilong
28c16cbbc3 Btrfs: fix possible memory leaks in open_ctree()
Fix possible memory leaks in the following error handling paths:

read_tree_block()
btrfs_recover_log_trees
btrfs_commit_super()
btrfs_find_orphan_roots()
btrfs_cleanup_fs_roots()

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:32 -07:00
Wang Shilong
e60efa8425 Btrfs: avoid triggering bug_on() when we fail to start inode caching task
When running stress test(including snapshots,balance,fstress), we trigger
the following BUG_ON() which is because we fail to start inode caching task.

[  181.131945] kernel BUG at fs/btrfs/inode-map.c:179!
[  181.137963] invalid opcode: 0000 [#1] SMP
[  181.217096] CPU: 11 PID: 2532 Comm: btrfs Not tainted 3.14.0 #1
[  181.240521] task: ffff88013b621b30 ti: ffff8800b6ada000 task.ti: ffff8800b6ada000
[  181.367506] Call Trace:
[  181.371107]  [<ffffffffa036c1be>] btrfs_return_ino+0x9e/0x110 [btrfs]
[  181.379191]  [<ffffffffa038082b>] btrfs_evict_inode+0x46b/0x4c0 [btrfs]
[  181.387464]  [<ffffffff810b5a70>] ? autoremove_wake_function+0x40/0x40
[  181.395642]  [<ffffffff811dc5fe>] evict+0x9e/0x190
[  181.401882]  [<ffffffff811dcde3>] iput+0xf3/0x180
[  181.408025]  [<ffffffffa03812de>] btrfs_orphan_cleanup+0x1ee/0x430 [btrfs]
[  181.416614]  [<ffffffffa03a6abd>] btrfs_mksubvol.isra.29+0x3bd/0x450 [btrfs]
[  181.425399]  [<ffffffffa03a6cd6>] btrfs_ioctl_snap_create_transid+0x186/0x190 [btrfs]
[  181.435059]  [<ffffffffa03a6e3b>] btrfs_ioctl_snap_create_v2+0xeb/0x130 [btrfs]
[  181.444148]  [<ffffffffa03a9656>] btrfs_ioctl+0xf76/0x2b90 [btrfs]
[  181.451971]  [<ffffffff8117e565>] ? handle_mm_fault+0x475/0xe80
[  181.459509]  [<ffffffff8167ba0c>] ? __do_page_fault+0x1ec/0x520
[  181.467046]  [<ffffffff81185b35>] ? do_mmap_pgoff+0x2f5/0x3c0
[  181.474393]  [<ffffffff811d4da8>] do_vfs_ioctl+0x2d8/0x4b0
[  181.481450]  [<ffffffff811d5001>] SyS_ioctl+0x81/0xa0
[  181.488021]  [<ffffffff81680b69>] system_call_fastpath+0x16/0x1b

We should avoid triggering BUG_ON() here, instead, we output warning messages
and clear inode_cache option.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:32 -07:00
Wang Shilong
9d89ce6587 Btrfs: move btrfs_{set,clear}_and_info() to ctree.h
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:32 -07:00
David Sterba
3f9e3df8da btrfs: replace error code from btrfs_drop_extents
There's a case which clone does not handle and used to BUG_ON instead,
(testcase xfstests/btrfs/035), now returns EINVAL. This error code is
confusing to the ioctl caller, as it normally signifies errorneous
arguments.

Change it to ENOPNOTSUPP which allows a fall back to copy instead of
clone. This does not affect the common reflink operation.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:32 -07:00
Qu Wenruo
c5f7d0bb29 btrfs: Change the hole range to a more accurate value.
Commit 3ac0d7b96a fixed the btrfs expanding
write problem but the hole punched is sometimes too large for some
iovec, which has unmapped data ranges.
This patch will change to hole range to a more accurate value using the
counts checked by the write check routines.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:32 -07:00
Brian Foster
53801fd97a xfs: enable the finobt feature on v5 superblocks
Add the finobt feature bit to the list of known features. As of
this point, the kernel code knows how to mount and manage both
finobt and non-finobt formatted filesystems.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:01:42 +10:00
Brian Foster
0c153c1e43 xfs: report finobt status in fs geometry
Define the XFS_FSOP_GEOM_FLAGS_FINOBT fs geometry flag and set the
associated bit if the filesystem supports the free inode btree.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:01:41 +10:00
Brian Foster
a3fa516dd8 xfs: add finobt support to growfs
Add finobt support to growfs. Initialize the agi root/level fields
and the root finobt block.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:01:39 +10:00
Brian Foster
3efa4ffd58 xfs: update the finobt on inode free
An inode free operation can have several effects on the finobt. If
all inodes have been freed and the chunk deallocated, we remove the
finobt record. If the inode chunk was previously full, we must
insert a new record based on the existing inobt record. Otherwise,
we modify the record in place.

Create the xfs_difree_finobt() function to identify the potential
scenarios and update the finobt appropriately.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:53 +10:00
Brian Foster
2b64ee5cdc xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper
Refactor xfs_difree() in preparation for the finobt. xfs_difree()
performs the validity checks against the ag and reads the agi
header. The work of physically updating the inode allocation btree
is pushed down into the new xfs_difree_inobt() helper.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:53 +10:00
Brian Foster
6dd8638e4e xfs: use and update the finobt on inode allocation
Replace xfs_dialloc_ag() with an implementation that looks for a
record in the finobt. The finobt only tracks records with at least
one free inode. This eliminates the need for the intra-ag scan in
the original algorithm. Once the inode is allocated, update the
finobt appropriately (possibly removing the record) as well as the
inobt.

Move the original xfs_dialloc_ag() algorithm to
xfs_dialloc_ag_inobt() and fall back as such if finobt support is
not enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:53 +10:00
Brian Foster
0aa0a756ec xfs: insert newly allocated inode chunks into the finobt
A newly allocated inode chunk, by definition, has at least one
free inode, so a record is always inserted into the finobt.

Create the xfs_inobt_insert() helper from existing code to insert
a record in an inobt based on the provided BTNUM. Update
xfs_ialloc_ag_alloc() to invoke the helper for the existing
XFS_BTNUM_INO tree and XFS_BTNUM_FINO tree, if enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:53 +10:00
Brian Foster
9d43b180af xfs: update inode allocation/free transaction reservations for finobt
Create the xfs_calc_finobt_res() helper to calculate the finobt log
reservation for inode allocation and free. Update
XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
reserve blocks for the potential finobt record insertion on inode
free (i.e., if an inode chunk was previously fully allocated).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:52 +10:00
Brian Foster
aafc3c2465 xfs: support the XFS_BTNUM_FINOBT free inode btree type
Define the AGI fields for the finobt root/level and add magic
numbers. Update the btree code to add support for the new
XFS_BTNUM_FINOBT inode btree.

The finobt root block is reserved immediately following the inobt
root block in the AG. Update XFS_PREALLOC_BLOCKS() to determine the
starting AG data block based on whether finobt support is enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:52 +10:00
Brian Foster
8e2c84df20 xfs: reserve v5 superblock read-only compat. feature bit for finobt
Reserve a v5 read-only compatibility feature bit for the finobt and
create the xfs_sb_version_hasfinobt() helper to determine whether
an fs has the feature enabled.

The finobt does not change existing on-disk structures, but must
remain consistent with the ialloc btree. Modifications from older
kernels would violate that constrant. Therefore, we restrict older
kernels to read-only mounts of finobt-enabled filesystems.

Note that this does not yet enable the ability to rw mount a finobt
fs (by setting the feature bit in the XFS_SB_FEAT_RO_COMPAT_ALL
mask).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:52 +10:00
Brian Foster
57bd3dbe40 xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
The introduction of the free inode btree (finobt) requires that
xfs_ialloc_btree.c handle multiple trees. Refactor xfs_ialloc_btree.c
so the caller specifies the btree type on cursor initialization to
prepare for addition of the finobt.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:50 +10:00
Jeff Layton
cff2fce58b locks: rename FL_FILE_PVT and IS_FILE_PVT to use "*_OFDLCK" instead
File-private locks have been re-christened as "open file description"
locks.  Finish the symbol name cleanup in the internal implementation.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-04-23 16:17:03 -04:00
Christoph Hellwig
b94acd4786 xfs: add filestream allocator tracepoints
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23 07:11:52 +10:00
Christoph Hellwig
3b8d90766a xfs: remove xfs_filestream_associate
There is no good reason to create a filestream when a directory entry
is created.  Delay it until the first allocation happens to simply
the code and reduce the amount of mru cache lookups we do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23 07:11:52 +10:00
Christoph Hellwig
1919adda07 xfs: don't create a slab cache for filestream items
We only have very few of these around, and allocation isn't that
much of a hot path.  Remove the slab cache to simplify the code,
and to not waste any resources for the usual case of not having
any inodes that use the filestream allocator.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23 07:11:51 +10:00
Christoph Hellwig
2cd2ef6a30 xfs: rewrite the filestream allocator using the dentry cache
In Linux we will always be able to find a parent inode for file that are
undergoing I/O.  Use this to simply the file stream allocator by only
keeping track of parent inodes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-04-23 07:11:51 +10:00
Christoph Hellwig
f37211c336 xfs: remove XFS_IFILESTREAM
We never test the flag except in xfs_inode_is_filestream, but that
function already tests the on-disk flag or filesystem wide flags,
and is used to decide if we want to set XFS_IFILESTREAM in the
first place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23 07:11:51 +10:00
Christoph Hellwig
22328d712d xfs: embedd mru_elem into parent structure
There is no need to do a separate allocation for each mru element, just
embedd the structure into the parent one in the user.  Besides saving
a memory allocation and the infrastructure required for it this also
simplifies the API.

While we do major surgery on xfs_mru_cache.c also de-typedef it and
make struct mru_cache private to the implementation file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23 07:11:51 +10:00
Christoph Hellwig
ce695c6551 xfs: handle duplicate entries in xfs_mru_cache_insert
The radix tree code can detect and reject duplicate keys at insert
time.  Make xfs_mru_cache_insert handle this case so that future
changes to the filestream allocator can take advantage of this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23 07:11:50 +10:00
Christoph Hellwig
c977eb1065 xfs: split xfs_bmap_btalloc_nullfb
Split xfs_bmap_btalloc_nullfb into one function for filestream allocations
and one for everything else that share a few helpers.  This dramatically
simplifies the control flow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-04-23 07:11:41 +10:00
Fabian Frederick
7410b3c6c5 fs/bio.c: remove nr_segs (unused function parameter)
nr_segs is no longer used in bio_alloc_map_data since c8db444820
("block: Don't save/copy bvec array anymore")

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-22 15:09:07 -06:00
Fabian Frederick
a6c39cb4f7 fs/bio: remove bs paramater in biovec_create_pool
bs is no longer used in biovec_create_pool since 9f060e2231 ("block:
Convert integrity to bvec_alloc_bs()")

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-22 15:09:05 -06:00
Fabian Frederick
d52a8f9ead fs/aio.c: Remove ctx parameter in kiocb_cancel
ctx is no longer used in kiocb_cancel since

57282d8fd7 ("aio: Kill ki_users")

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-04-22 12:27:24 -04:00
Jeff Layton
0d3f7a2dd2 locks: rename file-private locks to "open file description locks"
File-private locks have been merged into Linux for v3.15, and *now*
people are commenting that the name and macro definitions for the new
file-private locks suck.

...and I can't even disagree. The names and command macros do suck.

We're going to have to live with these for a long time, so it's
important that we be happy with the names before we're stuck with them.
The consensus on the lists so far is that they should be rechristened as
"open file description locks".

The name isn't a big deal for the kernel, but the command macros are not
visually distinct enough from the traditional POSIX lock macros. The
glibc and documentation folks are recommending that we change them to
look like F_OFD_{GETLK|SETLK|SETLKW}. That lessens the chance that a
programmer will typo one of the commands wrong, and also makes it easier
to spot this difference when reading code.

This patch makes the following changes that I think are necessary before
v3.15 ships:

1) rename the command macros to their new names. These end up in the uapi
   headers and so are part of the external-facing API. It turns out that
   glibc doesn't actually use the fcntl.h uapi header, but it's hard to
   be sure that something else won't. Changing it now is safest.

2) make the the /proc/locks output display these as type "OFDLCK"

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Carlos O'Donell <carlos@redhat.com>
Cc: Stefan Metzmacher <metze@samba.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Frank Filz <ffilzlnx@mindspring.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-04-22 08:23:58 -04:00
Dmitry Monakhov
236f5ecb4a ext4: remove obsoleted check
BH can not be NULL at this point, ext4_read_dirblock() always return
non null value, and we already have done all necessery checks.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-04-21 14:38:14 -04:00
Theodore Ts'o
202ee5df38 ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
To avoid potential data races, use a spinlock which protects the raw
(on-disk) inode.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21 14:37:55 -04:00
Theodore Ts'o
f5ccfe1ddb ext4: fix locking for O_APPEND writes
Al Viro pointed out that locking for O_APPEND writes was problematic,
since the location of the write isn't known until after we take the
i_mutex, which impacts the ext4_unaligned_aio() and s_bitmap_maxbytes
check.

For O_APPEND always assume that the write is unaligned so call
ext4_unwritten_wait().  And to solve the second problem, take the
i_mutex earlier before we start the s_bitmap_maxbytes check.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-21 14:37:52 -04:00
Theodore Ts'o
7ed07ba8c3 ext4: factor out common code in ext4_file_write()
This shouldn't change any logic flow; just delete duplicated code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21 14:36:30 -04:00
Theodore Ts'o
8ad2850f44 ext4: move ext4_file_dio_write() into ext4_file_write()
This commit doesn't actually change anything; it just moves code
around in preparation for some code simplification work.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21 14:26:57 -04:00
Theodore Ts'o
7608e61044 ext4: inline generic_file_aio_write() into ext4_file_write()
Copy generic_file_aio_write() into ext4_file_write().  This is part of
a patch series which allows us to simplify ext4_file_write() and
ext4_file_dio_write(), by calling __generic_file_aio_write() directly.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21 14:26:28 -04:00
Randy Dunlap
1051a902fe fs: fix new kernel-doc warnings in fs/bio.c
Fix new kernel-doc warnings in fs/bio.c:

Warning(fs/bio.c:316): No description found for parameter 'bio'
Warning(fs/bio.c:316): No description found for parameter 'parent'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-21 10:39:00 -06:00
Lukas Czerner
556615dcbf ext4: rename uninitialized extents to unwritten
Currently in ext4 there is quite a mess when it comes to naming
unwritten extents. Sometimes we call it uninitialized and sometimes we
refer to it as unwritten.

The right name for the extent which has been allocated but does not
contain any written data is _unwritten_. Other file systems are
using this name consistently, even the buffer head state refers to it as
unwritten. We need to fix this confusion in ext4.

This commit changes every reference to an uninitialized extent (meaning
allocated but unwritten) to unwritten extent. This includes comments,
function names and variable names. It even covers abbreviation of the
word uninitialized (such as uninit) and some misspellings.

This commit does not change any of the code paths at all. This has been
confirmed by comparing md5sums of the assembly code of each object file
after all the function names were stripped from it.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-20 23:45:47 -04:00
Lukas Czerner
090f32ee4e ext4: get rid of EXT4_MAP_UNINIT flag
Currently EXT4_MAP_UNINIT is used in dioread_nolock case to mark the
cases where we're using dioread_nolock and we're writing into either
unallocated, or unwritten extent, because we need to make sure that
any DIO write into that inode will wait for the extent conversion.

However EXT4_MAP_UNINIT is not only entirely misleading name but also
unnecessary because we can check for EXT4_MAP_UNWRITTEN in the
dioread_nolock case instead.

This commit removes EXT4_MAP_UNINIT flag.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-20 23:44:47 -04:00
Linus Torvalds
9ac0367501 These are regression and bug fixes for ext4.
We had a number of new features in ext4 during this merge window
 (ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so
 there were many more regression and bug fixes this time around.  It
 didn't help that xfstests hadn't been fully updated to fully stress
 test COLLAPSE_RANGE until after -rc1.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTVIEUAAoJENNvdpvBGATwnKkQANlzQv6BhgzCa0b5Iu0SkHeD
 OuLAtPFYE5OVEK22oWT0H76gBi71RHLboHwThd+ZfEeEPvyfs42wY0J/PV/R9dHx
 kwhU+MaDDzugfVj3gg29DpYNLQkL/evq0vlNbrRk5je877c2I8JbXV/aAoTVFZoH
 NGOsagwBqWCsgL5nSOk/nEZSRX2AzSCkgmOVxylLzFoyTUkX3vZx8G8XtS1zRgbH
 b1yOWIK1Ifj7tmBZ4HLpNiK6/NpHAHeHRFiaCQxY0hkLjUeMyVNJfZzXS/Fzp8DP
 p1/nm5z9PaFj4nyBC1Wvh9Z6Lj0zQ0ap73LV+w4fHM1SZub3XY+hvyXj/8qMNaSc
 lLIGwa2AZFpurbKKn6MZTi5CubVLZs6PZKzDgYURnEcJCgeMujMOxbKekcL5sP9E
 Gb6Hh9I/f08HagCRox5O0W7f0/TBY5bFryx5kQQZUtpcRmnY3m7cohSkn6WriwTZ
 zYApOZMZkFX5spSeYsfyi8K8wHij/5mXvm7qeqQ0Rj4Ehycd+7jwltOCVXAYN29+
 zSKaBaxH2+V7zuGHSxjDFbOOlPotTFNzGmFh08DPTF4Vgnc9uMlLo0Oz8ADFDcT2
 JZ4pAFTEREnHOATNl5bAEi8wNrU/Ln9IGhlYCYI9X5BQXjf9oPXcYwQT/lKCb07s
 ks8ujfry1R/gjQGuv+LH
 =gi42
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "These are regression and bug fixes for ext4.

  We had a number of new features in ext4 during this merge window
  (ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so
  there were many more regression and bug fixes this time around.  It
  didn't help that xfstests hadn't been fully updated to fully stress
  test COLLAPSE_RANGE until after -rc1"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
  ext4: disable COLLAPSE_RANGE for bigalloc
  ext4: fix COLLAPSE_RANGE failure with 1KB block size
  ext4: use EINVAL if not a regular file in ext4_collapse_range()
  ext4: enforce we are operating on a regular file in ext4_zero_range()
  ext4: fix extent merging in ext4_ext_shift_path_extents()
  ext4: discard preallocations after removing space
  ext4: no need to truncate pagecache twice in collapse range
  ext4: fix removing status extents in ext4_collapse_range()
  ext4: use filemap_write_and_wait_range() correctly in collapse range
  ext4: use truncate_pagecache() in collapse range
  ext4: remove temporary shim used to merge COLLAPSE_RANGE and ZERO_RANGE
  ext4: fix ext4_count_free_clusters() with EXT4FS_DEBUG and bigalloc enabled
  ext4: always check ext4_ext_find_extent result
  ext4: fix error handling in ext4_ext_shift_extents
  ext4: silence sparse check warning for function ext4_trim_extent
  ext4: COLLAPSE_RANGE only works on extent-based files
  ext4: fix byte order problems introduced by the COLLAPSE_RANGE patches
  ext4: use i_size_read in ext4_unaligned_aio()
  fs: disallow all fallocate operation on active swapfile
  fs: move falloc collapse range check into the filesystem methods
  ...
2014-04-20 20:43:47 -07:00
Namjae Jeon
0a04b24853 ext4: disable COLLAPSE_RANGE for bigalloc
Once COLLAPSE RANGE is be disable for ext4 with bigalloc feature till finding
root-cause of problem. It will be enable with fixing that regression of
xfstest(generic 075 and 091) again.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-19 16:38:21 -04:00
Namjae Jeon
a8680e0d5e ext4: fix COLLAPSE_RANGE failure with 1KB block size
When formatting with 1KB or 2KB(not aligned with PAGE SIZE) block
size, xfstests generic/075 and 091 are failing. The offset supplied to
function truncate_pagecache_range is block size aligned. In this
function start offset is re-aligned to PAGE_SIZE by rounding_up to the
next page boundary.  Due to this rounding up, old data remains in the
page cache when blocksize is less than page size and start offset is
not aligned with page size.  In case of collapse range, we need to
align start offset to page size boundary by doing a round down
operation instead of round up.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-19 16:37:31 -04:00
Eric Dumazet
404ca80eb5 coredump: fix va_list corruption
A va_list needs to be copied in case it needs to be used twice.

Thanks to Hugh for debugging this issue, leading to various panics.

Tested:

  lpq84:~# echo "|/foobar12345 %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h" >/proc/sys/kernel/core_pattern

'produce_core' is simply : main() { *(int *)0 = 1;}

  lpq84:~# ./produce_core
  Segmentation fault (core dumped)
  lpq84:~# dmesg | tail -1
  [  614.352947] Core dump to |/foobar12345 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 (null) pipe failed

Notice the last argument was replaced by a NULL (we were lucky enough to
not crash, but do not try this on your production machine !)

After fix :

  lpq83:~# echo "|/foobar12345 %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h" >/proc/sys/kernel/core_pattern
  lpq83:~# ./produce_core
  Segmentation fault
  lpq83:~# dmesg | tail -1
  [  740.800441] Core dump to |/foobar12345 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 pipe failed

Fixes: 5fe9d8ca21 ("coredump: cn_vprintf() has no reason to call vsnprintf() twice")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Diagnosed-by: Hugh Dickins <hughd@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-19 13:23:31 -07:00
Al Viro
22213318af fix races between __d_instantiate() and checks of dentry flags
in non-lazy walk we need to be careful about dentry switching from
negative to positive - both ->d_flags and ->d_inode are updated,
and in some places we might see only one store.  The cases where
dentry has been obtained by dcache lookup with ->i_mutex held on
parent are safe - ->d_lock and ->i_mutex provide all the barriers
we need.  However, there are several places where we run into
trouble:
	* do_last() fetches ->d_inode, then checks ->d_flags and
assumes that inode won't be NULL unless d_is_negative() is true.
Race with e.g. creat() - we might have fetched the old value of
->d_inode (still NULL) and new value of ->d_flags (already not
DCACHE_MISS_TYPE).  Lin Ming has observed and reported the resulting
oops.
	* a bunch of places checks ->d_inode for being non-NULL,
then checks ->d_flags for "is it a symlink".  Race with symlink(2)
in case if our CPU sees ->d_inode update first - we see non-NULL
there, but ->d_flags still contains DCACHE_MISS_TYPE instead of
DCACHE_SYMLINK_TYPE.  Result: false negative on "should we follow
link here?", with subsequent unpleasantness.

Cc: stable@vger.kernel.org # 3.13 and 3.14 need that one
Reported-and-tested-by: Lin Ming <minggr@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-19 12:30:58 -04:00
Linus Torvalds
6e66d5dab5 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "A set of 5 small cifs fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cif: fix dead code
  cifs: fix error handling cifs_user_readv
  fs: cifs: remove unused variable.
  Return correct error on query of xattr on file with empty xattrs
  cifs: Wait for writebacks to complete before attempting write.
2014-04-18 17:52:39 -07:00
Linus Torvalds
60fbf2bda1 driver core fixes for 3.15-rc2
Here are some driver core fixes for 3.15-rc2.  Also in here are some
 documentation updates, as well as an API removal that had to wait for
 after -rc1 due to the cleanups coming into you from multiple developer
 trees (this one and the PPC tree.)
 
 All have been in linux next successfully.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iEYEABECAAYFAlNRl6cACgkQMUfUDdst+yllxACfV9fZ/A6IQja60AdPEo+oa6Cw
 RiIAoJtH0D0G0eC4+/Qs9GSRMoB4jPPC
 =Wi3a
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are some driver core fixes for 3.15-rc2.  Also in here are some
  documentation updates, as well as an API removal that had to wait for
  after -rc1 due to the cleanups coming into you from multiple developer
  trees (this one and the PPC tree.)

  All have been in linux next successfully"

* tag 'driver-core-3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  drivers/base/dd.c incorrect pr_debug() parameters
  Documentation: Update stable address in Chinese and Japanese translations
  topology: Fix compilation warning when not in SMP
  Chinese: add translation of io_ordering.txt
  stable_kernel_rules: spelling/word usage
  sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()
  kernfs: protect lazy kernfs_iattrs allocation with mutex
  fs: Don't return 0 from get_anon_bdev
2014-04-18 16:59:52 -07:00
Theodore Ts'o
86f1ca3889 ext4: use EINVAL if not a regular file in ext4_collapse_range()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18 11:52:11 -04:00
jon ernst
6c5e73d3a2 ext4: enforce we are operating on a regular file in ext4_zero_range()
Signed-off-by: Jon Ernst <jonernst07@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18 11:50:35 -04:00
Lukas Czerner
6dd834effc ext4: fix extent merging in ext4_ext_shift_path_extents()
There is a bug in ext4_ext_shift_path_extents() where if we actually
manage to merge a extent we would skip shifting the next extent. This
will result in in one extent in the extent tree not being properly
shifted.

This is causing failure in various xfstests tests using fsx or fsstress
with collapse range support. It will also cause file system corruption
which looks something like:

 e2fsck 1.42.9 (4-Feb-2014)
 Pass 1: Checking inodes, blocks, and sizes
 Inode 20 has out of order extents
        (invalid logical block 3, physical block 492938, len 2)
 Clear? yes
 ...

when running e2fsck.

It's also very easily reproducible just by running fsx without any
parameters. I can usually hit the problem within a minute.

Fix it by increasing ex_start only if we're not merging the extent.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
2014-04-18 10:55:24 -04:00
Lukas Czerner
ef24f6c234 ext4: discard preallocations after removing space
Currently in ext4_collapse_range() and ext4_punch_hole() we're
discarding preallocation twice. Once before we attempt to do any changes
and second time after we're done with the changes.

While the second call to ext4_discard_preallocations() in
ext4_punch_hole() case is not needed, we need to discard preallocation
right after ext4_ext_remove_space() in collapse range case because in
the case we had to restart a transaction in the middle of removing space
we might have new preallocations created.

Remove unneeded ext4_discard_preallocations() ext4_punch_hole() and move
it to the better place in ext4_collapse_range()

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18 10:50:23 -04:00
Lukas Czerner
9337d5d31a ext4: no need to truncate pagecache twice in collapse range
We're already calling truncate_pagecache() before we attempt to do any
actual job so there is not need to truncate pagecache once more using
truncate_setsize() after we're finished.

Remove truncate_setsize() and replace it just with i_size_write() note
that we're holding appropriate locks.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18 10:48:25 -04:00
Lukas Czerner
2c1d23289b ext4: fix removing status extents in ext4_collapse_range()
Currently in ext4_collapse_range() when calling ext4_es_remove_extent() to
remove status extents we're passing (EXT_MAX_BLOCKS - punch_start - 1)
in order to remove all extents from start of the collapse range to the
end of the file. However this is wrong because we might miss the
possible extent covering the last block of the file.

Fix it by removing the -1.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
2014-04-18 10:43:21 -04:00
Lukas Czerner
1a66c7c3be ext4: use filemap_write_and_wait_range() correctly in collapse range
Currently we're passing -1 as lend argumnet for
filemap_write_and_wait_range() which is wrong since lend is signed type
so it would cause some confusion and we might not write_and_wait for the
entire range we're expecting to write.

Fix it by using LLONG_MAX instead.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18 10:41:52 -04:00
Lukas Czerner
694c793fc1 ext4: use truncate_pagecache() in collapse range
We should be using truncate_pagecache() instead of
truncate_pagecache_range() in the collapse range because we're
truncating page cache from offset to the end of file.
truncate_pagecache() also get rid of the private COWed pages from the
range because we're going to shift the end of the file.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18 10:21:15 -04:00
J. Bruce Fields
fc208d026b Revert "nfsd4: fix nfs4err_resource in 4.1 case"
Since we're still limiting attributes to a page, the result here is that
a large getattr result will return NFS4ERR_REP_TOO_BIG/TOO_BIG_TO_CACHE
instead of NFS4ERR_RESOURCE.

Both error returns are wrong, and the real bug here is the arbitrary
limit on getattr results, fixed by as-yet out-of-tree patches.  But at a
minimum we can make life easier for clients by sticking to one broken
behavior in released kernels instead of two....

Trond says:

	one immediate consequence of this patch will be that NFSv4.1
	clients will now report EIO instead of EREMOTEIO if they hit the
	problem. That may make debugging a little less obvious.

	Another consequence will be that if we ever do try to add client
	side handling of NFS4ERR_REP_TOO_BIG, then we now have to deal
	with the “handle existing buggy server” syndrome.

Reported-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-04-18 14:46:45 +02:00
Jeff Layton
3758cf7e14 nfsd: set timeparms.to_maxval in setup_callback_client
...otherwise the logic in the timeout handling doesn't work correctly.

Spotted-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-04-18 14:34:31 +02:00
Jeff Layton
4991a628a7 locks: allow __break_lease to sleep even when break_time is 0
A fl->fl_break_time of 0 has a special meaning to the lease break code
that basically means "never break the lease". knfsd uses this to ensure
that leases don't disappear out from under it.

Unfortunately, the code in __break_lease can end up passing this value
to wait_event_interruptible as a timeout, which prevents it from going
to sleep at all. This causes __break_lease to spin in a tight loop and
causes soft lockups.

Fix this by ensuring that we pass a minimum value of 1 as a timeout
instead.

Cc: <stable@vger.kernel.org>
Cc: J. Bruce Fields <bfields@fieldses.org>
Reported-by: Terry Barnaby <terry1@beam.ltd.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-04-18 14:34:30 +02:00
Peter Zijlstra
4e857c58ef arch: Mass conversion of smp_mb__*()
Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18 14:20:48 +02:00
Dongsheng Yang
8698a745d8 sched, treewide: Replace hardcoded nice values with MIN_NICE/MAX_NICE
Replace various -20/+19 hardcoded nice values with MIN_NICE/MAX_NICE.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ff13819fd09b7a5dba5ab5ae797f2e7019bdfa17.1394532288.git.yangds.fnst@cn.fujitsu.com
Cc: devel@driverdev.osuosl.org
Cc: devicetree@vger.kernel.org
Cc: fcoe-devel@open-fcoe.org
Cc: linux390@de.ibm.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-s390@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: nbd-general@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: openipmi-developer@lists.sourceforge.net
Cc: qla2xxx-upstream@qlogic.com
Cc: linux-arch@vger.kernel.org
[ Consolidated the patches, twiddled the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18 12:07:24 +02:00
Abhi Das
991deec819 GFS2: quotas not being refreshed in gfs2_adjust_quota
Old values of user quota limits were being used and
could allow users to exceed their allotted quotas.
This patch refreshes the limits to the latest values
so that quotas are enforced correctly.

Resolves: rhbz#1077463
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-04-17 09:59:40 +01:00
Michael Opdenacker
1f80c0cc39 cif: fix dead code
This issue was found by Coverity (CID 1202536)

This proposes a fix for a statement that creates dead code.
The "rc < 0" statement is within code that is run
with "rc > 0".

It seems like "err < 0" was meant to be used here.
This way, the error code is returned by the function.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-04-16 23:08:57 -05:00
Jeff Layton
bae9f746a1 cifs: fix error handling cifs_user_readv
Coverity says:

*** CID 1202537:  Dereference after null check  (FORWARD_NULL)
/fs/cifs/file.c: 2873 in cifs_user_readv()
2867     		cur_len = min_t(const size_t, len - total_read, cifs_sb->rsize);
2868     		npages = DIV_ROUND_UP(cur_len, PAGE_SIZE);
2869
2870     		/* allocate a readdata struct */
2871     		rdata = cifs_readdata_alloc(npages,
2872     					    cifs_uncached_readv_complete);
>>>     CID 1202537:  Dereference after null check  (FORWARD_NULL)
>>>     Comparing "rdata" to null implies that "rdata" might be null.
2873     		if (!rdata) {
2874     			rc = -ENOMEM;
2875     			goto error;
2876     		}
2877
2878     		rc = cifs_read_allocate_pages(rdata, npages);

...when we "goto error", rc will be non-zero, and then we end up trying
to do a kref_put on the rdata (which is NULL). Fix this by replacing
the "goto error" with a "break".

Reported-by: <scan-admin@coverity.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-04-16 22:54:30 -05:00
Brian Foster
330033d697 xfs: fix tmpfile/selinux deadlock and initialize security
xfstests generic/004 reproduces an ilock deadlock using the tmpfile
interface when selinux is enabled. This occurs because
xfs_create_tmpfile() takes the ilock and then calls d_tmpfile(). The
latter eventually calls into xfs_xattr_get() which attempts to get the
lock again. E.g.:

xfs_io          D ffffffff81c134c0  4096  3561   3560 0x00000080
ffff8801176a1a68 0000000000000046 ffff8800b401b540 ffff8801176a1fd8
00000000001d5800 00000000001d5800 ffff8800b401b540 ffff8800b401b540
ffff8800b73a6bd0 fffffffeffffffff ffff8800b73a6bd8 ffff8800b5ddb480
Call Trace:
[<ffffffff8177f969>] schedule+0x29/0x70
[<ffffffff81783a65>] rwsem_down_read_failed+0xc5/0x120
[<ffffffffa05aa97f>] ? xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
[<ffffffff813b3434>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff810ed179>] ? down_read_nested+0x89/0xa0
[<ffffffffa05aa7f2>] ? xfs_ilock+0x122/0x250 [xfs]
[<ffffffffa05aa7f2>] xfs_ilock+0x122/0x250 [xfs]
[<ffffffffa05aa97f>] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
[<ffffffffa05701d0>] xfs_attr_get+0x90/0xe0 [xfs]
[<ffffffffa0565e07>] xfs_xattr_get+0x37/0x50 [xfs]
[<ffffffff8124842f>] generic_getxattr+0x4f/0x70
[<ffffffff8133fd9e>] inode_doinit_with_dentry+0x1ae/0x650
[<ffffffff81340e0c>] selinux_d_instantiate+0x1c/0x20
[<ffffffff813351bb>] security_d_instantiate+0x1b/0x30
[<ffffffff81237db0>] d_instantiate+0x50/0x70
[<ffffffff81237e85>] d_tmpfile+0xb5/0xc0
[<ffffffffa05add02>] xfs_create_tmpfile+0x362/0x410 [xfs]
[<ffffffffa0559ac8>] xfs_vn_tmpfile+0x18/0x20 [xfs]
[<ffffffff81230388>] path_openat+0x228/0x6a0
[<ffffffff810230f9>] ? sched_clock+0x9/0x10
[<ffffffff8105a427>] ? kvm_clock_read+0x27/0x40
[<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0
[<ffffffff8123101a>] do_filp_open+0x3a/0x90
[<ffffffff817845e7>] ? _raw_spin_unlock+0x27/0x40
[<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0
[<ffffffff8121e3ce>] do_sys_open+0x12e/0x210
[<ffffffff8121e4ce>] SyS_open+0x1e/0x20
[<ffffffff8178eda9>] system_call_fastpath+0x16/0x1b

xfs_vn_tmpfile() also fails to initialize security on the newly created
inode.

Pull the d_tmpfile() call up into xfs_vn_tmpfile() after the transaction
has been committed and the inode unlocked. Also, initialize security on
the inode based on the parent directory provided via the tmpfile call.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17 08:15:30 +10:00
Eric Sandeen
8d6c121018 xfs: fix buffer use after free on IO error
When testing exhaustion of dm snapshots, the following appeared
with CONFIG_DEBUG_OBJECTS_FREE enabled:

ODEBUG: free active (active state 0) object type: work_struct hint: xfs_buf_iodone_work+0x0/0x1d0 [xfs]

indicating that we'd freed a buffer which still had a pending reference,
down this path:

[  190.867975]  [<ffffffff8133e6fb>] debug_check_no_obj_freed+0x22b/0x270
[  190.880820]  [<ffffffff811da1d0>] kmem_cache_free+0xd0/0x370
[  190.892615]  [<ffffffffa02c5924>] xfs_buf_free+0xe4/0x210 [xfs]
[  190.905629]  [<ffffffffa02c6167>] xfs_buf_rele+0xe7/0x270 [xfs]
[  190.911770]  [<ffffffffa034c826>] xfs_trans_read_buf_map+0x7b6/0xac0 [xfs]

At issue is the fact that if IO fails in xfs_buf_iorequest,
we'll queue completion unconditionally, and then call
xfs_buf_rele; but if IO failed, there are no IOs remaining,
and xfs_buf_rele will free the bp while work is still queued.

Fix this by not scheduling completion if the buffer has
an error on it; run it immediately.  The rest is only comment
changes.

Thanks to dchinner for spotting the root cause.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17 08:15:28 +10:00
Dave Chinner
07d5035a28 xfs: wrong error sign conversion during failed DIO writes
We negate the error value being returned from a generic function
incorrectly. The code path that it is running in returned negative
errors, so there is no need to negate it to get the correct error
signs here.

This was uncovered by generic/019.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17 08:15:27 +10:00
Dave Chinner
9c23eccc1e xfs: unmount does not wait for shutdown during unmount
And interesting situation can occur if a log IO error occurs during
the unmount of a filesystem. The cases reported have the same
signature - the update of the superblock counters fails due to a log
write IO error:

XFS (dm-16): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa08a44a1
XFS (dm-16): Log I/O Error Detected.  Shutting down filesystem
XFS (dm-16): Unable to update superblock counters. Freespace may not be correct on next mount.
XFS (dm-16): xfs_log_force: error 5 returned.
XFS (¿-¿¿¿): Please umount the filesystem and rectify the problem(s)

It can be seen that the last line of output contains a corrupt
device name - this is because the log and xfs_mount structures have
already been freed by the time this message is printed. A kernel
oops closely follows.

The issue is that the shutdown is occurring in a separate IO
completion thread to the unmount. Once the shutdown processing has
started and all the iclogs are marked with XLOG_STATE_IOERROR, the
log shutdown code wakes anyone waiting on a log force so they can
process the shutdown error. This wakes up the unmount code that
is doing a synchronous transaction to update the superblock
counters.

The unmount path now sees all the iclogs are marked with
XLOG_STATE_IOERROR and so never waits on them again, knowing that if
it does, there will not be a wakeup trigger for it and we will hang
the unmount if we do. Hence the unmount runs through all the
remaining code and frees all the filesystem structures while the
xlog_iodone() is still processing the shutdown. When the log
shutdown processing completes, xfs_do_force_shutdown() emits the
"Please umount the filesystem and rectify the problem(s)" message,
and xlog_iodone() then aborts all the objects attached to the iclog.
An iclog that has already been freed....

The real issue here is that there is no serialisation point between
the log IO and the unmount. We have serialisations points for log
writes, log forces, reservations, etc, but we don't actually have
any code that wakes for log IO to fully complete. We do that for all
other types of object, so why not iclogbufs?

Well, it turns out that we can easily do this. We've got xfs_buf
handles, and that's what everyone else uses for IO serialisation.
i.e. bp->b_sema. So, lets hold iclogbufs locked over IO, and only
release the lock in xlog_iodone() when we are finished with the
buffer. That way before we tear down the iclog, we can lock and
unlock the buffer to ensure IO completion has finished completely
before we tear it down.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Bob Mastors <bob.mastors@solidfire.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17 08:15:26 +10:00
Dave Chinner
d39a2ced0f xfs: collapse range is delalloc challenged
FSX has been detecting data corruption after to collapse range
calls. The key observation is that the offset of the last extent in
the file was not being shifted, and hence when the file size was
adjusted it was truncating away data because the extents handled
been correctly shifted.

Tracing indicated that before the collapse, the extent list looked
like:

....
ino 0x5788 state  idx 6 offset 26 block 195904 count 10 flag 0
ino 0x5788 state  idx 7 offset 39 block 195917 count 35 flag 0
ino 0x5788 state  idx 8 offset 86 block 195964 count 32 flag 0

and after the shift of 2 blocks:

ino 0x5788 state  idx 6 offset 24 block 195904 count 10 flag 0
ino 0x5788 state  idx 7 offset 37 block 195917 count 35 flag 0
ino 0x5788 state  idx 8 offset 86 block 195964 count 32 flag 0

Note that the last extent did not change offset. After the changing
of the file size:

ino 0x5788 state  idx 6 offset 24 block 195904 count 10 flag 0
ino 0x5788 state  idx 7 offset 37 block 195917 count 35 flag 0
ino 0x5788 state  idx 8 offset 86 block 195964 count 30 flag 0

You can see that the last extent had it's length truncated,
indicating that we've lost data.

The reason for this is that the xfs_bmap_shift_extents() loop uses
XFS_IFORK_NEXTENTS() to determine how many extents are in the inode.
This, unfortunately, doesn't take into account delayed allocation
extents - it's a count of physically allocated extents - and hence
when the file being collapsed has a delalloc extent like this one
does prior to the range being collapsed:

....
ino 0x5788 state  idx 4 offset 11 block 4503599627239429 count 1 flag 0
....

it gets the count wrong and terminates the shift loop early.

Fix it by using the in-memory extent array size that includes
delayed allocation extents to determine the number of extents on the
inode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17 08:15:25 +10:00
Dave Chinner
0e1f789d0d xfs: don't map ranges that span EOF for direct IO
Al Viro tracked down the problem that has caused generic/263 to fail
on XFS since the test was introduced. If is caused by
xfs_get_blocks() mapping a single extent that spans EOF without
marking it as buffer-new() so that the direct IO code does not zero
the tail of the block at the new EOF. This is a long standing bug
that has been around for many, many years.

Because xfs_get_blocks() starts the map before EOF, it can't set
buffer_new(), because that causes he direct IO code to also zero
unaligned sectors at the head of the IO. This would overwrite valid
data with zeros, and hence we cannot validly return a single extent
that spans EOF to direct IO.

Fix this by detecting a mapping that spans EOF and truncate it down
to EOF. This results in the the direct IO code doing the right thing
for unaligned data blocks before EOF, and then returning to get
another mapping for the region beyond EOF which XFS treats correctly
by setting buffer_new() on it. This makes direct Io behave correctly
w.r.t. tail block zeroing beyond EOF, and fsx is happy about that.

Again, thanks to Al Viro for finding what I couldn't.

[ dchinner: Fix for __divdi3 build error:

	Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
	Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
	Signed-off-by: Mark Tinguely <tinguely@sgi.com>
	Reviewed-by: Eric Sandeen <sandeen@redhat.com>
]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17 08:15:19 +10:00
Tejun Heo
33ac1257ff sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()
All device_schedule_callback_owner() users are converted to use
device_remove_file_self().  Remove now unused
{sysfs|device}_schedule_callback_owner().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-16 11:56:33 -07:00
Tejun Heo
4afddd60a7 kernfs: protect lazy kernfs_iattrs allocation with mutex
kernfs_iattrs is allocated lazily when operations which require it
take place; unfortunately, the lazy allocation and returning weren't
properly synchronized and when there are multiple concurrent
operations, it might end up returning kernfs_iattrs which hasn't
finished initialization yet or different copies to different callers.

Fix it by synchronizing with a mutex.  This can be smarter with memory
barriers but let's go there if it actually turns out to be necessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/533ABA32.9080602@oracle.com
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # 3.14
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-16 11:54:40 -07:00
Thomas Bächler
a2a4dc494a fs: Don't return 0 from get_anon_bdev
Commit 9e30cc9595 removed an internal mount. This
has the side-effect that rootfs now has FSID 0. Many
userspace utilities assume that st_dev in struct stat
is never 0, so this change breaks a number of tools in
early userspace.

Since we don't know how many userspace programs are affected,
make sure that FSID is at least 1.

References: http://article.gmane.org/gmane.linux.kernel/1666905
References: http://permalink.gmane.org/gmane.linux.utilities.util-linux-ng/8557
Cc: 3.14 <stable@vger.kernel.org>
Signed-off-by: Thomas Bächler <thomas@archlinux.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Tested-by: Alexandre Demers <alexandre.f.demers@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-16 11:53:08 -07:00
Cyril Roelandt
8e3ecc8769 fs: cifs: remove unused variable.
In SMB2_set_compression(), the "res_key" variable is only initialized to NULL
and later kfreed. It is therefore useless and should be removed.

Found with the following semantic patch:

<smpl>
@@
identifier foo;
identifier f;
type T;
@@
* f(...) {
...
* T *foo = NULL;
... when forall
    when != foo
* kfree(foo);
...
}
</smpl>

Signed-off-by: Cyril Roelandt <tipecaml@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2014-04-16 13:51:46 -05:00
Steve French
60977fcc80 Return correct error on query of xattr on file with empty xattrs
xfstest 020 detected a problem with cifs xattr handling.  When a file
had an empty xattr list, we returned success (with an empty xattr value)
on query of particular xattrs rather than returning ENODATA.
This patch fixes it so that query of an xattr returns ENODATA when the
xattr list is empty for the file.

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2014-04-16 13:51:46 -05:00
Sachin Prabhu
c11f1df500 cifs: Wait for writebacks to complete before attempting write.
Problem reported in Red Hat bz 1040329 for strict writes where we cache
only when we hold oplock and write direct to the server when we don't.

When we receive an oplock break, we first change the oplock value for
the inode in cifsInodeInfo->oplock to indicate that we no longer hold
the oplock before we enqueue a task to flush changes to the backing
device. Once we have completed flushing the changes, we return the
oplock to the server.

There are 2 ways here where we can have data corruption
1) While we flush changes to the backing device as part of the oplock
break, we can have processes write to the file. These writes check for
the oplock, find none and attempt to write directly to the server.
These direct writes made while we are flushing from cache could be
overwritten by data being flushed from the cache causing data
corruption.
2) While a thread runs in cifs_strict_writev, the machine could receive
and process an oplock break after the thread has checked the oplock and
found that it allows us to cache and before we have made changes to the
cache. In that case, we end up with a dirty page in cache when we
shouldn't have any. This will be flushed later and will overwrite all
subsequent writes to the part of the file represented by this page.

Before making any writes to the server, we need to confirm that we are
not in the process of flushing data to the server and if we are, we
should wait until the process is complete before we attempt the write.
We should also wait for existing writes to complete before we process
an oplock break request which changes oplock values.

We add a version specific  downgrade_oplock() operation to allow for
differences in the oplock values set for the different smb versions.

Cc: stable@vger.kernel.org
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-04-16 13:51:46 -05:00
Anatol Pomozov
e02ba72aab aio: block io_destroy() until all context requests are completed
deletes aio context and all resources related to. It makes sense that
no IO operations connected to the context should be running after the context
is destroyed. As we removed io_context we have no chance to
get requests status or call io_getevents().

man page for io_destroy says that this function may block until
all context's requests are completed. Before kernel 3.11 io_destroy()
blocked indeed, but since aio refactoring in 3.11 it is not true anymore.

Here is a pseudo-code that shows a testcase for a race condition discovered
in 3.11:

  initialize io_context
  io_submit(read to buffer)
  io_destroy()

  // context is destroyed so we can free the resources
  free(buffers);

  // if the buffer is allocated by some other user he'll be surprised
  // to learn that the buffer still filled by an outstanding operation
  // from the destroyed io_context

The fix is straight-forward - add a completion struct and wait on it
in io_destroy, complete() should be called when number of in-fligh requests
reaches zero.

If two or more io_destroy() called for the same context simultaneously then
only the first one waits for IO completion, other calls behaviour is undefined.

Tested: ran http://pastebin.com/LrPsQ4RL testcase for several hours and
  do not see the race condition anymore.

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-04-16 13:38:04 -04:00
Trond Myklebust
1f2edbe3fe NFS: Don't ignore suid/sgid bit changes after a successful write
If we suspect that the server may have cleared the suid/sgid bit,
then mark the inode for revalidation.

Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-04-15 23:24:43 -04:00
Trond Myklebust
43b6535e71 NFS: Don't declare inode uptodate unless all attributes were checked
Fix a bug, whereby nfs_update_inode() was declaring the inode to be
up to date despite not having checked all the attributes.
The bug occurs because the temporary variable in which we cache
the validity information is 'sanitised' before reapplying to
nfsi->cache_validity.

Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Cc: stable@vger.kernel.org # 3.5+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-04-15 23:24:43 -04:00
Kinglong Mee
4dfc7fdb9e NFS: Fix memroy leak for double mounts
When double mounting same nfs filesystem, the devname saved in d_fsdata
will be lost.The second mount should not change the devname that
be saved in d_fsdata.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-04-15 10:29:25 -04:00
Jeff Layton
f1c6bb2cb8 locks: allow __break_lease to sleep even when break_time is 0
A fl->fl_break_time of 0 has a special meaning to the lease break code
that basically means "never break the lease". knfsd uses this to ensure
that leases don't disappear out from under it.

Unfortunately, the code in __break_lease can end up passing this value
to wait_event_interruptible as a timeout, which prevents it from going
to sleep at all. This makes __break_lease to spin in a tight loop and
causes soft lockups.

Fix this by ensuring that we pass a minimum value of 1 as a timeout
instead.

Cc: <stable@vger.kernel.org>
Cc: J. Bruce Fields <bfields@fieldses.org>
Reported-by: Terry Barnaby <terry1@beam.ltd.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-04-15 06:17:49 -04:00
Azat Khuzhin
036acea2ce ext4: fix ext4_count_free_clusters() with EXT4FS_DEBUG and bigalloc enabled
With bigalloc enabled we must use EXT4_CLUSTERS_PER_GROUP() instead of
EXT4_BLOCKS_PER_GROUP() otherwise we will go beyond the allocated buffer.

$ mount -t ext4 /dev/vde /vde
[   70.573993] EXT4-fs DEBUG (fs/ext4/mballoc.c, 2346): ext4_mb_alloc_groupinfo:
[   70.575174] allocated s_groupinfo array for 1 meta_bg's
[   70.576172] EXT4-fs DEBUG (fs/ext4/super.c, 2092): ext4_check_descriptors:
[   70.576972] Checking group descriptorsBUG: unable to handle kernel paging request at ffff88006ab56000
[   72.463686] IP: [<ffffffff81394eb9>] __bitmap_weight+0x2a/0x7f
[   72.464168] PGD 295e067 PUD 2961067 PMD 7fa8e067 PTE 800000006ab56060
[   72.464738] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[   72.465139] Modules linked in:
[   72.465402] CPU: 1 PID: 3560 Comm: mount Tainted: G        W    3.14.0-rc2-00069-ge57bce1 #60
[   72.466079] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   72.466505] task: ffff88007ce6c8a0 ti: ffff88006b7f0000 task.ti: ffff88006b7f0000
[   72.466505] RIP: 0010:[<ffffffff81394eb9>]  [<ffffffff81394eb9>] __bitmap_weight+0x2a/0x7f
[   72.466505] RSP: 0018:ffff88006b7f1c00  EFLAGS: 00010206
[   72.466505] RAX: 0000000000000000 RBX: 000000000000050a RCX: 0000000000000040
[   72.466505] RDX: 0000000000000000 RSI: 0000000000080000 RDI: 0000000000000000
[   72.466505] RBP: ffff88006b7f1c28 R08: 0000000000000002 R09: 0000000000000000
[   72.466505] R10: 000000000000babe R11: 0000000000000400 R12: 0000000000080000
[   72.466505] R13: 0000000000000200 R14: 0000000000002000 R15: ffff88006ab55000
[   72.466505] FS:  00007f43ba1fa840(0000) GS:ffff88007f800000(0000) knlGS:0000000000000000
[   72.466505] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   72.466505] CR2: ffff88006ab56000 CR3: 000000006b7e6000 CR4: 00000000000006e0
[   72.466505] Stack:
[   72.466505]  ffff88006ab65000 0000000000000000 0000000000000000 0000000000010000
[   72.466505]  ffff88006ab6f400 ffff88006b7f1c58 ffffffff81396bb8 0000000000010000
[   72.466505]  0000000000000000 ffff88007b869a90 ffff88006a48a000 ffff88006b7f1c70
[   72.466505] Call Trace:
[   72.466505]  [<ffffffff81396bb8>] memweight+0x5f/0x8a
[   72.466505]  [<ffffffff811c3b19>] ext4_count_free+0x13/0x21
[   72.466505]  [<ffffffff811c396c>] ext4_count_free_clusters+0xdb/0x171
[   72.466505]  [<ffffffff811e3bdd>] ext4_fill_super+0x117c/0x28ef
[   72.466505]  [<ffffffff81391569>] ? vsnprintf+0x1c7/0x3f7
[   72.466505]  [<ffffffff8114d8dc>] mount_bdev+0x145/0x19c
[   72.466505]  [<ffffffff811e2a61>] ? ext4_calculate_overhead+0x2a1/0x2a1
[   72.466505]  [<ffffffff811dab1d>] ext4_mount+0x15/0x17
[   72.466505]  [<ffffffff8114e3aa>] mount_fs+0x67/0x150
[   72.466505]  [<ffffffff811637ea>] vfs_kern_mount+0x64/0xde
[   72.466505]  [<ffffffff81165d19>] do_mount+0x6fe/0x7f5
[   72.466505]  [<ffffffff81126cc8>] ? strndup_user+0x3a/0xd9
[   72.466505]  [<ffffffff8116604b>] SyS_mount+0x85/0xbe
[   72.466505]  [<ffffffff81619e90>] tracesys+0xdd/0xe2
[   72.466505] Code: c3 89 f0 b9 40 00 00 00 55 99 48 89 e5 41 57 f7 f9 41 56 49 89 ff 41 55 45 31 ed 41 54 41 89 f4 53 31 db 41 89 c6 45 39 ee 7e 10 <4b> 8b 3c ef 49 ff c5 e8 bf ff ff ff 01 c3 eb eb 31 c0 45 85 f6
[   72.466505] RIP  [<ffffffff81394eb9>] __bitmap_weight+0x2a/0x7f
[   72.466505]  RSP <ffff88006b7f1c00>
[   72.466505] CR2: ffff88006ab56000
[   72.466505] ---[ end trace 7d051a08ae138573 ]---
Killed

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-14 23:36:15 -04:00
Christoph Jaeger
0040e606e3 btrfs: fix use-after-free in mount_subvol()
Pointer 'newargs' is used after the memory that it points to has already
been freed.

Picked up by Coverity - CID 1201425.

Fixes: 0723a0473f ("btrfs: allow mounting btrfs subvolumes with
different ro/rw options")
Signed-off-by: Christoph Jaeger <christophjaeger@linux.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-14 11:31:08 -07:00
Fabian Frederick
24e4a0f3de fs/jfs/jfs_inode.c: atomically set inode->i_flags
According to commit 5f16f3225b

ext4: atomically set inode->i_flags in ext4_set_inode_flags()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
2014-04-14 12:56:14 -05:00
Christoph Hellwig
8b90a33f47 xfs: don't try to use the filestream allocator for metadata allocations
xfs_bmap_btalloc_nullfb has two entirely different control flows when
using the filestream allocator vs the regular one, but it get the
conditionals wrong and ends up mixing the two for metadata allocations.
Fix this by adding a missing userdata check and slight refactoring.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:37:42 +10:00
Eric Sandeen
5e06d14894 xfs: remove unused calculation in xfs_dir2_sf_addname()
The "add_entsize" calculated here is never used.
"incr_isize" accounts for the inode expansion of the
old entries + parent + new entry all by itself.

Once we've removed add_entsize there, it's just a pointless
intermediate variable elsewhere, so remove it.
For that matter, old_isize is gratuitous too, so nuke that.

And add a few comments so the magic "+1's" and "+2's" make
a bit more sense.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:07:23 +10:00
Eric Sandeen
e5e98bc64d xfs: remove pointless pointer increment in xfs_dir2_block_compact()
xfs_dir2_block_compact() is passed a pointer to *blp, and
advances it locally - but nobody uses the pointer (locally)
after that.

This behavior came about as part of prior refactoring,

20f7e9f xfs: factor dir2 block read operations

and looking at the code as it was before, it seems quite clear
that this change introduced a bug; the pre-refactoring code
expects blp to be modified after compaction.
And indeed it did; see this commit which fixed it:

37f1356 xfs: recalculate leaf entry pointer after compacting a dir2 block

So the bug was introduced & resolved in the 3.8 cycle.

Whoops.  Well, it's fixed now, and mystery solved; just remove
the now-pointless local increment of the blp pointer.

(I guess we should have run clang earlier!)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:06:46 +10:00
Eric Sandeen
bbe4c66869 xfs: remove unused trans pointer arg from xlog_recover_unmount_trans()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:06:25 +10:00
Eric Sandeen
e4a1e29cb0 xfs: remove unused ail pointer arg from xfs_trans_ail_cursor_done()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:06:05 +10:00
Eric Sandeen
bda65ef8a8 xfs: remove unused xfs_mount arg from xfs_symlink_hdr_ok()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:05:43 +10:00
Eric Sandeen
fd9fdba6c3 xfs: remove unused bp arg from xfs_iflush_fork()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:04:46 +10:00
Eric Sandeen
e009400870 xfs: remove unused pag ptr arg from iterator execute functions
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:04:19 +10:00
Eric Sandeen
6f8950cd73 xfs: remove unused length arg from alloc_block ops
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:03:53 +10:00
Eric Sandeen
6ea94bb5b3 xfs: remove unused mp arg from xfs_calc_dquots_per_chunk()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:03:34 +10:00
Eric Sandeen
2599405374 xfs: remove unused mp arg from xfs_dir2 dataptr/byte functions
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:02:30 +10:00
Eric Sandeen
9df2dd0b0d xfs: remove unused tp arg from xfs_da_reada_buf & callers
This one hits a few functions as we unravel the unused arg
up through the callers.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:01:59 +10:00
Eric Sandeen
72b0636bb7 xfs: remove unused bip arg from xfs_buf_item_log_segment()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:01:34 +10:00
Eric Sandeen
87937bf8ca xfs: remove unused flags arg from _xfs_buf_get_pages()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:01:20 +10:00
Eric Sandeen
34dcefd717 xfs: remove unused args from xfs_alloc_buftarg()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:01:00 +10:00
Eric Sandeen
a96c41519a xfs: remove unused blocksize arg from xfs_setsize_buftarg()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:00:29 +10:00
Eric Sandeen
0d7409b142 xfs: remove unused level arg from xfs_btree_read_buf_block()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:59:56 +10:00
Eric Sandeen
6a9edd3d54 xfs: remove unused mp arg from xfs_bmap_forkoff_reset()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:59:26 +10:00
Eric Sandeen
152d93b732 xfs: remove unused mp arg from xfs_bmdr_maxrecs()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:58:51 +10:00
Eric Sandeen
6d0081a3da xfs: remove unused mp arg from xfs_attr3_rmt_hdr_ok()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:58:29 +10:00
Eric Sandeen
7fb2cd4d32 xfs: remove unused tp arg from xfs_bmap_last_offset() and callers
remove unused transaction pointer from various
callchains leading to xfs_bmap_last_offset().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:58:05 +10:00
Dave Chinner
897b73b6a2 xfs: zeroing space needs to punch delalloc blocks
When we are zeroing space andit is covered by a delalloc range, we
need to punch the delalloc range out before we truncate the page
cache. Failing to do so leaves and inconsistency between the page
cache and the extent tree, which we later trip over when doing
direct IO over the same range.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:15:11 +10:00
Dave Chinner
aad3f3755e xfs: xfs_vm_write_end truncates too much on failure
Similar to the write_begin problem, xfs-vm_write_end will truncate
back to the old EOF, potentially removing page cache from over the
top of delalloc blocks with valid data in them. Fix this by
truncating back to just the start of the failed write.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:14:11 +10:00
Dave Chinner
72ab70a19b xfs: write failure beyond EOF truncates too much data
If we fail a write beyond EOF and have to handle it in
xfs_vm_write_begin(), we truncate the inode back to the current inode
size. This doesn't take into account the fact that we may have
already made successful writes to the same page (in the case of block
size < page size) and hence we can truncate the page cache away from
blocks with valid data in them. If these blocks are delayed
allocation blocks, we now have a mismatch between the page cache and
the extent tree, and this will trigger - at minimum - a delayed
block count mismatch assert when the inode is evicted from the cache.
We can also trip over it when block mapping for direct IO - this is
the most common symptom seen from fsx and fsstress when run from
xfstests.

Fix it by only truncating away the exact range we are updating state
for in this write_begin call.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:13:29 +10:00
Dave Chinner
4ab9ed578e xfs: kill buffers over failed write ranges properly
When a write fails, if we don't clear the delalloc flags from the
buffers over the failed range, they can persist beyond EOF and cause
problems. writeback will see the pages in the page cache, see they
are dirty and continually retry the write, assuming that the page
beyond EOF is just racing with a truncate. The page will eventually
be released due to some other operation (e.g. direct IO), and it
will not pass through invalidation because it is dirty. Hence it
will be released with buffer_delay set on it, and trigger warnings
in xfs_vm_releasepage() and assert fail in xfs_file_aio_write_direct
because invalidation failed and we didn't write the corect amount.

This causes failures on block size < page size filesystems in fsx
and fsstress workloads run by xfstests.

Fix it by completely trashing any state on the buffer that could be
used to imply that it contains valid data when the delalloc range
over the buffer is punched out during the failed write handling.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:11:58 +10:00
Geert Uytterhoeven
e686bd8dc5 cifs: Use min_t() when comparing "size_t" and "unsigned long"
On 32 bit, size_t is "unsigned int", not "unsigned long", causing the
following warning when comparing with PAGE_SIZE, which is always "unsigned
long":

  fs/cifs/file.c: In function ‘cifs_readdata_to_iov’:
  fs/cifs/file.c:2757: warning: comparison of distinct pointer types lacks a cast

Introduced by commit 7f25bba819 ("cifs_iovec_read: keep iov_iter
between the calls of cifs_readdata_to_iov()"), which changed the
signedness of "remaining" and the code from min_t() to min().

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-13 14:10:26 -07:00
Dmitry Monakhov
a18ed359bd ext4: always check ext4_ext_find_extent result
Where are some places where logic guaranties us that extent we are
searching exits, but this may not be true due to on-disk data
corruption. If such corruption happens we must prevent possible
null pointer dereferences.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-13 15:41:13 -04:00
Dmitry Monakhov
8dc79ec4c0 ext4: fix error handling in ext4_ext_shift_extents
Fix error handling by adding some.  :-)

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-13 15:05:42 -04:00
jon ernst
e2cbd58741 ext4: silence sparse check warning for function ext4_trim_extent
This fixes the following sparse warning:

     CHECK   fs/ext4/mballoc.c
   fs/ext4/mballoc.c:5019:9: warning: context imbalance in
   'ext4_trim_extent' - unexpected unlock

Signed-off-by: "Jon Ernst" <jonernst07@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-12 23:01:28 -04:00
Theodore Ts'o
40c406c74e ext4: COLLAPSE_RANGE only works on extent-based files
Unfortunately, we weren't checking to make sure of this the inode was
extent-based before attempt operate on it.  Hilarity ensues.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
2014-04-12 22:53:53 -04:00
Linus Torvalds
454fd351f2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull yet more networking updates from David Miller:

 1) Various fixes to the new Redpine Signals wireless driver, from
    Fariya Fatima.

 2) L2TP PPP connect code takes PMTU from the wrong socket, fix from
    Dmitry Petukhov.

 3) UFO and TSO packets differ in whether they include the protocol
    header in gso_size, account for that in skb_gso_transport_seglen().
   From Florian Westphal.

 4) If VLAN untagging fails, we double free the SKB in the bridging
    output path.  From Toshiaki Makita.

 5) Several call sites of sk->sk_data_ready() were referencing an SKB
    just added to the socket receive queue in order to calculate the
    second argument via skb->len.  This is dangerous because the moment
    the skb is added to the receive queue it can be consumed in another
    context and freed up.

    It turns out also that none of the sk->sk_data_ready()
    implementations even care about this second argument.

    So just kill it off and thus fix all these use-after-free bugs as a
    side effect.

 6) Fix inverted test in tcp_v6_send_response(), from Lorenzo Colitti.

 7) pktgen needs to do locking properly for LLTX devices, from Daniel
    Borkmann.

 8) xen-netfront driver initializes TX array entries in RX loop :-) From
    Vincenzo Maffione.

 9) After refactoring, some tunnel drivers allow a tunnel to be
    configured on top itself.  Fix from Nicolas Dichtel.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
  vti: don't allow to add the same tunnel twice
  gre: don't allow to add the same tunnel twice
  drivers: net: xen-netfront: fix array initialization bug
  pktgen: be friendly to LLTX devices
  r8152: check RTL8152_UNPLUG
  net: sun4i-emac: add promiscuous support
  net/apne: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  net: ipv6: Fix oif in TCP SYN+ACK route lookup.
  drivers: net: cpsw: enable interrupts after napi enable and clearing previous interrupts
  drivers: net: cpsw: discard all packets received when interface is down
  net: Fix use after free by removing length arg from sk_data_ready callbacks.
  Drivers: net: hyperv: Address UDP checksum issues
  Drivers: net: hyperv: Negotiate suitable ndis version for offload support
  Drivers: net: hyperv: Allocate memory for all possible per-pecket information
  bridge: Fix double free and memory leak around br_allowed_ingress
  bonding: Remove debug_fs files when module init fails
  i40evf: program RSS LUT correctly
  i40evf: remove open-coded skb_cow_head
  ixgb: remove open-coded skb_cow_head
  igbvf: remove open-coded skb_cow_head
  ...
2014-04-12 17:31:22 -07:00
Linus Torvalds
96c57ade7e ceph: fix pr_fmt() redefinition
The vfs merge caused a latent bug to show up:

   In file included from fs/ceph/super.h:4:0,
                    from fs/ceph/ioctl.c:3:
   include/linux/ceph/ceph_debug.h:4:0: warning: "pr_fmt" redefined [enabled by default]
    #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
    ^
   In file included from include/linux/kernel.h:13:0,
                    from include/linux/uio.h:12,
                    from include/linux/socket.h:7,
                    from include/uapi/linux/in.h:22,
                    from include/linux/in.h:23,
                    from fs/ceph/ioctl.c:1:
   include/linux/printk.h:214:0: note: this is the location of the previous definition
    #define pr_fmt(fmt) fmt
    ^

where the reason is that <linux/ceph_debug.h> is included much too late
for the "pr_fmt()" define.

The include of <linux/ceph_debug.h> needs to be the first include in the
file, but fs/ceph/ioctl.c had for some reason missed that, and it wasn't
noticeable until some unrelated header file changes brought in an
indirect earlier include of <linux/kernel.h>.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-12 15:39:53 -07:00
Linus Torvalds
5166701b36 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "The first vfs pile, with deep apologies for being very late in this
  window.

  Assorted cleanups and fixes, plus a large preparatory part of iov_iter
  work.  There's a lot more of that, but it'll probably go into the next
  merge window - it *does* shape up nicely, removes a lot of
  boilerplate, gets rid of locking inconsistencie between aio_write and
  splice_write and I hope to get Kent's direct-io rewrite merged into
  the same queue, but some of the stuff after this point is having
  (mostly trivial) conflicts with the things already merged into
  mainline and with some I want more testing.

  This one passes LTP and xfstests without regressions, in addition to
  usual beating.  BTW, readahead02 in ltp syscalls testsuite has started
  giving failures since "mm/readahead.c: fix readahead failure for
  memoryless NUMA nodes and limit readahead pages" - might be a false
  positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  missing bits of "splice: fix racy pipe->buffers uses"
  cifs: fix the race in cifs_writev()
  ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
  kill generic_file_buffered_write()
  ocfs2_file_aio_write(): switch to generic_perform_write()
  ceph_aio_write(): switch to generic_perform_write()
  xfs_file_buffered_aio_write(): switch to generic_perform_write()
  export generic_perform_write(), start getting rid of generic_file_buffer_write()
  generic_file_direct_write(): get rid of ppos argument
  btrfs_file_aio_write(): get rid of ppos
  kill the 5th argument of generic_file_buffered_write()
  kill the 4th argument of __generic_file_aio_write()
  lustre: don't open-code kernel_recvmsg()
  ocfs2: don't open-code kernel_recvmsg()
  drbd: don't open-code kernel_recvmsg()
  constify blk_rq_map_user_iov() and friends
  lustre: switch to kernel_sendmsg()
  ocfs2: don't open-code kernel_sendmsg()
  take iov_iter stuff to mm/iov_iter.c
  process_vm_access: tidy up a bit
  ...
2014-04-12 14:49:50 -07:00
Linus Torvalds
0b747172dc Merge git://git.infradead.org/users/eparis/audit
Pull audit updates from Eric Paris.

* git://git.infradead.org/users/eparis/audit: (28 commits)
  AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
  audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
  audit: do not cast audit_rule_data pointers pointlesly
  AUDIT: Allow login in non-init namespaces
  audit: define audit_is_compat in kernel internal header
  kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
  sched: declare pid_alive as inline
  audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
  syscall_get_arch: remove useless function arguments
  audit: remove stray newline from audit_log_execve_info() audit_panic() call
  audit: remove stray newlines from audit_log_lost messages
  audit: include subject in login records
  audit: remove superfluous new- prefix in AUDIT_LOGIN messages
  audit: allow user processes to log from another PID namespace
  audit: anchor all pid references in the initial pid namespace
  audit: convert PPIDs to the inital PID namespace.
  pid: get pid_t ppid of task in init_pid_ns
  audit: rename the misleading audit_get_context() to audit_take_context()
  audit: Add generic compat syscall support
  audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
  ...
2014-04-12 12:38:53 -07:00
Zheng Liu
847c6c422a ext4: fix byte order problems introduced by the COLLAPSE_RANGE patches
This commit tries to fix some byte order issues that is found by sparse
check.

$ make M=fs/ext4 C=2 CF=-D__CHECK_ENDIAN__
...
  CHECK   fs/ext4/extents.c
fs/ext4/extents.c:5232:41: warning: restricted __le32 degrades to integer
fs/ext4/extents.c:5236:52: warning: bad assignment (-=) to restricted __le32
fs/ext4/extents.c:5258:45: warning: bad assignment (-=) to restricted __le32
fs/ext4/extents.c:5303:28: warning: restricted __le32 degrades to integer
fs/ext4/extents.c:5318:18: warning: incorrect type in assignment (different base types)
fs/ext4/extents.c:5318:18:    expected unsigned int [unsigned] [usertype] ex_start
fs/ext4/extents.c:5318:18:    got restricted __le32 [usertype] ee_block
fs/ext4/extents.c:5319:24: warning: restricted __le32 degrades to integer
fs/ext4/extents.c:5334:31: warning: incorrect type in assignment (different base types)
...

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-12 12:45:55 -04:00
Theodore Ts'o
6e6358fc3c ext4: use i_size_read in ext4_unaligned_aio()
We haven't taken i_mutex yet, so we need to use i_size_read().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-04-12 12:45:25 -04:00
Lukas Czerner
0790b31b69 fs: disallow all fallocate operation on active swapfile
Currently some file system have IS_SWAPFILE check in their fallocate
implementations and some do not. However we should really prevent any
fallocate operation on swapfile so move the check to vfs and remove the
redundant checks from the file systems fallocate implementations.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-12 10:05:37 -04:00
Lukas Czerner
23fffa925e fs: move falloc collapse range check into the filesystem methods
Currently in do_fallocate in collapse range case we're checking
whether offset + len is not bigger than i_size.  However there is
nothing which would prevent i_size from changing so the check is
pointless.  It should be done in the file system itself and the file
system needs to make sure that i_size is not going to change.  The
i_size check for the other fallocate modes are also done in the
filesystems.

As it is now we can easily crash the kernel by having two processes
doing truncate and fallocate collapse range at the same time.  This
can be reproduced on ext4 and it is theoretically possible on xfs even
though I was not able to trigger it with this simple test.

This commit removes the check from do_fallocate and adds it to the
file system.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-04-12 09:56:41 -04:00
Lukas Czerner
8fc61d9263 fs: prevent doing FALLOC_FL_ZERO_RANGE on append only file
Currently punch hole and collapse range fallocate operation are not
allowed on append only file. This should be case for zero range as well.
Fix it by allowing only pure fallocate (possibly with keep size set).

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-12 09:51:34 -04:00
Lukas Czerner
9ef06cec7c ext4: remove unnecessary check for APPEND and IMMUTABLE
All the checks IS_APPEND and IS_IMMUTABLE for the fallocate operation on
the inode are done in vfs. No need to do this again in ext4. Remove it.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-12 09:47:00 -04:00
Al Viro
19dfc1f5f2 cifs: fix the race in cifs_writev()
O_APPEND handling there hadn't been completely fixed by Pavel's
patch; it checks the right value, but it's racy - we can't really
do that until i_mutex has been taken.

Fix by switching to __generic_file_aio_write() (open-coding
generic_file_aio_write(), actually) and pulling mutex_lock() above
inode_size_read().

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-12 06:52:48 -04:00
Al Viro
eab87235c0 ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
ceph_osdc_put_request(ERR_PTR(-error)) oopses.  What we want there
is break, not goto out.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-12 06:51:51 -04:00
Linus Torvalds
a63b747b41 Merge git://git.kvack.org/~bcrl/aio-next
Pull aio ctx->ring_pages migration serialization fix from Ben LaHaise.

* git://git.kvack.org/~bcrl/aio-next:
  aio: v4 ensure access to ctx->ring_pages is correctly serialised for migration
2014-04-11 16:36:50 -07:00
Linus Torvalds
3123bca719 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull second set of btrfs updates from Chris Mason:
 "The most important changes here are from Josef, fixing a btrfs
  regression in 3.14 that can cause corruptions in the extent allocation
  tree when snapshots are in use.

  Josef also fixed some deadlocks in send/recv and other assorted races
  when balance is running"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits)
  Btrfs: fix compile warnings on on avr32 platform
  btrfs: allow mounting btrfs subvolumes with different ro/rw options
  btrfs: export global block reserve size as space_info
  btrfs: fix crash in remount(thread_pool=) case
  Btrfs: abort the transaction when we don't find our extent ref
  Btrfs: fix EINVAL checks in btrfs_clone
  Btrfs: fix unlock in __start_delalloc_inodes()
  Btrfs: scrub raid56 stripes in the right way
  Btrfs: don't compress for a small write
  Btrfs: more efficient io tree navigation on wait_extent_bit
  Btrfs: send, build path string only once in send_hole
  btrfs: filter invalid arg for btrfs resize
  Btrfs: send, fix data corruption due to incorrect hole detection
  Btrfs: kmalloc() doesn't return an ERR_PTR
  Btrfs: fix snapshot vs nocow writting
  btrfs: Change the expanding write sequence to fix snapshot related bug.
  btrfs: make device scan less noisy
  btrfs: fix lockdep warning with reclaim lock inversion
  Btrfs: hold the commit_root_sem when getting the commit root during send
  Btrfs: remove transaction from send
  ...
2014-04-11 14:16:53 -07:00
David S. Miller
676d23690f net: Fix use after free by removing length arg from sk_data_ready callbacks.
Several spots in the kernel perform a sequence like:

	skb_queue_tail(&sk->s_receive_queue, skb);
	sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up.  So this skb->len access is potentially
to freed up memory.

Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.

And finally, no actual implementation of this callback actually uses
the length argument.  And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.

So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.

Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-11 16:15:36 -04:00
Theodore Ts'o
622cad1325 ext4: move ext4_update_i_disksize() into mpage_map_and_submit_extent()
The function ext4_update_i_disksize() is used in only one place, in
the function mpage_map_and_submit_extent().  Move its code to simplify
the code paths, and also move the call to ext4_mark_inode_dirty() into
the i_data_sem's critical region, to be consistent with all of the
other places where we update i_disksize.  That way, we also keep the
raw_inode's i_disksize protected, to avoid the following race:

      CPU #1                                 CPU #2

   down_write(&i_data_sem)
   Modify i_disk_size
   up_write(&i_data_sem)
                                        down_write(&i_data_sem)
                                        Modify i_disk_size
                                        Copy i_disk_size to on-disk inode
                                        up_write(&i_data_sem)
   Copy i_disk_size to on-disk inode

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2014-04-11 10:35:17 -04:00
Wang Shilong
e4fbaee292 Btrfs: fix compile warnings on on avr32 platform
fs/btrfs/scrub.c: In function 'get_raid56_logic_offset':
fs/btrfs/scrub.c:2269: warning: comparison of distinct pointer types lacks a cast
fs/btrfs/scrub.c:2269: warning: right shift count >= width of type
fs/btrfs/scrub.c:2269: warning: passing argument 1 of '__div64_32' from incompatible pointer type

Since @rot is an int type, we should not use do_div(), fix it.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-11 06:35:50 -07:00
Younger Liu
c57ab39b96 ext4: return ENOMEM rather than EIO when find_###_page() fails
Return ENOMEM rather than EIO when find_get_page() fails in
ext4_mb_get_buddy_page_lock() and find_or_create_page() fails in
ext4_mb_load_buddy().

Signed-off-by: Younger Liu <younger.liucn@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-10 23:03:43 -04:00
Namjae Jeon
1ce01c4a19 ext4: fix COLLAPSE_RANGE test failure in data journalling mode
When mounting ext4 with data=journal option, xfstest shared/002 and
shared/004 are currently failing as checksum computed for testfile
does not match with the checksum computed in other journal modes.
In case of data=journal mode, a call to filemap_write_and_wait_range
will not flush anything to disk as buffers are not marked dirty in
write_end. In collapse range this call is followed by a call to
truncate_pagecache_range. Due to this, when checksum is computed,
a portion of file is re-read from disk which replace valid data with
NULL bytes and hence the reason for the difference in checksum.

Calling ext4_force_commit before filemap_write_and_wait_range solves
the issue as it will mark the buffers dirty during commit transaction
which can be later synced by a call to filemap_write_and_wait_range.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-10 22:58:20 -04:00
Linus Torvalds
9e897e13bd Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
Pull exofs updates from Boaz Harrosh:
 "Trivial updates to exofs for 3.15-rc1

  Just a few fixes sent by people"

* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  MAINTAINERS: Update email address for bhalevy
  fs: Mark functions as static in exofs/ore_raid.c
  fs: Mark function as static in exofs/super.c
2014-04-10 14:33:02 -07:00
Harald Hoyer
0723a0473f btrfs: allow mounting btrfs subvolumes with different ro/rw options
Given the following /etc/fstab entries:

/dev/sda3 /mnt/foo btrfs subvol=foo,ro 0 0
/dev/sda3 /mnt/bar btrfs subvol=bar,rw 0 0

you can't issue:

$ mount /mnt/foo
$ mount /mnt/bar

You would have to do:

$ mount /mnt/foo
$ mount -o remount,rw /mnt/foo
$ mount --bind -o remount,ro /mnt/foo
$ mount /mnt/bar

or

$ mount /mnt/bar
$ mount --rw /mnt/foo
$ mount --bind -o remount,ro /mnt/foo

With this patch you can do

$ mount /mnt/foo
$ mount /mnt/bar

$ cat /proc/self/mountinfo
49 33 0:41 /foo /mnt/foo ro,relatime shared:36 - btrfs /dev/sda3 rw,ssd,space_cache
87 33 0:41 /bar /mnt/bar rw,relatime shared:74 - btrfs /dev/sda3 rw,ssd,space_cache

Signed-off-by: Chris Mason <clm@fb.com>
2014-04-10 13:32:50 -07:00
Linus Torvalds
dd76a786af Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A small collection of fixes that should go in before -rc1.  The pull
  request contains:

   - A two patch fix for a regression with block enabled tagging caused
     by a commit in the initial pull request.  One patch is from Martin
     and ensures that SCSI doesn't truncate 64-bit block flags, the
     other one is from me and prevents us from double using struct
     request queuelist for both completion and busy tags.  This caused
     anything from a boot crash for some, to crashes under load.

   - A blk-mq fix for a potential soft stall when hot unplugging CPUs
     with busy IO.

   - percpu_counter fix is listed in here, that caused a suspend issue
     with virtio-blk due to percpu counters having an inconsistent state
     during CPU removal.  Andrew sent this in separately a few days ago,
     but it's here.  JFYI.

   - A few fixes for block integrity from Martin.

   - A ratelimit fix for loop from Mike Galbraith, to avoid spewing too
     much in error cases"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix regression with block enabled tagging
  scsi: Make sure cmd_flags are 64-bit
  block: Ensure we only enable integrity metadata for reads and writes
  block: Fix integrity verification
  block: Fix for_each_bvec()
  drivers/block/loop.c: ratelimit error messages
  blk-mq: fix potential stall during CPU unplug with IO pending
  percpu_counter: fix bad counter state during suspend
2014-04-10 09:26:55 -07:00
Martin K. Petersen
e69f18f06b block: Ensure we only enable integrity metadata for reads and writes
We'd occasionally attempt to generate protection information for flushes
and other requests with a zero payload. Make sure we only attempt to
enable integrity for reads and writes.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-09 08:00:06 -06:00
Martin K. Petersen
0bc6997306 block: Fix integrity verification
Commit bf36f9cfa6 caused a regression by effectively reverting Nic's
fix from 5837c80e87 that ensures we traverse the full bio_vec list
upon completion.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-09 08:00:04 -06:00
Linus Torvalds
75ff24fa52 Merge branch 'for-3.15' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "Highlights:
   - server-side nfs/rdma fixes from Jeff Layton and Tom Tucker
   - xdr fixes (a larger xdr rewrite has been posted but I decided it
     would be better to queue it up for 3.16).
   - miscellaneous fixes and cleanup from all over (thanks especially to
     Kinglong Mee)"

* 'for-3.15' of git://linux-nfs.org/~bfields/linux: (36 commits)
  nfsd4: don't create unnecessary mask acl
  nfsd: revert v2 half of "nfsd: don't return high mode bits"
  nfsd4: fix memory leak in nfsd4_encode_fattr()
  nfsd: check passed socket's net matches NFSd superblock's one
  SUNRPC: Clear xpt_bc_xprt if xs_setup_bc_tcp failed
  NFSD/SUNRPC: Check rpc_xprt out of xs_setup_bc_tcp
  SUNRPC: New helper for creating client with rpc_xprt
  NFSD: Free backchannel xprt in bc_destroy
  NFSD: Clear wcc data between compound ops
  nfsd: Don't return NFS4ERR_STALE_STATEID for NFSv4.1+
  nfsd4: fix nfs4err_resource in 4.1 case
  nfsd4: fix setclientid encode size
  nfsd4: remove redundant check from nfsd4_check_resp_size
  nfsd4: use more generous NFS4_ACL_MAX
  nfsd4: minor nfsd4_replay_cache_entry cleanup
  nfsd4: nfsd4_replay_cache_entry should be static
  nfsd4: update comments with obsolete function name
  rpc: Allow xdr_buf_subsegment to operate in-place
  NFSD: Using free_conn free connection
  SUNRPC: fix memory leak of peer addresses in XPRT
  ...
2014-04-08 18:28:14 -07:00
Dan Carpenter
ffddc5fd19 fs/ncpfs/dir.c: fix indenting in ncp_lookup()
My static checker suggests adding curly braces here.  Probably that was
the intent, but actually the code works the same either way.  I've just
changed the indenting and left the code as-is.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Acked-by: Dave Chiluk <chiluk@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 16:48:53 -07:00
Joe Perches
15a03ac6f8 ncpfs/inode.c: fix mismatch printk formats and arguments
Conversions to ncp_dbg showed some format/argument mismatches so fix
them.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 16:48:53 -07:00
Joe Perches
485b47f68c ncpfs: remove now unused PRINTK macro
Uses are gone, remove the macro.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 16:48:52 -07:00
Joe Perches
e45ca8baa3 ncpfs: convert PPRINTK to ncp_vdbg
Use a more current logging style.

Convert the paranoia debug statement to vdbg.
Remove the embedded function names as dynamic_debug can do that.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 16:48:52 -07:00
Joe Perches
d3b73ca1be ncpfs: convert DPRINTK/DDPRINTK to ncp_dbg
Use a more current logging style and enable use of dynamic debugging.

Remove embedded function names, dynamic debug can add this instead.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 16:48:52 -07:00
Joe Perches
b41f8b84d0 ncpfs: Add pr_fmt and convert printks to pr_<level>
Convert to a more current logging style.

Add pr_fmt to prefix with "ncpfs: ".
Remove the embedded function names and use "%s: ", __func__

Some previously unprefixed messages now have "ncpfs: "

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 16:48:52 -07:00
Sasha Levin
e53d77eb8b autofs4: check dev ioctl size before allocating
There wasn't any check of the size passed from userspace before trying
to allocate the memory required.

This meant that userspace might request more space than allowed,
triggering an OOM.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 16:48:51 -07:00
Linus Torvalds
e9f37d3a8d Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
Pull drm updates from Dave Airlie:
 "Highlights:

   - drm:

     Generic display port aux features, primary plane support, drm
     master management fixes, logging cleanups, enforced locking checks
     (instead of docs), documentation improvements, minor number
     handling cleanup, pseudofs for shared inodes.

   - ttm:

     add ability to allocate from both ends

   - i915:

     broadwell features, power domain and runtime pm, per-process
     address space infrastructure (not enabled)

   - msm:

     power management, hdmi audio support

   - nouveau:

     ongoing GPU fault recovery, initial maxwell support, random fixes

   - exynos:

     refactored driver to clean up a lot of abstraction, DP support
     moved into drm, LVDS bridge support added, parallel panel support

   - gma500:

     SGX MMU support, SGX irq handling, asle irq work fixes

   - radeon:

     video engine bringup, ring handling fixes, use dp aux helpers

   - vmwgfx:

     add rendernode support"

* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (849 commits)
  DRM: armada: fix corruption while loading cursors
  drm/dp_helper: don't return EPROTO for defers (v2)
  drm/bridge: export ptn3460_init function
  drm/exynos: remove MODULE_DEVICE_TABLE definitions
  ARM: dts: exynos4412-trats2: enable exynos/fimd node
  ARM: dts: exynos4210-trats: enable exynos/fimd node
  ARM: dts: exynos4412-trats2: add panel node
  ARM: dts: exynos4210-trats: add panel node
  ARM: dts: exynos4: add MIPI DSI Master node
  drm/panel: add S6E8AA0 driver
  ARM: dts: exynos4210-universal_c210: add proper panel node
  drm/panel: add ld9040 driver
  panel/ld9040: add DT bindings
  panel/s6e8aa0: add DT bindings
  drm/exynos: add DSIM driver
  exynos/dsim: add DT bindings
  drm/exynos: disallow fbdev initialization if no device is connected
  drm/mipi_dsi: create dsi devices only for nodes with reg property
  drm/mipi_dsi: add flags to DSI messages
  Skip intel_crt_init for Dell XPS 8700
  ...
2014-04-08 09:52:16 -07:00
Theodore Ts'o
87f7e41636 ext4: update PF_MEMALLOC handling in ext4_write_inode()
The special handling of PF_MEMALLOC callers in ext4_write_inode()
shouldn't be necessary as there shouldn't be any. Warn about it. Also
update comment before the function as it seems somewhat outdated.

(Changes modeled on an ext3 patch posted by Jan Kara to the linux-ext4
mailing list on Februaryt 28, 2014, which apparently never went into
the ext3 tree.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
2014-04-08 11:38:28 -04:00
Linus Torvalds
a7963eb7f4 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext3 improvements, cleanups, reiserfs fix from Jan Kara:
 "various cleanups for ext2, ext3, udf, isofs, a documentation update
  for quota, and a fix of a race in reiserfs readdir implementation"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  reiserfs: fix race in readdir
  ext2: acl: remove unneeded include of linux/capability.h
  ext3: explicitly remove inode from orphan list after failed direct io
  fs/isofs/inode.c add __init to init_inodecache()
  ext3: Speedup WB_SYNC_ALL pass
  fs/quota/Kconfig: Update filesystems
  ext3: Update outdated comment before ext3_ordered_writepage()
  ext3: Update PF_MEMALLOC handling in ext3_write_inode()
  ext2/3: use prandom_u32() instead of get_random_bytes()
  ext3: remove an unneeded check in ext3_new_blocks()
  ext3: remove unneeded check in ext3_ordered_writepage()
  fs: Mark function as static in ext3/xattr_security.c
  fs: Mark function as static in ext3/dir.c
  fs: Mark function as static in ext2/xattr_security.c
  ext3: Add __init macro to init_inodecache
  ext2: Add __init macro to init_inodecache
  udf: Add __init macro to init_inodecache
  fs: udf: parse_options: blocksize check
2014-04-07 17:59:17 -07:00
Linus Torvalds
26c12d9334 Merge branch 'akpm' (incoming from Andrew)
Merge second patch-bomb from Andrew Morton:
 - the rest of MM
 - zram updates
 - zswap updates
 - exit
 - procfs
 - exec
 - wait
 - crash dump
 - lib/idr
 - rapidio
 - adfs, affs, bfs, ufs
 - cris
 - Kconfig things
 - initramfs
 - small amount of IPC material
 - percpu enhancements
 - early ioremap support
 - various other misc things

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (156 commits)
  MAINTAINERS: update Intel C600 SAS driver maintainers
  fs/ufs: remove unused ufs_super_block_third pointer
  fs/ufs: remove unused ufs_super_block_second pointer
  fs/ufs: remove unused ufs_super_block_first pointer
  fs/ufs/super.c: add __init to init_inodecache()
  doc/kernel-parameters.txt: add early_ioremap_debug
  arm64: add early_ioremap support
  arm64: initialize pgprot info earlier in boot
  x86: use generic early_ioremap
  mm: create generic early_ioremap() support
  x86/mm: sparse warning fix for early_memremap
  lglock: map to spinlock when !CONFIG_SMP
  percpu: add preemption checks to __this_cpu ops
  vmstat: use raw_cpu_ops to avoid false positives on preemption checks
  slub: use raw_cpu_inc for incrementing statistics
  net: replace __this_cpu_inc in route.c with raw_cpu_inc
  modules: use raw_cpu_write for initialization of per cpu refcount.
  mm: use raw_cpu ops for determining current NUMA node
  percpu: add raw_cpu_ops
  slub: fix leak of 'name' in sysfs_slab_add
  ...
2014-04-07 16:38:06 -07:00
Christian Engelmayer
fe4487d18f fs/ufs: remove unused ufs_super_block_third pointer
Pointer 'usb3' to struct ufs_super_block_third acquired via
ubh_get_usb_third() is never used in function
ufs_read_cylinder_structures().  Thus remove it.

Detected by Coverity: CID 139939.

Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:16 -07:00
Christian Engelmayer
48968a112c fs/ufs: remove unused ufs_super_block_second pointer
Pointer 'usb2' to struct ufs_super_block_second acquired via
ubh_get_usb_second() is never used in function ufs_statfs().  Thus
remove it.

Detected by Coverity: CID 139940.

Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:16 -07:00
Christian Engelmayer
6e0bd34c33 fs/ufs: remove unused ufs_super_block_first pointer
Remove occurences of unused pointers to struct ufs_super_block_first
that were acquired via ubh_get_usb_first().

Detected by Coverity: CID 139929 - CID 139936, CID 139940.

Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:16 -07:00
Fabian Frederick
76ee473578 fs/ufs/super.c: add __init to init_inodecache()
init_inodecache is only called by __init init_ufs_fs.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:16 -07:00
Dave Jones
16caed3196 fault-injection: set bounds on what /proc/self/make-it-fail accepts.
/proc/self/make-it-fail is a boolean, but accepts any number, including
negative ones.  Change variable to unsigned, and cap upper bound at 1.

[akpm@linux-foundation.org: don't make make_it_fail unsigned]
Signed-off-by: Dave Jones <davej@fedoraproject.org>
Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:10 -07:00
Fabian Frederick
758b444075 fs/bfs/inode.c: add __init to init_inodecache()
init_inodecache is only called by __init init_bfs_fs

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:08 -07:00
Fabian Frederick
8ca577223f affs: add mount option to avoid filename truncates
Normal behavior for filenames exceeding specific filesystem limits is to
refuse operation.

AFFS standard name length being only 30 characters against 255 for usual
Linux filesystems, original implementation does filename truncate by
default with a define value AFFS_NO_TRUNCATE which can be enabled but
needs module compilation.

This patch adds 'nofilenametruncate' mount option so that user can
easily activate that feature and avoid a lot of problems (eg overwrite
files ...)

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:08 -07:00
Fabian Frederick
d40c4d46ea fs/affs/dir.c: unlock/brelse dir on failure + code clean-up
Commit 0edf977d2a ("[readdir] convert affs") returns directly -EIO
without unlocking dir inode and releasing dir bh when second affs_bread
sequence fails.  This patch restores initial behaviour.  It also fixes
pr_debug and affs_error to fit in 80 columns + removes reference to
filldir (replaced by dir_emit in the commit above).

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:08 -07:00
Fabian Frederick
adbd319e5a affs: add __init to init_inodecache ()
init_inodecache is only called by __init init_affs_fs

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:08 -07:00
Fabian Frederick
894122db49 fs/adfs/super.c: add __init to init_inodecache()
init_inodecache is only called by __init init_adfs_fs.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:08 -07:00
WANG Chao
c4082f36fa vmcore: continue vmcore initialization if PT_NOTE is found empty
Currently when an empty PT_NOTE is detected, vmcore initialization
fails.  It sounds too harsh.  Because PT_NOTE could be empty, for
example, one offlined a cpu but never restarted kdump service, and after
crash, PT_NOTE program header is there but no data contains.  It's
better to warn about the empty PT_NOTE and continue to initialise
vmcore.

And ultimately the multiple PT_NOTE are merged into a single one, all
empty PT_NOTE are discarded naturally during the merge.  So empty
PT_NOTE is not visible to user space and vmcore is as good as expected.

Signed-off-by: WANG Chao <chaowang@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Greg Pearson <greg.pearson@hp.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:06 -07:00
Rashika Kheria
82e0703b6c include/linux/crash_dump.h: add vmcore_cleanup() prototype
Eliminate the following warning in proc/vmcore.c:

  fs/proc/vmcore.c:1088:6: warning: no previous prototype for `vmcore_cleanup' [-Wmissing-prototypes]

[akpm@linux-foundation.org: clean up powerpc, remove unneeded EXPORT_SYMBOL]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:06 -07:00
Oleg Nesterov
ad86622b47 wait: swap EXIT_ZOMBIE and EXIT_DEAD to hide EXIT_TRACE from user-space
get_task_state() uses the most significant bit to report the state to
user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition
can be noticed via /proc as Z -> X -> Z change.  Note that this was
possible even before EXIT_TRACE was introduced.

This is not really bad but imho it make sense to hide EXIT_TRACE from
user-space completely.  So the patch simply swaps EXIT_ZOMBIE and
EXIT_DEAD, this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Michal Schmidt <mschmidt@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Lennart Poettering <lpoetter@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:06 -07:00
Oleg Nesterov
23aebe1691 exec: kill bprm->tcomm[], simplify the "basename" logic
Starting from commit c4ad8f98be ("execve: use 'struct filename *' for
executable name passing") bprm->filename can not go away after
flush_old_exec(), so we do not need to save the binary name in
bprm->tcomm[] added by 96e02d1586 ("exec: fix use-after-free bug in
setup_new_exec()").

And there was never need for filename_to_taskname-like code, we can
simply do set_task_comm(kbasename(filename).

This patch has to change set_task_comm() and trace_task_rename() to
accept "const char *", but I think this change is also good.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:05 -07:00
Djalal Harouni
32ed74a4b9 procfs: make /proc/*/pagemap 0400
The /proc/*/pagemap contain sensitive information and currently its mode
is 0444.  Change this to 0400, so the VFS will prevent unprivileged
processes from getting file descriptors on arbitrary privileged
/proc/*/pagemap files.

This reduces the scope of address space leaking and bypasses by protecting
already running processes.

Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:05 -07:00
Djalal Harouni
35a35046e4 procfs: make /proc/*/{stack,syscall,personality} 0400
These procfs files contain sensitive information and currently their
mode is 0444.  Change this to 0400, so the VFS will be able to block
unprivileged processes from getting file descriptors on arbitrary
privileged /proc/*/{stack,syscall,personality} files.

This reduces the scope of ASLR leaking and bypasses by protecting already
running processes.

Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:04 -07:00
Monam Agarwal
1c44dbc82f fs/proc/inode.c: use RCU_INIT_POINTER(x, NULL)
Replace rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)

The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure.  And in the
case of the NULL pointer, there is no structure to initialize.  So,
rcu_assign_pointer(p, NULL) can be safely converted to
RCU_INIT_POINTER(p, NULL)

Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:04 -07:00
Andrey Vagin
49d063cb35 proc: show mnt_id in /proc/pid/fdinfo
Currently we don't have a way how to determing from which mount point
file has been opened.  This information is required for proper dumping
and restoring file descriptos due to presence of mount namespaces.  It's
possible, that two file descriptors are opened using the same paths, but
one fd references mount point from one namespace while the other fd --
from other namespace.

$ ls -l /proc/1/fd/1
lrwx------ 1 root root 64 Mar 19 23:54 /proc/1/fd/1 -> /dev/null

$ cat /proc/1/fdinfo/1
pos:	0
flags:	0100002
mnt_id:	16

$ cat /proc/1/mountinfo | grep ^16
16 32 0:4 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs rw,size=1013356k,nr_inodes=253339,mode=755

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Rob Landley <rob@landley.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:04 -07:00
Luiz Capitulino
f0b5664ba7 fs/proc/meminfo: meminfo_proc_show(): fix typo in comment
It should read "reclaimable slab" and not "reclaimable swap".

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:04 -07:00
Davidlohr Bueso
615d6e8756 mm: per-thread vma caching
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed.  There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma().  Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality.  On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number.  The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed.  Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question.  Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
   scheme does improve ~50% hit rate by just adding a few more slots to
   the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
   approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
   approach always shows nearly perfect hit rates, while baseline is just
   about non-existent.  The amounts of cycles can fluctuate between
   anywhere from ~60 to ~116 for the baseline scheme, but this approach
   reduces it considerably.  For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:35:53 -07:00
Kirill A. Shutemov
f1820361f8 mm: implement ->map_pages for page cache
filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.

It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:35:53 -07:00
Alex Thorlton
ab0e113f6b exec: kill the unnecessary mm->def_flags setting in load_elf_binary()
load_elf_binary() sets current->mm->def_flags = def_flags and def_flags
is always zero.  Not only this looks strange, this is unnecessary
because mm_init() has already set ->def_flags = 0.

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:35:52 -07:00
Fabian Frederick
87c1b497c2 ntfs: logging clean-up
- Convert spinlock/static array to va_format (inspired by Joe Perches
  help on previous logging patches).

- Convert printk(KERN_ERR to pr_warn in __ntfs_warning.

- Convert printk(KERN_ERR to pr_err in __ntfs_error.

- Convert printk(KERN_DEBUG to pr_debug in __ntfs_debug.  (Note that
  __ntfs_debug is still guarded by #if DEBUG)

- Improve !DEBUG to parse all arguments (Joe Perches).

- Sparse pr_foo() conversions in super.c

NTFS, NTFS-fs prefixes as well as 'warning' and 'error' were removed :
pr_foo() automatically adds module name and error level is already
specified.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:35:49 -07:00
Linus Torvalds
240cd6a817 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "The biggest chunk is a series of patches from Ilya that add support
  for new Ceph osd and crush map features, including some new tunables,
  primary affinity, and the new encoding that is needed for erasure
  coding support.  This brings things into parity with the server side
  and the looming firefly release.  There is also support for allocation
  hints in RBD that help limit fragmentation on the server side.

  There is also a series of patches from Zheng fixing NFS reexport,
  directory fragmentation support, flock vs fnctl behavior, and some
  issues with clustered MDS.

  Finally, there are some miscellaneous fixes from Yunchuan Wen for
  fscache, Fabian Frederick for ACLs, and from me for fsync(dirfd)
  behavior"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (79 commits)
  ceph: skip invalid dentry during dcache readdir
  libceph: dump pool {read,write}_tier to debugfs
  libceph: output primary affinity values on osdmap updates
  ceph: flush cap release queue when trimming session caps
  ceph: don't grabs open file reference for aborted request
  ceph: drop extra open file reference in ceph_atomic_open()
  ceph: preallocate buffer for readdir reply
  libceph: enable PRIMARY_AFFINITY feature bit
  libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
  libceph: add support for osd primary affinity
  libceph: add support for primary_temp mappings
  libceph: return primary from ceph_calc_pg_acting()
  libceph: switch ceph_calc_pg_acting() to new helpers
  libceph: introduce apply_temps() helper
  libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
  libceph: ceph_can_shift_osds(pool) and pool type defines
  libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
  libceph: enable OSDMAP_ENC feature bit
  libceph: primary_affinity decode bits
  libceph: primary_affinity infrastructure
  ...
2014-04-07 11:09:13 -07:00
Linus Torvalds
3021112598 f2fs updates for v3.15
This patch-set includes the following major enhancement patches.
  o introduce large directory support
  o introduce f2fs_issue_flush to merge redundant flush commands
  o merge write IOs as much as possible aligned to the segment
  o add sysfs entries to tune the f2fs configuration
  o use radix_tree for the free_nid_list to reduce in-memory operations
  o remove costly bit operations in f2fs_find_entry
  o enhance the readahead flow for CP/NAT/SIT/SSA blocks
 
 The other bug fixes are as follows.
  o recover xattr node blocks correctly after sudden-power-cut
  o fix to calculate the maximum number of node ids
  o enhance to handle many error cases
 
 And, there are a bunch of cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJTQiQrAAoJEEAUqH6CSFDSlbIP/iq06BrUeMDLoQFhA2GQFKFD
 wd0A5h9hCiFcKBcI/u/aAQqj/a5wdwzDl9XzH2PzJ45IM6sVGQZ0lv+kdLhab6rk
 ipNbV7G0yLAX+8ygS6GZF7pSKfMzGSGTrRvfdtoiunIip1jCY1IkUxv1XMgBSPza
 wnWYrE5HXEqRUDCqPXJyxrPmx0/0jw8/V82Ng9stnY34ySs+l/3Pvg65Kh0QuSSy
 BRjJUGlOCF68KUBKd+6YB2T5KlbQde3/5lhP+GMOi+xm5sFB+j+59r/WpJpF2Nxs
 ImxQs5GkiU01ErH/rn5FgHY/zzddQenBKwOvrjEeUA1eVpBurdsIr1JN0P6qDbgB
 ho5U8LzCQq+HZiW444eQGkXSOagpUKqDhTVJO7Fji/wG88Atc9gLX3ix8TH2skxT
 C5CvvrJM7DKBtkZyTzotKY/cWorOZhge6E/EkbGaM1sSHdK5b1Rg4YlFi9TDyz0n
 QjGD1uuvEeukeKGdIG9pjc7o5ledbMDYwLpT2RuRXenLOTsn8BqDOo9aRTg+5Kag
 tJNJLFumjPR2mEBNKjicJMUf381J/SKDwZszAz9mgvCZXldMza/Ax0LzJDJCVmkP
 UuBiVzGxVzpd33IsESUDr0J9hc+t8kS10jfAeKnE3cpb6n7/RYxstHh6CHOFKNXM
 gPUSYPN3CYiP47DnSfzA
 =eSW+
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "This patch-set includes the following major enhancement patches.
   - introduce large directory support
   - introduce f2fs_issue_flush to merge redundant flush commands
   - merge write IOs as much as possible aligned to the segment
   - add sysfs entries to tune the f2fs configuration
   - use radix_tree for the free_nid_list to reduce in-memory operations
   - remove costly bit operations in f2fs_find_entry
   - enhance the readahead flow for CP/NAT/SIT/SSA blocks

  The other bug fixes are as follows:
   - recover xattr node blocks correctly after sudden-power-cut
   - fix to calculate the maximum number of node ids
   - enhance to handle many error cases

  And, there are a bunch of cleanups"

* tag 'for-f2fs-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (62 commits)
  f2fs: fix wrong statistics of inline data
  f2fs: check the acl's validity before setting
  f2fs: introduce f2fs_issue_flush to avoid redundant flush issue
  f2fs: fix to cover io->bio with io_rwsem
  f2fs: fix error path when fail to read inline data
  f2fs: use list_for_each_entry{_safe} for simplyfying code
  f2fs: avoid free slab cache under spinlock
  f2fs: avoid unneeded lookup when xattr name length is too long
  f2fs: avoid unnecessary bio submit when wait page writeback
  f2fs: return -EIO when node id is not matched
  f2fs: avoid RECLAIM_FS-ON-W warning
  f2fs: skip unnecessary node writes during fsync
  f2fs: introduce fi->i_sem to protect fi's info
  f2fs: change reclaim rate in percentage
  f2fs: add missing documentation for dir_level
  f2fs: remove unnecessary threshold
  f2fs: throttle the memory footprint with a sysfs entry
  f2fs: avoid to drop nat entries due to the negative nr_shrink
  f2fs: call f2fs_wait_on_page_writeback instead of native function
  f2fs: introduce nr_pages_to_write for segment alignment
  ...
2014-04-07 10:55:36 -07:00
David Sterba
36523e9512 btrfs: export global block reserve size as space_info
Introduce a block group type bit for a global reserve and fill the space
info for SPACE_INFO ioctl. This should replace the newly added ioctl
(01e219e806) to get just the 'size' part
of the global reserve, while the actual usage can be now visible in the
'btrfs fi df' output during ENOSPC stress.

The unpatched userspace tools will show the blockgroup as 'unknown'.

CC: Jeff Mahoney <jeffm@suse.com>
CC: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 10:41:53 -07:00
Sergei Trofimovich
800ee2247f btrfs: fix crash in remount(thread_pool=) case
Reproducer:
    mount /dev/ubda /mnt
    mount -oremount,thread_pool=42 /mnt

Gives a crash:
    ? btrfs_workqueue_set_max+0x0/0x70
    btrfs_resize_thread_pool+0xe3/0xf0
    ? sync_filesystem+0x0/0xc0
    ? btrfs_resize_thread_pool+0x0/0xf0
    btrfs_remount+0x1d2/0x570
    ? kern_path+0x0/0x80
    do_remount_sb+0xd9/0x1c0
    do_mount+0x26a/0xbf0
    ? kfree+0x0/0x1b0
    SyS_mount+0xc4/0x110

It's a call
    btrfs_workqueue_set_max(fs_info->scrub_wr_completion_workers, new_pool_size);
with
    fs_info->scrub_wr_completion_workers = NULL;

as scrub wqs get created only on user's demand.

Patch skips not-created-yet workqueues.

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
CC: Qu Wenruo <quwenruo@cn.fujitsu.com>
CC: Chris Mason <clm@fb.com>
CC: Josef Bacik <jbacik@fb.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 10:41:52 -07:00
Linus Torvalds
c29aa153ef MTD updates for 3.15:
- A few SPI NOR ID definitions
  - Kill the NAND "max pagesize" restriction
  - Fix some x16 bus-width NAND support
  - Add NAND JEDEC parameter page support
  - DT bindings for NAND ECC
  - GPMI NAND updates (subpage reads)
  - More OMAP NAND refactoring
  - New STMicro SPI NOR driver (now in 40 patches!)
  - A few other random bugfixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJTP6x+AAoJEFySrpd9RFgtit0P/jLWsjMK8G2ldPC4bMZsXmDF
 n3c71GcCRlUq4Qzb4rtZx9DANLAh+JyRrMOKCPg6dAMegFdmUqDOpZpNp0vF57KG
 myFjqTk+n5y0tfSkWLMUFt0tQ8ArDp3IBkQCUWkD5LgG50EWmjveIQGH0kFnkE39
 Kytqvw17RV7f81tIs+WvKt8++YWD2X1VTpTi0S4fx2bJ99bJDBf/GgdoQpj2oirt
 igXmloUFEsob9JHZ3qumcUm9vaHwv2TiouZTvRyGdJCCoPdpJEZO4Ka6e4uAVarT
 6kMKXBk3lj2GsilOSFFCNetXfy5Bf0TkJkv4rDjh3R1Y4J/hSgraVCbWXdKhb6tj
 RmwesdFMjsyS4f/Rhk5PXwJgGL9uK2mi6bk/SmXU0AMgCDSa5zjshY8Wq6C6uXwk
 LqlnK8l3h8Txotbc/XJIL+QGMbMkYQI8gxWTHFaqzDtkMe36mnGs9Zec3oso/s2d
 CNRpq5+dMZ6qF0z3zpOQHmFbaOekivMy7kCKMXer6ONsrQBNwTdmkwy+SdqnWsLF
 YdJttwV/RRcE0SRvK6GrhvzkGlV83Z8RPny6hC1kbrgQ0ffoy2CmIqyWNObK1RXf
 sYqoF8TCtR6Y8rHHi5dzZ1lria+nm8pb4+UfQLRI0mK8og7YW+fIfHXxqRrZZIrt
 No8NCPBKWzyew2UE0AiQ
 =CKih
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20140405' of git://git.infradead.org/linux-mtd

Pull MTD updates from Brian Norris:
 - A few SPI NOR ID definitions
 - Kill the NAND "max pagesize" restriction
 - Fix some x16 bus-width NAND support
 - Add NAND JEDEC parameter page support
 - DT bindings for NAND ECC
 - GPMI NAND updates (subpage reads)
 - More OMAP NAND refactoring
 - New STMicro SPI NOR driver (now in 40 patches!)
 - A few other random bugfixes

* tag 'for-linus-20140405' of git://git.infradead.org/linux-mtd: (120 commits)
  Fix index regression in nand_read_subpage
  mtd: diskonchip: mem resource name is not optional
  mtd: nand: fix mention to CONFIG_MTD_NAND_ECC_BCH
  mtd: nand: fix GET/SET_FEATURES address on 16-bit devices
  mtd: omap2: Use devm_ioremap_resource()
  mtd: denali_dt: Use devm_ioremap_resource()
  mtd: devices: elm: update DRIVER_NAME as "omap-elm"
  mtd: devices: elm: configure parallel channels based on ecc_steps
  mtd: devices: elm: clean elm_load_syndrome
  mtd: devices: elm: check for hardware engine's design constraints
  mtd: st_spi_fsm: Succinctly reorganise .remove()
  mtd: st_spi_fsm: Allow loop to run at least once before giving up CPU
  mtd: st_spi_fsm: Correct vendor name spelling issue - missing "M"
  mtd: st_spi_fsm: Avoid duplicating MTD core code
  mtd: st_spi_fsm: Remove useless consts from function arguments
  mtd: st_spi_fsm: Convert ST SPI FSM (NOR) Flash driver to new DT partitions
  mtd: st_spi_fsm: Move runtime configurable msg sequences into device's struct
  mtd: st_spi_fsm: Supply the W25Qxxx chip specific configuration call-back
  mtd: st_spi_fsm: Supply the S25FLxxx chip specific configuration call-back
  mtd: st_spi_fsm: Supply the MX25xxx chip specific configuration call-back
  ...
2014-04-07 10:17:30 -07:00
Josef Bacik
c4a050bbbb Btrfs: abort the transaction when we don't find our extent ref
I'm not sure why we weren't aborting here in the first place, it is obviously a
bad time from the fact that we print the leaf and yell loudly about it.  Fix
this up, otherwise we panic because our path could be pointing into oblivion.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:51 -07:00
Chris Mason
3a29bc0928 Btrfs: fix EINVAL checks in btrfs_clone
btrfs_drop_extents can now return -EINVAL, but only one caller
in btrfs_clone was checking for it.  This adds it to the
caller for inline extents, which is where we really need it.

Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:50 -07:00
Wang Shilong
a1ecaabbf9 Btrfs: fix unlock in __start_delalloc_inodes()
This patch fix a regression caused by the following patch:
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock

break while loop will make us call @spin_unlock() without
calling @spin_lock() before, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:50 -07:00
Wang Shilong
3b080b2564 Btrfs: scrub raid56 stripes in the right way
Steps to reproduce:
 # mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5
 # mount /dev/sda8 /mnt
 # btrfs scrub start -BR /mnt
 # echo $? <--unverified errors make return value be 3

This is because we don't setup right mapping between physical
and logical address for raid56, which makes checksum mismatch.
But we will find everthing is fine later when rechecking using
btrfs_map_block().

This patch fixed the problem by settuping right mappings and
we only verify data stripes' checksums.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:49 -07:00
Wang Shilong
68bb462d42 Btrfs: don't compress for a small write
To compress a small file range(<=blocksize) that is not
an inline extent can not save disk space at all. skip it can
save us some cpu time.

This patch can also fix wrong setting nocompression flag for
inode, say a case when @total_in is 4096, and then we get
@total_compressed 52,because we do aligment to page cache size
firstly, and then we get into conclusion @total_in=@total_compressed
thus we will clear this inode's compression flag.

An exception comes from inserting inline extent failure but we
still have @total_compressed < @total_in,so we will still reset
inode's flag, this is ok, because we don't have good compression
effect.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:48 -07:00
Filipe Manana
c50d3e71c3 Btrfs: more efficient io tree navigation on wait_extent_bit
If we don't reschedule use rb_next to find the next extent state
instead of a full tree search, which is more efficient and safe
since we didn't release the io tree's lock.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:47 -07:00
Filipe Manana
c715e155c9 Btrfs: send, build path string only once in send_hole
There's no point building the path string in each iteration of the
send_hole loop, as it produces always the same string.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:46 -07:00
Gui Hecheng
9a40f1222a btrfs: filter invalid arg for btrfs resize
Originally following cmds will work:
	# btrfs fi resize -10A  <mnt>
	# btrfs fi resize -10Gaha <mnt>
Filter the arg by checking the return pointer of memparse.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:45 -07:00
Filipe Manana
766b5e5ae7 Btrfs: send, fix data corruption due to incorrect hole detection
During an incremental send, when we finish processing an inode (corresponding to
a regular file) we would assume the gap between the end of the last processed file
extent and the file's size corresponded to a file hole, and therefore incorrectly
send a bunch of zero bytes to overwrite that region in the file.

This affects only kernel 3.14.

Reproducer:

    mkfs.btrfs -f /dev/sdc
    mount /dev/sdc /mnt

    xfs_io -f -c "falloc -k 0 268435456" /mnt/foo

    btrfs subvolume snapshot -r /mnt /mnt/mysnap0

    xfs_io -c "pwrite -S 0x01 -b 9216 16190218 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x02 -b 1121 198720104 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x05 -b 9216 107887439 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x06 -b 9216 225520207 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x07 -b 67584 102138300 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x08 -b 7000 94897484 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x09 -b 113664 245083212 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x10 -b 123 17937788 123" /mnt/foo
    xfs_io -c "pwrite -S 0x11 -b 39936 229573311 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x12 -b 67584 174792222 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x13 -b 9216 249253213 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x16 -b 67584 150046083 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x17 -b 39936 118246040 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x18 -b 67584 215965442 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x19 -b 33792 97096725 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x20 -b 125952 166300596 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x21 -b 123 1078957 123" /mnt/foo
    xfs_io -c "pwrite -S 0x25 -b 9216 212044492 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x26 -b 7000 265037146 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x27 -b 42757 215922685 42757" /mnt/foo
    xfs_io -c "pwrite -S 0x28 -b 7000 69865411 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x29 -b 67584 67948958 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x30 -b 39936 266967019 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x31 -b 1121 19582453 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x32 -b 17408 257710255 17408" /mnt/foo
    xfs_io -c "pwrite -S 0x33 -b 39936 3895518 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x34 -b 125952 12045847 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x35 -b 17408 19156379 17408" /mnt/foo
    xfs_io -c "pwrite -S 0x36 -b 39936 50160066 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x37 -b 113664 9549793 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x38 -b 105472 94391506 105472" /mnt/foo
    xfs_io -c "pwrite -S 0x39 -b 23552 143632863 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x40 -b 39936 241283845 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x41 -b 113664 199937606 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x42 -b 67584 67380093 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x43 -b 67584 26793129 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x44 -b 39936 14421913 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x45 -b 123 253097405 123" /mnt/foo
    xfs_io -c "pwrite -S 0x46 -b 1121 128233424 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x47 -b 105472 91577959 105472" /mnt/foo
    xfs_io -c "pwrite -S 0x48 -b 1121 7245381 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x49 -b 113664 182414694 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x50 -b 9216 32750608 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x51 -b 67584 266546049 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x52 -b 67584 87969398 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x53 -b 9216 260848797 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x54 -b 39936 119461243 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x55 -b 7000 200178693 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x56 -b 9216 243316029 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x57 -b 7000 209658229 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x58 -b 101376 179745192 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x59 -b 9216 64012300 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x60 -b 125952 181705139 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x61 -b 23552 235737348 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x62 -b 113664 106021355 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x63 -b 67584 135753552 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x64 -b 23552 95730888 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x65 -b 11 17311415 11" /mnt/foo
    xfs_io -c "pwrite -S 0x66 -b 33792 120695553 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x67 -b 9216 17164631 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x68 -b 9216 136065853 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x69 -b 67584 37752198 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x70 -b 101376 189717473 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x71 -b 7000 227463698 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x72 -b 9216 12655137 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x73 -b 7000 7488866 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x74 -b 113664 87813649 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x75 -b 33792 25802183 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x76 -b 39936 93524024 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x77 -b 33792 113336388 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x78 -b 105472 184955320 105472" /mnt/foo
    xfs_io -c "pwrite -S 0x79 -b 101376 225691598 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x80 -b 23552 77023155 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x81 -b 11 201888192 11" /mnt/foo
    xfs_io -c "pwrite -S 0x82 -b 11 115332492 11" /mnt/foo
    xfs_io -c "pwrite -S 0x83 -b 67584 230278015 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x84 -b 11 120589073 11" /mnt/foo
    xfs_io -c "pwrite -S 0x85 -b 125952 202207819 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x86 -b 113664 86672080 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x87 -b 17408 208459603 17408" /mnt/foo
    xfs_io -c "pwrite -S 0x88 -b 7000 73372211 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x89 -b 7000 42252122 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x90 -b 23552 46784881 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x91 -b 101376 63172351 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x92 -b 23552 59341931 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x93 -b 39936 239599283 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x94 -b 67584 175643105 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x97 -b 23552 105534880 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x98 -b 113664 8236844 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x99 -b 125952 144489686 125952" /mnt/foo
    xfs_io -c "pwrite -S 0xa0 -b 7000 73273112 7000" /mnt/foo
    xfs_io -c "pwrite -S 0xa1 -b 125952 194580243 125952" /mnt/foo
    xfs_io -c "pwrite -S 0xa2 -b 123 56296779 123" /mnt/foo
    xfs_io -c "pwrite -S 0xa3 -b 11 233066845 11" /mnt/foo
    xfs_io -c "pwrite -S 0xa4 -b 39936 197727090 39936" /mnt/foo
    xfs_io -c "pwrite -S 0xa5 -b 101376 53579812 101376" /mnt/foo
    xfs_io -c "pwrite -S 0xa6 -b 9216 85669738 9216" /mnt/foo
    xfs_io -c "pwrite -S 0xa7 -b 125952 21266322 125952" /mnt/foo
    xfs_io -c "pwrite -S 0xa8 -b 23552 125726568 23552" /mnt/foo
    xfs_io -c "pwrite -S 0xa9 -b 9216 18423680 9216" /mnt/foo
    xfs_io -c "pwrite -S 0xb0 -b 1121 165901483 1121" /mnt/foo

    btrfs subvolume snapshot -r /mnt /mnt/mysnap1

    xfs_io -c "pwrite -S 0xff -b 10 16190218 10" /mnt/foo

    btrfs subvolume snapshot -r /mnt /mnt/mysnap2

    md5sum /mnt/foo          # returns 79e53f1466bfc09fd82b450689e6119e
    md5sum /mnt/mysnap2/foo  # returns 79e53f1466bfc09fd82b450689e6119e too

    btrfs send /mnt/mysnap1 -f /tmp/1.snap
    btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap

    mkfs.btrfs -f /dev/sdc
    mount /dev/sdc /mnt

    btrfs receive /mnt -f /tmp/1.snap
    btrfs receive /mnt -f /tmp/2.snap

    md5sum /mnt/mysnap2/foo  # returns 2bb414c5155767cedccd7063e51beabd !!

A testcase for xfstests follows soon too.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:45 -07:00
Dan Carpenter
84dbeb87d1 Btrfs: kmalloc() doesn't return an ERR_PTR
The error handling was copy and pasted from memdup_user().  It should be
checking for NULL obviously.

Fixes: abccd00f8a ('btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:44 -07:00
Wang Shilong
e9894fd3e3 Btrfs: fix snapshot vs nocow writting
While running fsstress and snapshots concurrently, we will hit something
like followings:

Thread 1			Thread 2

|->fallocate
  |->write pages
    |->join transaction
       |->add ordered extent
    |->end transaction
				|->flushing data
				  |->creating pending snapshots
|->write data into src root's
   fallocated space

After above work flows finished, we will get a state that source and
snapshot root share same space, but source root have written data into
fallocated space, this will make fsck fail to verify checksums for
snapshot root's preallocating file extent data.Nocow writting also
has this same problem.

Fix this problem by syncing snapshots with nocow writting:

 1.for nocow writting,if there are pending snapshots, we will
 fall into COW way.

 2.if there are pending nocow writes, snapshots for this root
 will be blocked until nocow writting finish.

Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:43 -07:00
Qu Wenruo
3ac0d7b96a btrfs: Change the expanding write sequence to fix snapshot related bug.
When testing fsstress with snapshot making background, some snapshot
following problem.

Snapshot 270:
inode 323: size 0

Snapshot 271:
inode 323: size 349145
|-------Hole---|---------Empty gap-------|-------Hole-----|
0	    122880			172032	      349145

Snapshot 272:
inode 323: size 349145
|-------Hole---|------------Data---------|-------Hole-----|
0	    122880			172032	      349145

The fsstress operation on inode 323 is the following:
write: 		offset 	126832 	len 43124
truncate: 	size 	349145

Since the write with offset is consist of 2 operations:
1. punch hole
2. write data
Hole punching is faster than data write, so hole punching in write
and truncate is done first and then buffered write, so the snapshot 271 got
empty gap, which will not pass btrfsck.

To fix the bug, this patch will change the write sequence which will
first punch a hole covering the write end if a hole is needed.

Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:42 -07:00
David Sterba
60999ca4b4 btrfs: make device scan less noisy
Print the message only when the device is seen for the first time.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:41 -07:00
Jeff Mahoney
ed55b6ac07 btrfs: fix lockdep warning with reclaim lock inversion
When encountering memory pressure, testers have run into the following
lockdep warning. It was caused by __link_block_group calling kobject_add
with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL,
which gets us into reclaim context. The kobject doesn't actually need
to be added under the lock -- it just needs to ensure that it's only
added for the first block group to be linked.

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.14.0-rc8-default #1 Not tainted
---------------------------------------------------------
kswapd0/169 just changed the state of lock:
 (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa018baea>] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]
but this lock took another, RECLAIM_FS-unsafe lock in the past:
 (&found->groups_sem){+++++.}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
 Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&found->groups_sem);
                               local_irq_disable();
                               lock(&delayed_node->mutex);
                               lock(&found->groups_sem);
  <Interrupt>
    lock(&delayed_node->mutex);

 *** DEADLOCK ***
2 locks held by kswapd0/169:
 #0:  (shrinker_rwsem){++++..}, at: [<ffffffff81159e8a>] shrink_slab+0x3a/0x160
 #1:  (&type->s_umount_key#27){++++..}, at: [<ffffffff811bac6f>] grab_super_passive+0x3f/0x90

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:40 -07:00
Josef Bacik
3f8a18cc53 Btrfs: hold the commit_root_sem when getting the commit root during send
We currently rely too heavily on roots being read-only to save us from just
accessing root->commit_root.  We can easily balance blocks out from underneath a
read only root, so to save us from getting screwed make sure we only access
root->commit_root under the commit root sem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:39 -07:00
Jan Kara
ec4cb1aa2b ext4: fix jbd2 warning under heavy xattr load
When heavily exercising xattr code the assertion that
jbd2_journal_dirty_metadata() shouldn't return error was triggered:

WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/fs/jbd2/transaction.c:1237
jbd2_journal_dirty_metadata+0x1ba/0x260()

CPU: 0 PID: 8877 Comm: ceph-osd Tainted: G    W 3.10.0-ceph-00049-g68d04c9 #1
Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
 ffffffff81a1d3c8 ffff880214469928 ffffffff816311b0 ffff880214469968
 ffffffff8103fae0 ffff880214469958 ffff880170a9dc30 ffff8802240fbe80
 0000000000000000 ffff88020b366000 ffff8802256e7510 ffff880214469978
Call Trace:
 [<ffffffff816311b0>] dump_stack+0x19/0x1b
 [<ffffffff8103fae0>] warn_slowpath_common+0x70/0xa0
 [<ffffffff8103fb2a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff81267c2a>] jbd2_journal_dirty_metadata+0x1ba/0x260
 [<ffffffff81245093>] __ext4_handle_dirty_metadata+0xa3/0x140
 [<ffffffff812561f3>] ext4_xattr_release_block+0x103/0x1f0
 [<ffffffff81256680>] ext4_xattr_block_set+0x1e0/0x910
 [<ffffffff8125795b>] ext4_xattr_set_handle+0x38b/0x4a0
 [<ffffffff810a319d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff81257b32>] ext4_xattr_set+0xc2/0x140
 [<ffffffff81258547>] ext4_xattr_user_set+0x47/0x50
 [<ffffffff811935ce>] generic_setxattr+0x6e/0x90
 [<ffffffff81193ecb>] __vfs_setxattr_noperm+0x7b/0x1c0
 [<ffffffff811940d4>] vfs_setxattr+0xc4/0xd0
 [<ffffffff8119421e>] setxattr+0x13e/0x1e0
 [<ffffffff811719c7>] ? __sb_start_write+0xe7/0x1b0
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118c65c>] ? fget_light+0x3c/0x130
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118f1f8>] ? __mnt_want_write+0x58/0x70
 [<ffffffff811946be>] SyS_fsetxattr+0xbe/0x100
 [<ffffffff816407c2>] system_call_fastpath+0x16/0x1b

The reason for the warning is that buffer_head passed into
jbd2_journal_dirty_metadata() didn't have journal_head attached. This is
caused by the following race of two ext4_xattr_release_block() calls:

CPU1                                CPU2
ext4_xattr_release_block()          ext4_xattr_release_block()
lock_buffer(bh);
/* False */
if (BHDR(bh)->h_refcount == cpu_to_le32(1))
} else {
  le32_add_cpu(&BHDR(bh)->h_refcount, -1);
  unlock_buffer(bh);
                                    lock_buffer(bh);
                                    /* True */
                                    if (BHDR(bh)->h_refcount == cpu_to_le32(1))
                                      get_bh(bh);
                                      ext4_free_blocks()
                                        ...
                                        jbd2_journal_forget()
                                          jbd2_journal_unfile_buffer()
                                          -> JH is gone
  error = ext4_handle_dirty_xattr_block(handle, inode, bh);
  -> triggers the warning

We fix the problem by moving ext4_handle_dirty_xattr_block() under the
buffer lock. Sadly this cannot be done in nojournal mode as that
function can call sync_dirty_buffer() which would deadlock. Luckily in
nojournal mode the race is harmless (we only dirty already freed buffer)
and thus for nojournal mode we leave the dirtying outside of the buffer
lock.

Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-04-07 10:54:21 -04:00
Matthew Wilcox
9503c67c93 ext4: note the error in ext4_end_bio()
ext4_end_bio() currently throws away the error that it receives.  Chances
are this is part of a spate of errors, one of which will end up getting
the error returned to userspace somehow, but we shouldn't take that risk.
Also print out the errno to aid in debug.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2014-04-07 10:54:20 -04:00
Azat Khuzhin
007649375f ext4: initialize multi-block allocator before checking block descriptors
With EXT4FS_DEBUG ext4_count_free_clusters() will call
ext4_read_block_bitmap() without s_group_info initialized, so we need to
initialize multi-block allocator before.

And dependencies that must be solved, to allow this:
- multi-block allocator needs in group descriptors
- need to install s_op before initializing multi-block allocator,
  because in ext4_mb_init_backend() new inode is created.
- initialize number of group desc blocks (s_gdb_count) otherwise
  number of clusters returned by ext4_free_clusters_after_init() is not correct.
  (see ext4_bg_num_gdb_nometa())

Here is the stack backtrace:

(gdb) bt
 #0  ext4_get_group_info (group=0, sb=0xffff880079a10000) at ext4.h:2430
 #1  ext4_validate_block_bitmap (sb=sb@entry=0xffff880079a10000,
     desc=desc@entry=0xffff880056510000, block_group=block_group@entry=0,
     bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:358
 #2  0xffffffff81232202 in ext4_wait_block_bitmap (sb=sb@entry=0xffff880079a10000,
     block_group=block_group@entry=0,
     bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:476
 #3  0xffffffff81232eaf in ext4_read_block_bitmap (sb=sb@entry=0xffff880079a10000,
     block_group=block_group@entry=0) at balloc.c:489
 #4  0xffffffff81232fc0 in ext4_count_free_clusters (sb=sb@entry=0xffff880079a10000) at balloc.c:665
 #5  0xffffffff81259ffa in ext4_check_descriptors (first_not_zeroed=<synthetic pointer>,
     sb=0xffff880079a10000) at super.c:2143
 #6  ext4_fill_super (sb=sb@entry=0xffff880079a10000, data=<optimized out>,
     data@entry=0x0 <irq_stack_union>, silent=silent@entry=0) at super.c:3851
     ...

Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-07 10:54:20 -04:00
Kazuya Mio
4adb6ab3e0 ext4: FIBMAP ioctl causes BUG_ON due to handle EXT_MAX_BLOCKS
When we try to get 2^32-1 block of the file which has the extent
(ee_block=2^32-2, ee_len=1) with FIBMAP ioctl, it causes BUG_ON
in ext4_ext_put_gap_in_cache().

To avoid the problem, ext4_map_blocks() needs to check the file logical block
number. ext4_ext_put_gap_in_cache() called via ext4_map_blocks() cannot
handle 2^32-1 because the maximum file logical block number is 2^32-2.

Note that ext4_ind_map_blocks() returns -EIO when the block number is invalid.
So ext4_map_blocks() should also return the same errno.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-04-07 10:53:28 -04:00
Chen Gang
666525dfbd ext4: fix 64-bit number truncation warning
'0x7FDEADBEEF' will be truncated to 32-bit number under unicore32. Need
append 'ULL' for it.

The related warning (with allmodconfig under unicore32):

    CC [M]  fs/ext4/extents_status.o
  fs/ext4/extents_status.c: In function "__es_remove_extent":
  fs/ext4/extents_status.c:813: warning: integer constant is too large for "long" type

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-07 10:18:56 -04:00
Chao Yu
48b230a583 f2fs: fix wrong statistics of inline data
If we remove a file that has inline data after mount, our statistics turns to
inaccurate.

cat /sys/kernel/debug/f2fs/status
  - Inline_data Inode: 4294967295

Let's add stat_inc_inline_inode() to stat inline info of the file when lookup.

Change log from v1:
 o stat in f2fs_lookup() instead of in do_read_inode() for excluding wrong stat.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-07 12:40:58 +09:00
ZhangZhen
3a8861e271 f2fs: check the acl's validity before setting
Before setting the acl, call posix_acl_valid() to check if it is
valid or not.

Signed-off-by: zhangzhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-07 12:18:30 +09:00
Jaegeuk Kim
6b4afdd794 f2fs: introduce f2fs_issue_flush to avoid redundant flush issue
Some storage devices show relatively high latencies to complete cache_flush
commands, even though their normal IO speed is prettry much high. In such
the case, it needs to merge cache_flush commands as much as possible to avoid
issuing them redundantly.
So, this patch introduces a mount option, "-o flush_merge", to mitigate such
the overhead.

If this option is enabled by user, F2FS merges the cache_flush commands and then
issues just one cache_flush on behalf of them. Once the single command is
finished, F2FS sends a completion signal to all the pending threads.

Note that, this option can be used under a workload consisting of very intensive
concurrent fsync calls, while the storage handles cache_flush commands slowly.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-07 09:50:58 +09:00
Josef Bacik
9e351cc862 Btrfs: remove transaction from send
Lets try this again.  We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send.  So fix this by making sure looking at the
commit roots is always going to be consistent.  We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once.  Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes.  With this patch we no longer
deadlock and pass all the weird send/receive corner cases.  Thanks,

Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06 17:39:30 -07:00
Josef Bacik
a26e8c9f75 Btrfs: don't clear uptodate if the eb is under IO
So I have an awful exercise script that will run snapshot, balance and
send/receive in parallel.  This sometimes would crash spectacularly and when it
came back up the fs would be completely hosed.  Turns out this is because of a
bad interaction of balance and send/receive.  Send will hold onto its entire
path for the whole send, but its blocks could get relocated out from underneath
it, and because it doesn't old tree locks theres nothing to keep this from
happening.  So it will go to read in a slot with an old transid, and we could
have re-allocated this block for something else and it could have a completely
different transid.  But because we think it is invalid we clear uptodate and
re-read in the block.  If we do this before we actually write out the new block
we could write back stale data to the fs, and boom we're screwed.

Now we definitely need to fix this disconnect between send and balance, but we
really really need to not allow ourselves to accidently read in stale data over
new data.  So make sure we check if the extent buffer is not under io before
clearing uptodate, this will kick back EIO to the caller instead of reading in
stale data and keep us from corrupting the fs.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06 17:34:37 -07:00
Josef Bacik
573a075567 Btrfs: check for an extent_op on the locked ref
We could have possibly added an extent_op to the locked_ref while we dropped
locked_ref->lock, so check for this case as well and loop around.  Otherwise we
could lose flag updates which would lead to extent tree corruption.  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06 17:34:36 -07:00
Josef Bacik
ba8b028933 Btrfs: do not reset last_snapshot after relocation
This was done to allow NO_COW to continue to be NO_COW after relocation but it
is not right.  When relocating we will convert blocks to FULL_BACKREF that we
relocate.  We can leave some of these full backref blocks behind if they are not
cow'ed out during the relocation, like if we fail the relocation with ENOSPC and
then just drop the reloc tree.  Then when we go to cow the block again we won't
lookup the extent flags because we won't think there has been a snapshot
recently which means we will do our normal ref drop thing instead of adding back
a tree ref and dropping the shared ref.  This will cause btrfs_free_extent to
blow up because it can't find the ref we are trying to free.  This was found
with my ref verifying tool.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06 17:34:35 -07:00
Linus Torvalds
2b3a8fd735 NFS client updates for Linux 3.15
Highlights include:
 
 - Stable fix for a use after free issue in the NFSv4.1 open code
 - Fix the SUNRPC bi-directional RPC code to account for TCP segmentation
 - Optimise usage of readdirplus when confronted with 'ls -l' situations
 - Soft mount bugfixes
 - NFS over RDMA bugfixes
 - NFSv4 close locking fixes
 - Various NFSv4.x client state management optimisations
 - Rename/unlink code cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTQBayAAoJEGcL54qWCgDyUzgQAKzSlbcksMQT55M/KZJXabNW
 KSctJeDrkTkRxOXTNxuF9NbIgeqenLijCokXty6BIUgup0zkOPMzFfRfgdQvplnp
 YEj4sOEXEZ8CX+PoUTYOEayzt0ssEAOyidumiM+Gx2LD/E1d2xyCL7YaAOjIhVQS
 OnXcX1cZw+dZSUxC9vu5fVDjrphJTnp4CXdbvR5PiJiXeKqzZd9e5M3hXgpAQ/AS
 mWjYeUvM9mwyz7UmbLKkWEmzB3tFlGdTzDPxLRrkfcOSKI2Ham0lL3/Uv50/nRTu
 99ts6KH8KLGcUuL9vD9KRebht2f71usBrWAdvpy1cUcf1Fh6lmEg4ktGfkqldaUu
 9kNu9d5DCxJoGc6R2UTw5FeyPwYuDWoBwEGy1DcguJ5CeQn2R2nH4ps/P3J3DX4d
 DZsJqCY9idKZCQhtyR0iF9j3x2bNFoENaL6WHI6b0J+xjMedIbHgeUQzIQP0RLBJ
 h0IcjK0D+e7WdyC7jk4Nm3krtms5SNUG5/N9OUO36a7v8735PJBcbcgm9hZJt8Fh
 t/4vqUmKIBXHioHsMhaFslqTWlYIR9a3MYmN7QtHFYbqUfNxH69v9y3d6jb4Igck
 kqoEiui5aJOCR76s7oVdHCcm+klBwEPiACT+H9CUMzSoKzHSWsBSNZbJR3BEia4M
 7dwScS1OfI2KuutshGQA
 =weNx
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

   - Stable fix for a use after free issue in the NFSv4.1 open code
   - Fix the SUNRPC bi-directional RPC code to account for TCP segmentation
   - Optimise usage of readdirplus when confronted with 'ls -l' situations
   - Soft mount bugfixes
   - NFS over RDMA bugfixes
   - NFSv4 close locking fixes
   - Various NFSv4.x client state management optimisations
   - Rename/unlink code cleanups"

* tag 'nfs-for-3.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
  nfs: pass string length to pr_notice message about readdir loops
  NFSv4: Fix a use-after-free problem in open()
  SUNRPC: rpc_restart_call/rpc_restart_call_prepare should clear task->tk_status
  SUNRPC: Don't let rpc_delay() clobber non-timeout errors
  SUNRPC: Ensure call_connect_status() deals correctly with SOFTCONN tasks
  SUNRPC: Ensure call_status() deals correctly with SOFTCONN tasks
  NFSv4: Ensure we respect soft mount timeouts during trunking discovery
  NFSv4: Schedule recovery if nfs40_walk_client_list() is interrupted
  NFS: advertise only supported callback netids
  SUNRPC: remove KERN_INFO from dprintk() call sites
  SUNRPC: Fix large reads on NFS/RDMA
  NFS: Clean up: revert increase in READDIR RPC buffer max size
  SUNRPC: Ensure that call_bind times out correctly
  SUNRPC: Ensure that call_connect times out correctly
  nfs: emit a fsnotify_nameremove call in sillyrename codepath
  nfs: remove synchronous rename code
  nfs: convert nfs_rename to use async_rename infrastructure
  nfs: make nfs_async_rename non-static
  nfs: abstract out code needed to complete a sillyrename
  NFSv4: Clear the open state flags if the new stateid does not match
  ...
2014-04-06 10:09:38 -07:00
Linus Torvalds
6f4c98e1c2 Nothing major: the stricter permissions checking for sysfs broke
a staging driver; fix included.  Greg KH said he'd take the patch
 but hadn't as the merge window opened, so it's included here
 to avoid breaking build.
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJTQMH9AAoJENkgDmzRrbjxo4UP/jwlenP44v+RFpo/dn8Z8E2n
 SREQscU5ZZKvuyFD6kUdvOz8YC/nTrJvXoVkMUF05GVbuvb8/8UPtT9ECVemd0rW
 xNy4aFfv9rbrqRLBLpLK9LAgTuhwlbTgGxgL78zRn3hWmf1hBZWCY+cEvKM8l/+9
 oEQdORL0sUpZh7iryAeGqbOrXT4gqJEvSLOFwiYTSo6ryzWIilmdXSUAh6s8MIEX
 PR1+oH9J8B6J29lcXKMf8/sDI1EBUeSLdBmMCuN5Y7xpYxsQLroVx94kPbdBY+XK
 ZRoYuUGSUJfGRZY46cFKApIGeF07z1DGoyXghbSWEQrI+23TMUmrKUg47LSukE4Y
 yCUf8HAtqIA3gVc9GKDdSp/2UpkAhTTv5ogKgnIzs1InWtOIBdDRSVUQXDosFEXw
 6ZZe1pQs2zfXyXxO4j0Wq36K4RgI0aqOVw+dcC+w5BidjVylgnYRV0PSDd72tid7
 bIfnjDbUBo+o4LanPNGYK474KyO7AslgTE50w6zwbJzgdwCQ36hCpKqScBZzm60a
 42LrgTVoIHHWAL1tDzWL/LzWflZGdJAezzNje0/f2Q3bGMiNHWoljAvUphkTZ7qt
 E8+jWqmM+riH3e8Y5wKpO1BKt7NGHISEy//bUlnqTwisjIzVILZ6VjfugQ1AI+0x
 llTXPBotFvfvXqxunBg7
 =yzUO
 -----END PGP SIGNATURE-----

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module updates from Rusty Russell:
 "Nothing major: the stricter permissions checking for sysfs broke a
  staging driver; fix included.  Greg KH said he'd take the patch but
  hadn't as the merge window opened, so it's included here to avoid
  breaking build"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  staging: fix up speakup kobject mode
  Use 'E' instead of 'X' for unsigned module taint flag.
  VERIFY_OCTAL_PERMISSIONS: stricter checking for sysfs perms.
  kallsyms: fix percpu vars on x86-64 with relocation.
  kallsyms: generalize address range checking
  module: LLVMLinux: Remove unused function warning from __param_check macro
  Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE
  module: remove MODULE_GENERIC_TABLE
  module: allow multiple calls to MODULE_DEVICE_TABLE() per module
  module: use pr_cont
2014-04-06 09:38:07 -07:00
Yan, Zheng
a30be7cb2c ceph: skip invalid dentry during dcache readdir
skip dentries that were added before MDS issued FILE_SHARED to
client.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-06 09:13:14 -07:00
Jeff Layton
9581a4ae75 nfs: pass string length to pr_notice message about readdir loops
There is no guarantee that the strings in the nfs_cache_array will be
NULL-terminated. In the event that we end up hitting a readdir loop, we
need to ensure that we pass the warning message the length of the
string.

Reported-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-04-05 09:25:42 -04:00
Yan, Zheng
a56371d9d9 ceph: flush cap release queue when trimming session caps
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:08:26 -07:00
Yan, Zheng
4819301287 ceph: don't grabs open file reference for aborted request
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:08:25 -07:00
Yan, Zheng
ab866549b3 ceph: drop extra open file reference in ceph_atomic_open()
ceph_atomic_open() calls ceph_open() after receiving the MDS reply.
ceph_open() grabs an extra open file reference. (The open request
already holds an open file reference)

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:08:23 -07:00
Yan, Zheng
54008399dc ceph: preallocate buffer for readdir reply
Preallocate buffer for readdir reply. Limit number of entries in
readdir reply according to the buffer size.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:08:22 -07:00
Yan, Zheng
cc48c3e85f ceph: don't include ceph.{file,dir}.layout vxattr in listxattr()
This avoids 'cp -a' modifying layout of new files/directories.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:07:21 -07:00
Yan, Zheng
1e5c6649ff ceph: check buffer size in ceph_vxattrcb_layout()
If buffer size is zero, return the size of layout vxattr. If buffer
size is not zero, check if it is large enough for layout vxattr.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:07:19 -07:00
Yan, Zheng
00bd8edb86 ceph: fix null pointer dereference in discard_cap_releases()
send_mds_reconnect() may call discard_cap_releases() after all
release messages have been dropped by cleanup_cap_releases()

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-04 21:07:17 -07:00
Fabian Frederick
5f75ce5781 ceph: Remove get/set acl on symlinks
Remove unsupported symlink operations.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-04-04 21:07:14 -07:00
Yan, Zheng
d9ffc4f770 ceph: set mds_wanted when MDS reply changes a cap to auth cap
When adjusting caps client wants, MDS does not record caps that are
not allowed. For non-auth MDS, it does not record WR caps. So when
a MDS reply changes a non-auth cap to auth cap, client needs to set
cap's mds_wanted according to the reply.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:07:12 -07:00
Yan, Zheng
eb13e832f8 ceph: use fl->fl_file as owner identifier of flock and posix lock
flock and posix lock should use fl->fl_file instead of process ID
as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner
is usually equal to fl->fl_file, but it also can be a customized
value). The process ID of who holds the lock is just for F_GETLK
fcntl(2).

The fix is rename the 'pid' fields of struct ceph_mds_request_args
and struct ceph_filelock to 'owner', rename 'pid_namespace' fields
to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages.
We also set the most significant bit of the 'owner' field. MDS can
use that bit to distinguish between old and new clients.

The MDS counterpart of this patch modifies the flock code to not
take the 'pid_namespace' into consideration when checking conflict
locks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-04 21:07:11 -07:00
Yan, Zheng
eb70c0ce4e ceph: forbid mandatory file lock
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:07:09 -07:00
Yan, Zheng
0e8e95d6d7 ceph: use fl->fl_type to decide flock operation
VFS does not directly pass flock's operation code to filesystem's
flock callback. It translates the operation code to the form how
posix lock's parameters are presented.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:07:08 -07:00
Yan, Zheng
8c93cd610c ceph: update i_max_size even if inode version does not change
handle following sequence of events:
 - client releases a inode with i_max_size > 0. The release message
   is queued. (is not sent to the auth MDS)
 - a 'lookup' request reply from non-auth MDS returns the same inode.
 - client opens the inode in write mode. The version of inode trace
   in 'open' request reply is equal to the cached inode's version.
 - client requests new max size. The MDS ignores the request because
   it does not affect client's write range

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-04 21:07:06 -07:00
Yan, Zheng
a255060451 ceph: make sure write caps are registered with auth MDS
Only auth MDS can issue write caps to clients, so don't consider
write caps registered with non-auth MDS as valid.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:07:05 -07:00
Linus Torvalds
d15e03104e xfs: update for 3.15-rc1
The main changes in the XFS tree for 3.15-rc1 are:
 
         - O_TMPFILE support
         - allowing AIO+DIO writes beyond EOF
         - FALLOC_FL_COLLAPSE_RANGE support for fallocate syscall and XFS
           implementation
         - FALLOC_FL_ZERO_RANGE support for fallocate syscall and XFS
           implementation
         - IO verifier cleanup and rework
         - stack usage reduction changes
         - vm_map_ram NOIO context fixes to remove lockdep warings
         - various bug fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJTPykMAAoJEK3oKUf0dfod/KoP/jKQwzQPdtT8EtAu5vENh9AO
 55zwCDXXFjCNIGIFPkrUBQbbARVAqhLZn3vuLUUhqtRRELdgJy/yFKZ37MPd8bhU
 dKetivEB192Jcd6Sn74vsOsNLm1u9mJqbQ1aothz0TiOrkkWFZlz4Otu36MZRHN3
 9WgZXWSxr6I/hYHGyCorJWZ5ISs0XD3vR5dYXYeZChbTpTxlCT4X/YgUtW4WH/Tq
 y4gG0fKfwr9KK07/LXuQgUuZGU8vwVuNNsXPhqh+FZ39SLD2Ea83h46Hzf/+vVNI
 kCIyYN1y40uBWczmwAptVEnUwhpGK8PzNrhKwTZICDtuct9sikf7c+o0aEE9lcqo
 8IBt0Dy4l7BQVFSZOjYo5Jw5a8jAbkh47zru31HxogEVqafdz80iWB12JagOOnXM
 v/McvDvZMyfgGckih32FM4G7ElvTYgGai5/3dLhfMuhc4/DdwcBOF1yHmFmnjhWO
 QRsQxLdefUtP3MfMYKaJHM6v2wE1S2l0owgp+HdPluNiOUmH/fqFq1WpHxqqeRPz
 nuHF8oYlxaZP5WAarz6Yf1/twIeZJ1rTD8np8uocvMqQJzMYJgrQyH+xJqjJaITR
 iveQcEoRB8D7/fXMGDdcjZYE2fG4l4JE2kuh97k5NZw76e3v2YXSGh0kd9WqR1uN
 t07joLRQKR2pJuSmuD5E
 =uSkJ
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs update from Dave Chinner:
 "There are a couple of new fallocate features in this request - it was
  decided that it was easiest to push them through the XFS tree using
  topic branches and have the ext4 support be based on those branches.
  Hence you may see some overlap with the ext4 tree merge depending on
  how they including those topic branches into their tree.  Other than
  that, there is O_TMPFILE support, some cleanups and bug fixes.

  The main changes in the XFS tree for 3.15-rc1 are:

   - O_TMPFILE support
   - allowing AIO+DIO writes beyond EOF
   - FALLOC_FL_COLLAPSE_RANGE support for fallocate syscall and XFS
     implementation
   - FALLOC_FL_ZERO_RANGE support for fallocate syscall and XFS
     implementation
   - IO verifier cleanup and rework
   - stack usage reduction changes
   - vm_map_ram NOIO context fixes to remove lockdep warings
   - various bug fixes and cleanups"

* tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs: (34 commits)
  xfs: fix directory hash ordering bug
  xfs: extra semi-colon breaks a condition
  xfs: Add support for FALLOC_FL_ZERO_RANGE
  fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
  xfs: inode log reservations are still too small
  xfs: xfs_check_page_type buffer checks need help
  xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation
  xfs: use NOIO contexts for vm_map_ram
  xfs: don't leak EFSBADCRC to userspace
  xfs: fix directory inode iolock lockdep false positive
  xfs: allocate xfs_da_args to reduce stack footprint
  xfs: always do log forces via the workqueue
  xfs: modify verifiers to differentiate CRC from other errors
  xfs: print useful caller information in xfs_error_report
  xfs: add xfs_verifier_error()
  xfs: add helper for updating checksums on xfs_bufs
  xfs: add helper for verifying checksums on xfs_bufs
  xfs: Use defines for CRC offsets in all cases
  xfs: skip pointless CRC updates after verifier failures
  xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
  ...
2014-04-04 15:50:08 -07:00
Linus Torvalds
24e7ea3bea Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
 in the jbd2 layer and in xattr handling when the extended attributes
 spill over into an external block.
 
 Other than that, the usual clean ups and minor bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTPbD2AAoJENNvdpvBGATwDmUQANSfGYIQazB8XKKgtNTMiG/Y
 Ky7n1JzN9lTX/6nMsqQnbfCweLRmxqpWUBuyKDRHUi8IG0/voXSTFsAOOgz0R15A
 ERRRWkVvHixLpohuL/iBdEMFHwNZYPGr3jkm0EIgzhtXNgk5DNmiuMwvHmCY27kI
 kdNZIw9fip/WRNoFLDBGnLGC37aanoHhCIbVlySy5o9LN1pkC8BgXAYV0Rk19SVd
 bWCudSJEirFEqWS5H8vsBAEm/ioxTjwnNL8tX8qms6orZ6h8yMLFkHoIGWPw3Q15
 a0TSUoMyav50Yr59QaDeWx9uaPQVeK41wiYFI2rZOnyG2ts0u0YXs/nLwJqTovgs
 rzvbdl6cd3Nj++rPi97MTA7iXK96WQPjsDJoeeEgnB0d/qPyTk6mLKgftzLTNgSa
 ZmWjrB19kr6CMbebMC4L6eqJ8Fr66pCT8c/iue8wc4MUHi7FwHKH64fqWvzp2YT/
 +165dqqo2JnUv7tIp6sUi1geun+bmDHLZFXgFa7fNYFtcU3I+uY1mRr3eMVAJndA
 2d6ASe/KhQbpVnjKJdQ8/b833ZS3p+zkgVPrd68bBr3t7gUmX91wk+p1ct6rUPLr
 700F+q/pQWL8ap0pU9Ht/h3gEJIfmRzTwxlOeYyOwDseqKuS87PSB3BzV3dDunSU
 DrPKlXwIgva7zq5/S0Vr
 =4s1Z
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Major changes for 3.14 include support for the newly added ZERO_RANGE
  and COLLAPSE_RANGE fallocate operations, and scalability improvements
  in the jbd2 layer and in xattr handling when the extended attributes
  spill over into an external block.

  Other than that, the usual clean ups and minor bug fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
  ext4: fix premature freeing of partial clusters split across leaf blocks
  ext4: remove unneeded test of ret variable
  ext4: fix comment typo
  ext4: make ext4_block_zero_page_range static
  ext4: atomically set inode->i_flags in ext4_set_inode_flags()
  ext4: optimize Hurd tests when reading/writing inodes
  ext4: kill i_version support for Hurd-castrated file systems
  ext4: each filesystem creates and uses its own mb_cache
  fs/mbcache.c: doucple the locking of local from global data
  fs/mbcache.c: change block and index hash chain to hlist_bl_node
  ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
  ext4: refactor ext4_fallocate code
  ext4: Update inode i_size after the preallocation
  ext4: fix partial cluster handling for bigalloc file systems
  ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
  ext4: only call sync_filesystm() when remounting read-only
  fs: push sync_filesystem() down to the file system's remount_fs()
  jbd2: improve error messages for inconsistent journal heads
  jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
  jbd2: minimize region locked by j_list_lock in journal_get_create_access()
  ...
2014-04-04 15:39:39 -07:00
Linus Torvalds
8e343c8b5c Series of small bug fixes for pstore.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJTO0ytAAoJEKurIx+X31iBRlMP/33oGKncx/iQOi8MqNOvOzXG
 o6y6jVXosn/0sbiSnqcDtne/VstVP4RbmLpwm22tSD5QALGVubdhK+ryJ9PGC7gZ
 2wXUP75UN3BXTqzCSE3vXchQnNJooT2EjX16sZ/PvR4lO1QrWKlozuWvTR9Qhbdb
 czWh/pWrlckhgl1CR01haAdlmFo3M2/6CvkpBW7hDiY8yvBl9g0MrFPHLQH6Y15p
 kAIXQSYYVctT0xhOiMSodl6K0hpH4iRi8t2c5jUfRL6M56Ggfw1h7tVgc25ZU5NO
 /aG7k3MtKosxd7yk+Nz2bArOki22w+jXFSP45hjnZwOP096m4z6jBRawjb9/nNla
 WNfrRCTnBn8caBAasz+b3obAuX25w2YW86ITq+G54541llQCmj07KXBx8Z6TC0bI
 MVFG9wFbCghd0dal4DWkQ+tOLyZFVWWwChNnZahiPONzSqSMMR+Xr6dNNwMjPMm/
 HSM+PNUxvgLRuVvn3zLKaJp+2WugVxqBFyRdMeKRPsqzBqZ9HSkPfqkAuZg/7CZa
 Yq1HBMR8ctWGbpTdxMZ1WEh2FTfuL8l/hr+Ns3Efe7+G2muXmCP2x1TvPOF2djlK
 f7udwp62ZR6D5SZvWW9nUBZTmgCJ264ypk3hS4f4Sbu42bvttyHkYPKMDuGFioYR
 PUXMWhcQoyB4PGh09NMr
 =zOOz
 -----END PGP SIGNATURE-----

Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux

Pull pstore fixes from Tony Luck:
 "Series of small bug fixes for pstore"

* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  pstore: Fix memory leak when decompress using big_oops_buf
  pstore: Fix buffer overflow while write offset equal to buffer size
  pstore: Correct the max_dump_cnt clearing of ramoops
  pstore: Fix NULL pointer fault if get NULL prz in ramoops_get_next_prz
  pstore: skip zero size persistent ram buffer in traverse
  pstore: clarify clearing of _read_cnt in ramoops_context
2014-04-04 15:37:43 -07:00
Linus Torvalds
d15fee814d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
 "This series adds cached writeback support to fuse, improving write
  throughput"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix "uninitialized variable" warning
  fuse: Turn writeback cache on
  fuse: Fix O_DIRECT operations vs cached writeback misorder
  fuse: fuse_flush() should wait on writeback
  fuse: Implement write_begin/write_end callbacks
  fuse: restructure fuse_readpage()
  fuse: Flush files on wb close
  fuse: Trust kernel i_mtime only
  fuse: Trust kernel i_size only
  fuse: Connection bit for enabling writeback
  fuse: Prepare to handle short reads
  fuse: Linking file to inode helper
2014-04-04 15:34:27 -07:00
Linus Torvalds
56c225fe39 dlm for 3.15
This set includes a couple trivial cleanups and changes recovery
 log messages from DEBUG to INFO.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJTPCtSAAoJEDgbc8f8gGmqn3IQALyqRO6spT5fGgPHktLo5lZx
 Qp6uFn1dHwPncQbdCOQPsFNBnCKE76+lfY09NjABCYnJdEyZFye1MwwYDyqQjqEt
 6kUi8LLrkF39KUkOPivV19fUfwNVMoNwAIUKzTSo8wuWE/1+0NL+FIxCUd3rGqxv
 A5yg1nXGTIBPNvZArq8bi49tfriCu167/QXLPlVWIRkyJ69H5ErC+sewufSP/Z5m
 S4yDFGOLy6pu+snT0GHGZcTRAfz37YpHdpAjlxUiurEZFXGvCagF55DaZIiRhf0K
 RDXrm1rgyNQ7H1gfKa92TOijGx62nOYR9F3zlc8c4tQB+eDaWTQWGR0+lgI/PklZ
 VPMHG+lqauaEv9YwKS5UL+E3BuyiAVuwF41mue8lOSA0t+D5P4O2gB+BFuYSpUy1
 Kwt2+S38sqiPwEqRcS/n+x/qx+uaGHxtsQh6cEJw4YM5cslJukkRSx8SUqgoETNB
 uUGjD+pG1lReyUUzzjG74UKRvb7dV+psWRDb56hx2nJrZ2H24s4h7FEHaHJWKoX+
 5H4K3b+E3lqwHp4ARu/yuUYgg/PCsGTsL986135nVnvjV/cHaG4u5N8KP+a9hSyW
 PkubO7kUO34KKH45yhSh79ELsWyw4gFgy0mls2BoDpE31phGkoYbKB+hbIkbeKVQ
 EZLmwKt2aijdo3z0g4bO
 =Ljna
 -----END PGP SIGNATURE-----

Merge tag 'dlm-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This set includes a couple trivial cleanups and changes recovery log
  messages from DEBUG to INFO"

* tag 'dlm-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: use INFO for recovery messages
  fs: Include appropriate header file in dlm/ast.c
  dlm: silence a harmless use after free warning
2014-04-04 15:33:30 -07:00
Linus Torvalds
53c566625f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs changes from Chris Mason:
 "This is a pretty long stream of bug fixes and performance fixes.

  Qu Wenruo has replaced the btrfs async threads with regular kernel
  workqueues.  We'll keep an eye out for performance differences, but
  it's nice to be using more generic code for this.

  We still have some corruption fixes and other patches coming in for
  the merge window, but this batch is tested and ready to go"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits)
  Btrfs: fix a crash of clone with inline extents's split
  btrfs: fix uninit variable warning
  Btrfs: take into account total references when doing backref lookup
  Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
  Btrfs: fix incremental send's decision to delay a dir move/rename
  Btrfs: remove unnecessary inode generation lookup in send
  Btrfs: fix race when updating existing ref head
  btrfs: Add trace for btrfs_workqueue alloc/destroy
  Btrfs: less fs tree lock contention when using autodefrag
  Btrfs: return EPERM when deleting a default subvolume
  Btrfs: add missing kfree in btrfs_destroy_workqueue
  Btrfs: cache extent states in defrag code path
  Btrfs: fix deadlock with nested trans handles
  Btrfs: fix possible empty list access when flushing the delalloc inodes
  Btrfs: split the global ordered extents mutex
  Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
  Btrfs: reclaim delalloc metadata more aggressively
  Btrfs: remove unnecessary lock in may_commit_transaction()
  Btrfs: remove the unnecessary flush when preparing the pages
  Btrfs: just do dirty page flush for the inode with compression before direct IO
  ...
2014-04-04 15:31:36 -07:00
Linus Torvalds
34917f9713 One of the main highlights this time, is not the patches themselves
but instead the widening contributor base. It is good to see that
 interest is increasing in GFS2, and I'd like to thank all the
 contributors to this patch set.
 
 In addition to the usual set of bug fixes and clean ups, there are
 patches to improve inode creation performance when xattrs are
 required and some improvements to the transaction code which is
 intended to help improve scalability after further changes in due
 course.
 
 Journal extent mapping is also updated to make it more efficient
 and again, this is a foundation for future work in this area.
 
 The maximum number of ACLs has been increased to 300 (for a 4k
 block size) which means that even with a few additional xattrs
 from selinux, everything should fit within a single fs block.
 
 There is also a patch to bring GFS2's own copy of the writepages
 code up to the same level as the core VFS. Eventually we may be
 able to merge some of this code, since it is fairly similar.
 
 The other major change this time, is bringing consistency to
 the printing of messages via fs_<level>, pr_<level> macros.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJTOn+JAAoJEMrg3m4a/8jSSoYQALctSOmyGW978JMAKiwuUeSr
 367ho/I/WfZWybWH7iZ/hdEMNCUCnP3C1ZJhYKJ6J60h35p1hIK7DYp9tOy0RsTS
 JD3VamE/jboljXyZaaMCtly7HPQMV82rRmI3+bSoXpT4mPz+PB+kRCe2QkvyVAsh
 5tojtLz6L/In/eo4UlqZjn1BITcYRL5AgMi+8h8h6Foi4MgnFISZbezC6U5eO46P
 DT/xwd0fw+o5ZTm/dTQmhCCH30y4cpKZnNhi+xhHrEm95gBZWcONHD0qyNZe3fBc
 WuGUU9hURHHkqT671T7sBGzfNrsKk1OgNzFNy1YrF5C+t6hpG9iAKRIHtuVqSqPx
 OblhKP0lebDY1L41NqZR4Up+pUjCMxOs3f+FAl2rlHRBIQdroOu82CZHdTBfM/HJ
 1ZvkMrIkxMKb7RtSnTdXsPxcPPZNakHhDaNxHMmMlFlflbXGQqWZaMMhK181d7dn
 Y0WU2ayPmjjUdO5OnekMV5J/hNYNLobnV9OO75j4pyqlnHLIIycc/wgNULcU+OJ6
 GooOQJNnnAo+2JUvS+Ejn88q2if05HOg4fCXRfu4bdA2zDehei1jr5xz5IWj0OAM
 AlmTgUYzK7osvA1XtNd6naCmes+fnm3+Jfh0+YtpeZKgCvaYYCoZCHJccb+a0AIq
 7dTkyCQtgsKE+yjPKose
 =IWmZ
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw

Pull GFS2 updates from Steven Whitehouse:
 "One of the main highlights this time, is not the patches themselves
  but instead the widening contributor base.  It is good to see that
  interest is increasing in GFS2, and I'd like to thank all the
  contributors to this patch set.

  In addition to the usual set of bug fixes and clean ups, there are
  patches to improve inode creation performance when xattrs are required
  and some improvements to the transaction code which is intended to
  help improve scalability after further changes in due course.

  Journal extent mapping is also updated to make it more efficient and
  again, this is a foundation for future work in this area.

  The maximum number of ACLs has been increased to 300 (for a 4k block
  size) which means that even with a few additional xattrs from selinux,
  everything should fit within a single fs block.

  There is also a patch to bring GFS2's own copy of the writepages code
  up to the same level as the core VFS.  Eventually we may be able to
  merge some of this code, since it is fairly similar.

  The other major change this time, is bringing consistency to the
  printing of messages via fs_<level>, pr_<level> macros"

* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (29 commits)
  GFS2: Fix address space from page function
  GFS2: Fix uninitialized VFS inode in gfs2_create_inode
  GFS2: Fix return value in slot_get()
  GFS2: inline function gfs2_set_mode
  GFS2: Remove extraneous function gfs2_security_init
  GFS2: Increase the max number of ACLs
  GFS2: Re-add a call to log_flush_wait when flushing the journal
  GFS2: Ensure workqueue is scheduled after noexp request
  GFS2: check NULL return value in gfs2_ok_to_move
  GFS2: Convert gfs2_lm_withdraw to use fs_err
  GFS2: Use fs_<level> more often
  GFS2: Use pr_<level> more consistently
  GFS2: Move recovery variables to journal structure in memory
  GFS2: global conversion to pr_foo()
  GFS2: return -E2BIG if hit the maximum limits of ACLs
  GFS2: Clean up journal extent mapping
  GFS2: replace kmalloc - __vmalloc / memset 0
  GFS2: Remove extra "if" in gfs2_log_flush()
  fs: NULL dereference in posix_acl_to_xattr()
  GFS2: Move log buffer accounting to transaction
  ...
2014-04-04 14:49:16 -07:00
Linus Torvalds
f7789dc0d4 Merge branch 'locks-3.15' of git://git.samba.org/jlayton/linux
Pull file locking updates from Jeff Layton:
 "Highlights:

   - maintainership change for fs/locks.c.  Willy's not interested in
     maintaining it these days, and is OK with Bruce and I taking it.
   - fix for open vs setlease race that Al ID'ed
   - cleanup and consolidation of file locking code
   - eliminate unneeded BUG() call
   - merge of file-private lock implementation"

* 'locks-3.15' of git://git.samba.org/jlayton/linux:
  locks: make locks_mandatory_area check for file-private locks
  locks: fix locks_mandatory_locked to respect file-private locks
  locks: require that flock->l_pid be set to 0 for file-private locks
  locks: add new fcntl cmd values for handling file private locks
  locks: skip deadlock detection on FL_FILE_PVT locks
  locks: pass the cmd value to fcntl_getlk/getlk64
  locks: report l_pid as -1 for FL_FILE_PVT locks
  locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT"
  locks: rename locks_remove_flock to locks_remove_file
  locks: consolidate checks for compatible filp->f_mode values in setlk handlers
  locks: fix posix lock range overflow handling
  locks: eliminate BUG() call when there's an unexpected lock on file close
  locks: add __acquires and __releases annotations to locks_start and locks_stop
  locks: remove "inline" qualifier from fl_link manipulation functions
  locks: clean up comment typo
  locks: close potential race between setlease and open
  MAINTAINERS: update entry for fs/locks.c
2014-04-04 14:21:20 -07:00
Linus Torvalds
7df934526c Merge branch 'cross-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull renameat2 system call from Miklos Szeredi:
 "This adds a new syscall, renameat2(), which is the same as renameat()
  but with a flags argument.

  The purpose of extending rename is to add cross-rename, a symmetric
  variant of rename, which exchanges the two files.  This allows
  interesting things, which were not possible before, for example
  atomically replacing a directory tree with a symlink, etc...  This
  also allows overlayfs and friends to operate on whiteouts atomically.

  Andy Lutomirski also suggested a "noreplace" flag, which disables the
  overwriting behavior of rename.

  These two flags, RENAME_EXCHANGE and RENAME_NOREPLACE are only
  implemented for ext4 as an example and for testing"

* 'cross-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ext4: add cross rename support
  ext4: rename: split out helper functions
  ext4: rename: move EMLINK check up
  ext4: rename: create ext4_renament structure for local vars
  vfs: add cross-rename
  vfs: lock_two_nondirectories: allow directory args
  security: add flags to rename hooks
  vfs: add RENAME_NOREPLACE flag
  vfs: add renameat2 syscall
  vfs: rename: use common code for dir and non-dir
  vfs: rename: move d_move() up
  vfs: add d_is_dir()
2014-04-04 14:03:05 -07:00
J. Bruce Fields
06f9cc12ca nfsd4: don't create unnecessary mask acl
Any setattr of the ACL attribute, even if it sets just the basic 3-ACE
ACL exactly as it was returned from a file with only mode bits, creates
a mask entry, and it is only the mask, not group, entry that is changed
by subsequent modifications of the mode bits.

So, for example, it's surprising that GROUP@ is left without read or
write permissions after a chmod 0666:

  touch test
  chmod 0600 test
  nfs4_getfacl test
        A::OWNER@:rwatTcCy
        A::GROUP@:tcy
        A::EVERYONE@:tcy
  nfs4_getfacl test | nfs4_setfacl -S - test #
  chmod 0666 test
  nfs4_getfacl test
        A::OWNER@:rwatTcCy
        A::GROUP@:tcy
        D::GROUP@:rwa
        A::EVERYONE@:rwatcy

So, let's stop creating the unnecessary mask ACL.

A mask will still be created on non-trivial ACLs (ACLs with actual named
user and group ACEs), so the odd posix-acl behavior of chmod modifying
only the mask will still be left in that case; but that's consistent
with local behavior.

Reported-by: Soumya Koduri <skoduri@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-04-04 10:13:23 -04:00
J. Bruce Fields
082f31a216 nfsd: revert v2 half of "nfsd: don't return high mode bits"
This reverts the part of commit 6e14b46b91
that changes NFSv2 behavior.

Mark Lord found that it broke nfs-root for Linux clients, because it
broke NFSv2.

In fact, from RFC 1094:

	"Notice that the file type is specified both in the mode bits
	and in the file type.  This is really a bug in the protocol and
	will be fixed in future versions."

So NFSv2 clients really are expected to depend on the high bits of the
mode.

Cc: stable@kernel.org
Reported-by: Mark Lord <mlord@pobox.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-04-04 10:13:07 -04:00
Linus Torvalds
76ca7d1cca Merge branch 'akpm' (incoming from Andrew)
Merge first patch-bomb from Andrew Morton:
 - Various misc bits
 - kmemleak fixes
 - small befs, codafs, cifs, efs, freexxfs, hfsplus, minixfs, reiserfs things
 - fanotify
 - I appear to have become SuperH maintainer
 - ocfs2 updates
 - direct-io tweaks
 - a bit of the MM queue
 - printk updates
 - MAINTAINERS maintenance
 - some backlight things
 - lib/ updates
 - checkpatch updates
 - the rtc queue
 - nilfs2 updates
 - Small Documentation/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (237 commits)
  Documentation/SubmittingPatches: remove references to patch-scripts
  Documentation/SubmittingPatches: update some dead URLs
  Documentation/filesystems/ntfs.txt: remove changelog reference
  Documentation/kmemleak.txt: updates
  fs/reiserfs/super.c: add __init to init_inodecache
  fs/reiserfs: move prototype declaration to header file
  fs/hfsplus/attributes.c: add __init to hfsplus_create_attr_tree_cache()
  fs/hfsplus/extents.c: fix concurrent acess of alloc_blocks
  fs/hfsplus/extents.c: remove unused variable in hfsplus_get_block
  nilfs2: update project's web site in nilfs2.txt
  nilfs2: update MAINTAINERS file entries fix
  nilfs2: verify metadata sizes read from disk
  nilfs2: add FITRIM ioctl support for nilfs2
  nilfs2: add nilfs_sufile_trim_fs to trim clean segs
  nilfs2: implementation of NILFS_IOCTL_SET_SUINFO ioctl
  nilfs2: add nilfs_sufile_set_suinfo to update segment usage
  nilfs2: add struct nilfs_suinfo_update and flags
  nilfs2: update MAINTAINERS file entries
  fs/coda/inode.c: add __init to init_inodecache()
  BEFS: logging cleanup
  ...
2014-04-03 16:22:16 -07:00
Fabian Frederick
31e143686a fs/reiserfs/super.c: add __init to init_inodecache
init_inodecache is only called by __init init_reiserfs_fs.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:26 -07:00
Rashika Kheria
ea0856cd50 fs/reiserfs: move prototype declaration to header file
Move prototype declaration to header file reiserfs/reiserfs.h from
reiserfs/super.c because they are used by more than one file.

This eliminates the following warning in reiserfs/bitmap.c:

  fs/reiserfs/bitmap.c:647:6: warning: no previous prototype for `show_alloc_options' [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:26 -07:00
Fabian Frederick
c11e614d71 fs/hfsplus/attributes.c: add __init to hfsplus_create_attr_tree_cache()
hfsplus_create_attr_tree_cache is only called by __init init_hfsplus_fs

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:26 -07:00
Sougata Santra
d7bdb996ae fs/hfsplus/extents.c: fix concurrent acess of alloc_blocks
Concurrent access to alloc_blocks in hfsplus_inode_info() is protected
by extents_lock mutex.  This patch fixes two instances where
alloc_blocks modification was not protected with this lock.

This fixes possible allocation bitmap corruption in race conditions
while extending and truncating files.

[akpm@linux-foundation.org: take extents_lock before taking a copy of ->alloc_blocks]
[akpm@linux-foundation.org: remove now-unused label `out']
Signed-off-by: Sougata Santra <sougata@tuxera.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:26 -07:00
Sougata Santra
abfeb724b4 fs/hfsplus/extents.c: remove unused variable in hfsplus_get_block
The variable is defined but not used.  Generally it compiles away with
-O2 optimization hence it does not show a warning.

Signed-off-by: Sougata Santra <sougata@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:26 -07:00
Ryusuke Konishi
0ec060d188 nilfs2: verify metadata sizes read from disk
Add code to check sizes of on-disk data of metadata files such as inode
size, segment usage size, DAT entry size, and checkpoint size.  Although
these sizes are read from disk, the current implementation doesn't check
them.

If these sizes are not sane on disk, it can cause out-of-range access to
metadata or memory access overrun on metadata block buffers due to
overflow in sundry calculations.

Both lower limit and upper limit of metadata sizes are verified to
prevent these issues.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:26 -07:00
Andreas Rohner
f9f32c44e7 nilfs2: add FITRIM ioctl support for nilfs2
Add support for the FITRIM ioctl, which enables user space tools to
issue TRIM/DISCARD requests to the underlying device.  Every clean
segment within the specified range will be discarded.

Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:26 -07:00
Andreas Rohner
82e11e857b nilfs2: add nilfs_sufile_trim_fs to trim clean segs
Add nilfs_sufile_trim_fs(), which takes an fstrim_range structure and
calls blkdev_issue_discard for every clean segment in the specified
range.  The range is truncated to file system block boundaries.

Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:25 -07:00
Andreas Rohner
2cc88f3a5f nilfs2: implementation of NILFS_IOCTL_SET_SUINFO ioctl
With this ioctl the segment usage entries in the SUFILE can be updated
from userspace.

This is useful, because it allows the userspace GC to modify and update
segment usage entries for specific segments, which enables it to avoid
unnecessary write operations.

If a segment needs to be cleaned, but there is no or very little
reclaimable space in it, the cleaning operation basically degrades to a
useless moving operation.  In the end the only thing that changes is the
location of the data and a timestamp in the segment usage information.
With this ioctl the GC can skip the cleaning and update the segment
usage entries directly instead.

This is basically a shortcut to cleaning the segment.  It is still
necessary to read the segment summary information, but the writing of
the live blocks can be skipped if it's not worth it.

[konishi.ryusuke@lab.ntt.co.jp: add description of NILFS_IOCTL_SET_SUINFO ioctl]
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:25 -07:00
Andreas Rohner
00e9ffcd27 nilfs2: add nilfs_sufile_set_suinfo to update segment usage
Introduce nilfs_sufile_set_suinfo(), which expects an array of
nilfs_suinfo_update structures and updates the segment usage information
accordingly.

This is basically a helper function for the newly introduced
NILFS_IOCTL_SET_SUINFO ioctl.

[konishi.ryusuke@lab.ntt.co.jp: use put_bh() instead of brelse() because we know bh != NULL]
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:25 -07:00
Fabian Frederick
5f356fd4d7 fs/coda/inode.c: add __init to init_inodecache()
init_inodecache is only called by __init init_coda

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:25 -07:00
Fabian Frederick
dac52fc182 BEFS: logging cleanup
Summary:
 - all printk(KERN_foo converted to pr_foo()
 - add pr_fmt and remove redundant prefixes
 - convert befs_() to va_format (based on patch by Joe Perches)
 - remove non standard %Lu
 - use __func__ for all debugging

[akpm@linux-foundation.org: fix printk warnings, reported by Fengguang]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Cc: Fengguang Wu <fengguang.wu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:25 -07:00
Fabian Frederick
91a52ab7d6 fs/befs/linuxvfs.c: add __init to befs_init_inodecache()
init_inodecache is only called by __init init_befs_fs.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:24 -07:00
Fabian Frederick
2d4319ef57 befs: replace kmalloc/memset 0 by kzalloc
Use kzalloc for clean fs_info allocation like other filesystems.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:24 -07:00
Fabian Frederick
84ee353df0 fs/minix/inode.c: add __init to init_inodecache()
init_inodecache is only called by __init init_minix_fs.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:24 -07:00
Luis Henriques
b003f96502 binfmt_misc: add missing 'break' statement
A missing 'break' statement in bm_status_write() results in a user program
receiving '3' when doing the following:

  write(fd, "-1", 2);

Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:16 -07:00
Fabian Frederick
7a42d4b6a1 fs/efs/super.c: add __init to init_inodecache()
init_inodecache is only called by __init init_efs_fs.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:16 -07:00
Josh Triplett
69369a7003 fs, kernel: permit disabling the uselib syscall
uselib hasn't been used since libc5; glibc does not use it.  Support
turning it off.

When disabled, also omit the load_elf_library implementation from
binfmt_elf.c, which only uselib invokes.

bloat-o-meter:
add/remove: 0/4 grow/shrink: 0/1 up/down: 0/-785 (-785)
function                                     old     new   delta
padzero                                       39      36      -3
uselib_flags                                  20       -     -20
sys_uselib                                   168       -    -168
SyS_uselib                                   168       -    -168
load_elf_library                             426       -    -426

The new CONFIG_USELIB defaults to `y'.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:05 -07:00
Wang YanQing
8f6c5ffc89 kernel/groups.c: remove return value of set_groups
After commit 6307f8fee2 ("security: remove dead hook task_setgroups"),
set_groups will always return zero, so we could just remove return value
of set_groups.

This patch reduces code size, and simplfies code to use set_groups,
because we don't need to check its return value any more.

[akpm@linux-foundation.org: remove obsolete claims from set_groups() comment]
Signed-off-by: Wang YanQing <udknight@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:05 -07:00
Fabian Frederick
6af9f7bf3c sys_sysfs: Add CONFIG_SYSFS_SYSCALL
sys_sysfs is an obsolete system call no longer supported by libc.

 - This patch adds a default CONFIG_SYSFS_SYSCALL=y

 - Option can be turned off in expert mode.

 - cond_syscall added to kernel/sys_ni.c

[akpm@linux-foundation.org: tweak Kconfig help text]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:05 -07:00
Dave Hansen
5509a5d27b drop_caches: add some documentation and info message
There is plenty of anecdotal evidence and a load of blog posts
suggesting that using "drop_caches" periodically keeps your system
running in "tip top shape".  Perhaps adding some kernel documentation
will increase the amount of accurate data on its use.

If we are not shrinking caches effectively, then we have real bugs.
Using drop_caches will simply mask the bugs and make them harder to
find, but certainly does not fix them, nor is it an appropriate
"workaround" to limit the size of the caches.  On the contrary, there
have been bug reports on issues that turned out to be misguided use of
cache dropping.

Dropping caches is a very drastic and disruptive operation that is good
for debugging and running tests, but if it creates bug reports from
production use, kernel developers should be aware of its use.

Add a bit more documentation about it, a syslog message to track down
abusers, and vmstat drop counters to help analyze problem reports.

[akpm@linux-foundation.org: checkpatch fixes]
[hannes@cmpxchg.org: add runtime suppression control]
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:04 -07:00
Sasha Levin
67f9fd91f9 mm: remove read_cache_page_async()
This patch removes read_cache_page_async() which wasn't really needed
anywhere and simplifies the code around it a bit.

read_cache_page_async() is useful when we want to read a page into the
cache without waiting for it to complete.  This happens when the
appropriate callback 'filler' doesn't complete its read operation and
releases the page lock immediately, and instead queues a different
completion routine to do that.  This never actually happened anywhere in
the code.

read_cache_page_async() had 3 different callers:

- read_cache_page() which is the sync version, it would just wait for
  the requested read to complete using wait_on_page_read().

- JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
  function it supplied doesn't do any async reads, and would complete
  before the filler function returns - making it actually a sync read.

- CRAMFS would call it using the read_mapping_page_async() wrapper, with
  a similar story to JFFS2 - the filler function doesn't do anything that
  reminds async reads and would always complete before the filler function
  returns.

To sum it up, the code in mm/filemap.c never took advantage of having
read_cache_page_async().  While there are filler callbacks that do async
reads (such as the block one), we always called it with the
read_cache_page().

This patch adds a mandatory wait for read to complete when adding a new
page to the cache, and removes read_cache_page_async() and its wrappers.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:04 -07:00
Johannes Weiner
91b0abe36a mm + fs: store shadow entries in page cache
Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page.  As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently.  At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate.  Reclaim will check
for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:01 -07:00
Johannes Weiner
0cd6144aad mm + fs: prepare for non-page entries in page cache radix trees
shmem mappings already contain exceptional entries where swap slot
information is remembered.

To be able to store eviction information for regular page cache, prepare
every site dealing with the radix trees directly to handle entries other
than pages.

The common lookup functions will filter out non-page entries and return
NULL for page cache holes, just as before.  But provide a raw version of
the API which returns non-page entries as well, and switch shmem over to
use it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:00 -07:00
Johannes Weiner
e7b563bb2a mm: filemap: move radix tree hole searching here
The radix tree hole searching code is only used for page cache, for
example the readahead code trying to get a a picture of the area
surrounding a fault.

It sufficed to rely on the radix tree definition of holes, which is
"empty tree slot".  But this is about to change, though, as shadow page
descriptors will be stored in the page cache after the actual pages get
evicted from memory.

Move the functions over to mm/filemap.c and make them native page cache
operations, where they can later be adapted to handle the new definition
of "page cache hole".

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:00 -07:00
Johannes Weiner
55881bc76f fs: cachefiles: use add_to_page_cache_lru()
This code used to have its own lru cache pagevec up until a0b8cab3 ("mm:
remove lru parameter from __pagevec_lru_add and remove parts of pagevec
API").  Now it's just add_to_page_cache() followed by lru_cache_add(),
might as well use add_to_page_cache_lru() directly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:00 -07:00
Joonsoo Kim
9119a41e90 mm, hugetlb: unify region structure handling
Currently, to track reserved and allocated regions, we use two different
ways, depending on the mapping.  For MAP_SHARED, we use
address_mapping's private_list and, while for MAP_PRIVATE, we use a
resv_map.

Now, we are preparing to change a coarse grained lock which protect a
region structure to fine grained lock, and this difference hinder it.
So, before changing it, unify region structure handling, consistently
using a resv_map regardless of the kind of mapping.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:59 -07:00
Dan Carpenter
45d4f85504 fs/direct-io.c: remove some left over checks
We know that "ret > 0" is true here.  These tests were left over from
commit 02afc27fae ('direct-io: Handle O_(D)SYNC AIO') and aren't
needed any more.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:57 -07:00
Gu Zheng
2b665e276c fs/direct-io.c: remove redundant comparison
The return value of bio_get_nr_vecs() cannot be bigger than
BIO_MAX_PAGES, so we can remove redundant the comparison between
nr_pages and BIO_MAX_PAGES.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:57 -07:00
Wengang Wang
9c339255cb ocfs2: pass "new" parameter to ocfs2_init_xattr_bucket
This patch fixes the following crash:

  kernel BUG at fs/ocfs2/uptodate.c:530!
  Modules linked in: ocfs2(F) ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn xenfs xen_privcmd sunrpc 8021q garp stp llc bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iTCO_wdt iTCO_vendor_support dcdbas coretemp freq_table mperf microcode pcspkr serio_raw bnx2 lpc_ich mfd_core i5k_amb i5000_edac edac_core e1000e sg shpchp ext4(F) jbd2(F) mbcache(F) dm_round_robin(F) sr_mod(F) cdrom(F) usb_storage(F) sd_mod(F) crc_t10dif(F) pata_acpi(F) ata_generic(F) ata_piix(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) radeon(F)
   ttm(F) drm_kms_helper(F) drm(F) hwmon(F) i2c_algo_bit(F) i2c_core(F) dm_multipath(F) dm_mirror(F) dm_region_hash(F) dm_log(F) dm_mod(F)
  CPU 5
  Pid: 21303, comm: xattr-test Tainted: GF       W    3.8.13-30.el6uek.x86_64 #2 Dell Inc. PowerEdge 1950/0M788G
  RIP: ocfs2_set_new_buffer_uptodate+0x51/0x60 [ocfs2]
  Process xattr-test (pid: 21303, threadinfo ffff880017aca000, task ffff880016a2c480)
  Call Trace:
    ocfs2_init_xattr_bucket+0x8a/0x120 [ocfs2]
    ocfs2_cp_xattr_bucket+0xbb/0x1b0 [ocfs2]
    ocfs2_extend_xattr_bucket+0x20a/0x2f0 [ocfs2]
    ocfs2_add_new_xattr_bucket+0x23e/0x4b0 [ocfs2]
    ocfs2_xattr_set_entry_index_block+0x13c/0x3d0 [ocfs2]
    ocfs2_xattr_block_set+0xf9/0x220 [ocfs2]
    __ocfs2_xattr_set_handle+0x118/0x710 [ocfs2]
    ocfs2_xattr_set+0x691/0x880 [ocfs2]
    ocfs2_xattr_user_set+0x46/0x50 [ocfs2]
    generic_setxattr+0x96/0xa0
    __vfs_setxattr_noperm+0x7b/0x170
    vfs_setxattr+0xbc/0xc0
    setxattr+0xde/0x230
    sys_fsetxattr+0xc6/0xf0
    system_call_fastpath+0x16/0x1b
  Code: 41 80 0c 24 01 48 89 df e8 7d f0 ff ff 4c 89 e6 48 89 df e8 a2 fe ff ff 48 89 df e8 3a f0 ff ff 48 8b 1c 24 4c 8b 64 24 08 c9 c3 <0f> 0b eb fe 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 66 66
  RIP  ocfs2_set_new_buffer_uptodate+0x51/0x60 [ocfs2]

It hit the BUG_ON() in ocfs2_set_new_buffer_uptodate():

    void ocfs2_set_new_buffer_uptodate(struct ocfs2_caching_info *ci,
                                       struct buffer_head *bh)
    {
          /* This should definitely *not* exist in our cache */
          if (ocfs2_buffer_cached(ci, bh))
                  printk(KERN_ERR "bh->b_blocknr: %lu @ %p\n", bh->b_blocknr, bh);
          BUG_ON(ocfs2_buffer_cached(ci, bh));

          set_buffer_uptodate(bh);

          ocfs2_metadata_cache_io_lock(ci);
          ocfs2_set_buffer_uptodate(ci, bh);
          ocfs2_metadata_cache_io_unlock(ci);
    }

The problem here is:

We cached a block, but the buffer_head got reused.  When we are to pick
up this block again, a new buffer_head created with UPTODATE flag
cleared.  ocfs2_buffer_uptodate() returned false since no UPTODATE is
set on the buffer_head.  so we set this block to cache as a NEW block,
then it failed at asserting block is not in cache.

The fix is to add a new parameter indicating the bucket is a new
allocated or not to ocfs2_init_xattr_bucket().
ocfs2_init_xattr_bucket() assert block not cached accordingly.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:57 -07:00
jiangyiwen
43b10a2037 ocfs2: avoid system inode ref confusion by adding mutex lock
The following case may lead to the same system inode ref in confusion.

A thread                            B thread
ocfs2_get_system_file_inode
->get_local_system_inode
->_ocfs2_get_system_file_inode
                                    because of *arr == NULL,
                                    ocfs2_get_system_file_inode
                                    ->get_local_system_inode
                                    ->_ocfs2_get_system_file_inode
gets first ref thru
_ocfs2_get_system_file_inode,
gets second ref thru igrab and
set *arr = inode
                                    at the moment, B thread also gets
                                    two refs, so lead to one more
                                    inode ref.

So add mutex lock to avoid multi thread set two inode ref once at the
same time.

Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:57 -07:00
jiangyiwen
7dc3e83901 ocfs2: iput inode alloc when failed locally
In ocfs2_info_handle_freeinode() and ocfs2_test_inode_bit() func, after
calls ocfs2_get_system_file_inode() to get inode ref, if calls
ocfs2_info_scan_inode_alloc() or ocfs2_inode_lock() failed, we should
iput inode alloc to avoid leaking the inode.

Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:57 -07:00
Tariq Saeed
da8ded405d ocfs2/o2net: o2net_listen_data_ready should do nothing if socket state is not TCP_LISTEN
Orabug: 17330860

When accepting an incomming connection o2net_accept_one clones a child
data socket from the parent listening socket.  It then proceeds to setup
the child with callback o2net_data_ready() and sk_user_data to NULL.  If
data arrives in this window, o2net_listen_data_ready will be called with
some non-deterministic value in sk_user_data (not inherited).  We panic
when we page fault on sk_user_data -- in parent it is
sock_def_readable().

The fix is to recognize that this is a data socket being set up by
looking at the socket state and do nothing.

Signed-off-by: Tariq Saseed <tariq.x.saeed@oracle.com>
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:56 -07:00
Younger Liu
db66c71577 ocfs2: rollback alloc_dinode counts when ocfs2_block_group_set_bits() failed
After updating alloc_dinode counts in ocfs2_alloc_dinode_update_counts(),
if ocfs2_alloc_dinode_update_bitmap() failed, there is a rare case that
some space may be lost.

So, roll back alloc_dinode counts when ocfs2_block_group_set_bits()
failed.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Younger Liu <younger.liucn@gmail.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:56 -07:00
Wengang Wang
e228f64398 ocfs2: flock: drop cross-node lock when failed locally
ocfs2_do_flock() calls ocfs2_file_lock() to get the cross-node clock and
then call flock_lock_file_wait() to compete with local processes.  In
case flock_lock_file_wait() failed, say -ENOMEM, clean up work is not
done.  This patch adds the cleanup --drop the cross-node lock which was
just granted.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:56 -07:00
Darrick J. Wong
6fdb702d62 ocfs2: call ocfs2_update_inode_fsync_trans when updating any inode
Ensure that ocfs2_update_inode_fsync_trans() is called any time we touch
an inode in a given transaction.  This is a follow-on to the previous
patch to reduce lock contention and deadlocking during an fsync
operation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Wengang <wen.gang.wang@oracle.com>
Cc: Greg Marsden <greg.marsden@oracle.com>
Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:56 -07:00
Tetsuo Handa
f81c20158f ocfs2: fix panic on kfree(xattr->name)
Commit 9548906b2b ('xattr: Constify ->name member of "struct xattr"')
missed that ocfs2 is calling kfree(xattr->name).  As a result, kernel
panic occurs upon calling kfree(xattr->name) because xattr->name refers
static constant names.  This patch removes kfree(xattr->name) from
ocfs2_mknod() and ocfs2_symlink().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Tested-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>	[3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:56 -07:00
alex chen
f7cf4f5bfe ocfs2: do not put bh when buffer_uptodate failed
Do not put bh when buffer_uptodate failed in ocfs2_write_block and
ocfs2_write_super_or_backup, because it will put bh in b_end_io.
Otherwise it will hit a warning "VFS: brelse: Trying to free free
buffer".

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:56 -07:00
Xue jiufei
466e68c430 ocfs2: __ocfs2_mknod_locked should return error when ocfs2_create_new_inode_locks() failed
When ocfs2_create_new_inode_locks() return error, inode open lock may
not be obtainted for this inode.  So other nodes can remove this file
and free dinode when inode still remain in memory on this node, which is
not correct and may trigger BUG.  So __ocfs2_mknod_locked should return
error when ocfs2_create_new_inode_locks() failed.

              Node_1                              Node_2
create fileA, call ocfs2_mknod()
  -> ocfs2_get_init_inode(), allocate inodeA
  -> ocfs2_claim_new_inode(), claim dinode(dinodeA)
  -> call ocfs2_create_new_inode_locks(),
     create open lock failed, return error
  -> __ocfs2_mknod_locked return success

                                                unlink fileA
                                                try open lock succeed,
                                                and free dinodeA

create another file, call ocfs2_mknod()
  -> ocfs2_get_init_inode(), allocate inodeB
  -> ocfs2_claim_new_inode(), as Node_2 had freed dinodeA,
     so claim dinodeA and update generation for dinodeA

call __ocfs2_drop_dl_inodes()->ocfs2_delete_inode()
to free inodeA, and finally triggers BUG
on(inode->i_generation != le32_to_cpu(fe->i_generation))
in function ocfs2_inode_lock_update().

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:55 -07:00
Tariq Saeed
3ed2be719e ocfs2: allow for more than one data extent when creating xattr
Orabug: 18108070

ocfs2_xattr_extend_allocation() hits panic when creating xattr during
data extent alloc phase.  The problem occurs if due to local alloc
fragmentation, clusters are spread over multiple extents.  In this case
ocfs2_add_clusters_in_btree() finds no space to store more than one
extent record and therefore fails returning RESTART_META.  The situation
is anticipated for xattr update case but not xattr create case.  This
fix simply ports that code to create case.

Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:55 -07:00
Zhonghua Guo
a35ad97cd4 ocfs2: fix deadlock risk when kmalloc failed in dlm_query_region_handler
In dlm_query_region_handler(), once kmalloc failed, it will unlock
dlm_domain_lock without lock first, then deadlock happens.

Signed-off-by: Zhonghua Guo <guozhonghua@h3c.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Tested-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:55 -07:00
Jensen
c8d888d9f1 ocfs2: llseek requires ocfs2 inode lock for the file in SEEK_END
llseek requires ocfs2 inode lock for updating the file size in SEEK_END.
because the file size maybe update on another node.

This bug can be reproduce the following scenario: at first, we dd a test
fileA, the file size is 10k.

on NodeA:
---------
 1) open the test fileA, lseek the end of file. and print the position.
 2) close the test fileA

on NodeB:
 1) open the test fileA, append the 5k data to test FileA.
 2) lseek the end of file. and print the position.
 3) close file.

At first we run the test program1 on NodeA , the result is 10k.  And
then run the test program2 on NodeB, the result is 15k.  At last, we run
the test program1 on NodeA again, the result is 10k.

After applying this patch the three step result is 15k.

test result: 1000000 times lseek call;
index        lseek with inode lock (unit:us)                lseek without inode lock (unit:us)
  1                   1168162                                    555383
  2                   1168011                                    549504
  3                   1170538                                    549396
  4                   1170375                                    551685
  5                   1170444                                    556719
  6                   1174364                                    555307
  7                   1163294                                    551552
  8                   1170080                                    549350
  9                   1162464                                    553700
 10                   1165441                                    552594
 avg                  1168317                                    552519

avg with lock - avg without lock = 615798
(avg with lock - avg without lock)/1000000=0.615798 us

Signed-off-by: Jensen <shencanquan@huawei.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:55 -07:00
Joseph Qi
41b63efb68 ocfs2: fix type conversion risk when get cluster attributes
In o2nm_cluster, cl_idle_timeout_ms, cl_keepalive_delay_ms, as well as
cl_reconnect_delay_ms, are defined as type of unsigned int.  So we
should also use unsigned int in the helper functions.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:55 -07:00
Goldwyn Rodrigues
8ed6b23709 ocfs2: revert iput deferring code in ocfs2_drop_dentry_lock
The following patches are reverted in this patch because these patches
caused performance regression in the remote unlink() calls.

  ea455f8ab6 - ocfs2: Push out dropping of dentry lock to ocfs2_wq
  f7b1aa69be - ocfs2: Fix deadlock on umount
  5fd1318937 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount

Previous patches in this series removed the possible deadlocks from
downconvert thread so the above patches shouldn't be needed anymore.

The regression is caused because these patches delay the iput() in case
of dentry unlocks.  This also delays the unlocking of the open lockres.
The open lockresource is required to test if the inode can be wiped from
disk or not.  When the deleting node does not get the open lock, it
marks it as orphan (even though it is not in use by another
node/process) and causes a journal checkpoint.  This delays operations
following the inode eviction.  This also moves the inode to the orphaned
inode which further causes more I/O and a lot of unneccessary orphans.

The following script can be used to generate the load causing issues:

  declare -a create
  declare -a remove
  declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384)
  unique="`mktemp -u XXXXX`"
  script="/tmp/idontknow-${unique}.sh"
  cat <<EOF > "${script}"
  for n in {1..8}; do mkdir -p test/dir\${n}
    eval touch test/dir\${n}/foo{1.."\$1"}
  done
  EOF
  chmod 700 "${script}"

  function fcreate ()
  {
    exec 2>&1 /usr/bin/time --format=%E "${script}" "$1"
  }

  function fremove ()
  {
    exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*"
  }

  function fcp ()
  {
    exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new"
  }

  echo -------------------------------------------------
  echo "| # files | create #s | copy #s | remove #s |"
  echo -------------------------------------------------
  for ((x=0; x < ${#iterations[*]} ; x++)) do
    create[$x]="`fcreate ${iterations[$x]}`"
    copy[$x]="`fcp ${iterations[$x]}`"
    remove[$x]="`fremove`"
    printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]}
  done
  rm "${script}"
  echo "------------------------"

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:55 -07:00
Jan Kara
84d86f83f9 ocfs2: avoid blocking in ocfs2_mark_lockres_freeing() in downconvert thread
If we are dropping last inode reference from downconvert thread, we will
end up calling ocfs2_mark_lockres_freeing() which can block if the lock
we are freeing is queued thus creating an A-A deadlock.  Luckily, since
we are the downconvert thread, we can immediately dequeue the lock and
thus avoid waiting in this case.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:55 -07:00
Jan Kara
e3a767b60f ocfs2: implement delayed dropping of last dquot reference
We cannot drop last dquot reference from downconvert thread as that
creates the following deadlock:

NODE 1                                  NODE2
holds dentry lock for 'foo'
holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
                                        dquot_initialize(bar)
                                          ocfs2_dquot_acquire()
                                            ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                                            ...
downconvert thread (triggered from another
node or a different process from NODE2)
  ocfs2_dentry_post_unlock()
    ...
    iput(foo)
      ocfs2_evict_inode(foo)
        ocfs2_clear_inode(foo)
          dquot_drop(inode)
            ...
	    ocfs2_dquot_release()
              ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
               - blocks
                                            finds we need more space in
                                            quota file
                                            ...
                                            ocfs2_extend_no_holes()
                                              ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
                                                - deadlocks waiting for
                                                  downconvert thread

We solve the problem by postponing dropping of the last dquot reference to
a workqueue if it happens from the downconvert thread.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Jan Kara
9f985cb6c4 quota: provide function to grab quota structure reference
Provide dqgrab() function to get quota structure reference when we are
sure it already has at least one active reference.  Make use of this
function inside quota code.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Jan Kara
bd62ad7aeb ocfs2: move dquot_initialize() in ocfs2_delete_inode() somewhat later
Move dquot_initalize() call in ocfs2_delete_inode() after the moment we
verify inode is actually a sane one to delete.  We certainly don't want
to initialize quota for system inodes etc.  This also avoids calling
into quota code from downconvert thread.

Add more details into the comment why bailing out from
ocfs2_delete_inode() when we are in downconvert thread is OK.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Jan Kara
7bf619c142 ocfs2: remove OCFS2_INODE_SKIP_DELETE flag
The flag was never set, delete it.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Goldwyn Rodrigues
765aabbbc7 ocfs2: add dlm_recover_callback_support in sysfs
This is a part of the nocontrold feature which was incorporated sometime
back.

This is required for backward compatibility of the tools, specifically
the scenario where the tools with recovery callback is used with a
kernel not using the recovery callbacks (older kernel + newer tools).
The tools look for this file to understand if the kernel supports DLM
recovery callbacks.

For kernels which support recovery callbacks but will miss this patch,
ocfs2 will continue to use the older API and would still be able to
mount the filesystem.

[akpm@linux-foundation.org: simplify]
[sfr@canb.auug.org.au: VERIFY_OCTAL_PERMISSIONS fix up]
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Junxiao Bi
ded2cf7141 ocfs2: dlm: fix recovery hung
There is a race window in dlm_do_recovery() between dlm_remaster_locks()
and dlm_reset_recovery() when the recovery master nearly finish the
recovery process for a dead node.  After the master sends FINALIZE_RECO
message in dlm_remaster_locks(), another node may become the recovery
master for another dead node, and then send the BEGIN_RECO message to
all the nodes included the old master, in the handler of this message
dlm_begin_reco_handler() of old master, dlm->reco.dead_node and
dlm->reco.new_master will be set to the second dead node and the new
master, then in dlm_reset_recovery(), these two variables will be reset
to default value.  This will cause new recovery master can not finish
the recovery process and hung, at last the whole cluster will hung for
recovery.

old recovery master:                                 new recovery master:
dlm_remaster_locks()
                                                  become recovery master for
                                                  another dead node.
                                                  dlm_send_begin_reco_message()
dlm_begin_reco_handler()
{
 if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) {
  return -EAGAIN;
 }
 dlm_set_reco_master(dlm, br->node_idx);
 dlm_set_reco_dead_node(dlm, br->dead_node);
}
dlm_reset_recovery()
{
 dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM);
 dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM);
}
                                                  will hang in dlm_remaster_locks() for
                                                  request dlm locks info

Before send FINALIZE_RECO message, recovery master should set
DLM_RECO_STATE_FINALIZE for itself and clear it after the recovery done,
this can break the race windows as the BEGIN_RECO messages will not be
handled before DLM_RECO_STATE_FINALIZE flag is cleared.

A similar race may happen between new recovery master and normal node
which is in dlm_finalize_reco_handler(), also fix it.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Junxiao Bi
34aa8dac48 ocfs2: dlm: fix lock migration crash
This issue was introduced by commit 800deef3f6 ("ocfs2: use
list_for_each_entry where benefical") in 2007 where it replaced
list_for_each with list_for_each_entry.  The variable "lock" will point
to invalid data if "tmpq" list is empty and a panic will be triggered
due to this.  Sunil advised reverting it back, but the old version was
also not right.  At the end of the outer for loop, that
list_for_each_entry will also set "lock" to an invalid data, then in the
next loop, if the "tmpq" list is empty, "lock" will be an stale invalid
data and cause the panic.  So reverting the list_for_each back and reset
"lock" to NULL to fix this issue.

Another concern is that this seemes can not happen because the "tmpq"
list should not be empty.  Let me describe how.

old lock resource owner(node 1):                                  migratation target(node 2):
image there's lockres with a EX lock from node 2 in
granted list, a NR lock from node x with convert_type
EX in converting list.
dlm_empty_lockres() {
 dlm_pick_migration_target() {
   pick node 2 as target as its lock is the first one
   in granted list.
 }
 dlm_migrate_lockres() {
   dlm_mark_lockres_migrating() {
     res->state |= DLM_LOCK_RES_BLOCK_DIRTY;
     wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
	 //after the above code, we can not dirty lockres any more,
     // so dlm_thread shuffle list will not run
                                                                   downconvert lock from EX to NR
                                                                   upconvert lock from NR to EX
<<< migration may schedule out here, then
<<< node 2 send down convert request to convert type from EX to
<<< NR, then send up convert request to convert type from NR to
<<< EX, at this time, lockres granted list is empty, and two locks
<<< in the converting list, node x up convert lock followed by
<<< node 2 up convert lock.

	 // will set lockres RES_MIGRATING flag, the following
	 // lock/unlock can not run
     dlm_lockres_release_ast(dlm, res);
   }

   dlm_send_one_lockres()
                                                                 dlm_process_recovery_data()
                                                                   for (i=0; i<mres->num_locks; i++)
                                                                     if (ml->node == dlm->node_num)
                                                                       for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
                                                                        list_for_each_entry(lock, tmpq, list)
                                                                        if (lock) break; <<< lock is invalid as grant list is empty.
                                                                       }
                                                                       if (lock->ml.node != ml->node)
                                                                         BUG() >>> crash here
 }

I see the above locks status from a vmcore of our internal bug.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Darrick J. Wong
2931cdcb49 ocfs2: improve fsync efficiency and fix deadlock between aio_write and sync_file
Currently, ocfs2_sync_file grabs i_mutex and forces the current journal
transaction to complete.  This isn't terribly efficient, since sync_file
really only needs to wait for the last transaction involving that inode
to complete, and this doesn't require i_mutex.

Therefore, implement the necessary bits to track the newest tid
associated with an inode, and teach sync_file to wait for that instead
of waiting for everything in the journal to commit.  Furthermore, only
issue the flush request to the drive if jbd2 hasn't already done so.

This also eliminates the deadlock between ocfs2_file_aio_write() and
ocfs2_sync_file().  aio_write takes i_mutex then calls
ocfs2_aiodio_wait() to wait for unaligned dio writes to finish.
However, if that dio completion involves calling fsync, then we can get
into trouble when some ocfs2_sync_file tries to take i_mutex.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:53 -07:00
joyce.xue
a75fe48cad ocfs2: remove unused variable uuid_net_key in ocfs2_initialize_super
Variable uuid_net_key in ocfs2_initialize_super() is not used.  Clean it
up.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:53 -07:00
Wengang Wang
c18ceab012 ocfs2: change ip_unaligned_aio to of type mutex from atomit_t
There is a problem that waitqueue_active() may check stale data thus miss
a wakeup of threads waiting on ip_unaligned_aio.

The valid value of ip_unaligned_aio is only 0 and 1 so we can change it to
be of type mutex thus the above prolem is avoid.  Another benifit is that
mutex which works as FIFO is fairer than wake_up_all().

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:53 -07:00
Zongxun Wang
181a9a043b ocfs2: fix null pointer dereference when access dlm_state before launching dlm thread
When mounting an ocfs2 volume, it will firstly generate a file
/sys/kernel/debug/o2dlm/<uuid>/dlm_state, and then launch the dlm thread.
So the following situation will cause a null pointer dereference.
dlm_debug_init -> access file dlm_state which will call dlm_state_print ->
dlm_launch_thread

Move dlm_debug_init after dlm_launch_thread and dlm_launch_recovery_thread
can fix this issue.

Signed-off-by: Zongxun Wang <wangzongxun@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:53 -07:00
Jan Kara
d507816b58 fanotify: move unrelated handling from copy_event_to_user()
Move code moving event structure to access_list from copy_event_to_user()
to fanotify_read() where it is more logical (so that we can immediately
see in the main loop that we either move the event to a different list
or free it).  Also move special error handling for permission events
from copy_event_to_user() to the main loop to have it in one place with
error handling for normal events.  This makes copy_event_to_user()
really only copy the event to user without any side effects.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:51 -07:00
Jan Kara
d8aaab4f61 fanotify: reorganize loop in fanotify_read()
Swap the error / "read ok" branches in the main loop of fanotify_read().
We will grow the "read ok" part in the next patch and this makes the
indentation easier.  Also it is more common to have error conditions
inside an 'if' instead of the fast path.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:51 -07:00
Jan Kara
9573f79355 fanotify: convert access_mutex to spinlock
access_mutex is used only to guard operations on access_list.  There's
no need for sleeping within this lock so just make a spinlock out of it.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:51 -07:00
Jan Kara
f083441ba8 fanotify: use fanotify event structure for permission response processing
Currently, fanotify creates new structure to track the fact that
permission event has been reported to userspace and someone is waiting
for a response to it.  As event structures are now completely in the
hands of each notification framework, we can use the event structure for
this tracking instead of allocating a new structure.

Since this makes the event structures for normal events and permission
events even more different and the structures have different lifetime
rules, we split them into two separate structures (where permission
event structure contains the structure for a normal event).  This makes
normal events 8 bytes smaller and the code a tad bit cleaner.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:51 -07:00
Jan Kara
3298cf37be fanotify: remove useless bypass_perm check
The prepare_for_access_response() function checks whether
group->fanotify_data.bypass_perm is set.  However this test can never be
true because prepare_for_access_response() is called only from
fanotify_read() which means fanotify group is alive with an active fd
while bypass_perm is set from fanotify_release() when all file
descriptors pointing to the group are closed and the group is going
away.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:51 -07:00
Fabian Frederick
ddae82d8e6 fs/freevxfs/vxfs_lookup.c: update function comment
nameidata was replaced by flags in commit 00cd8dd3bf ("stop passing
nameidata to ->lookup()").

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:50 -07:00
Fabian Frederick
9ee108b2c6 fs/cifs/cifsfs.c: add __init to cifs_init_inodecache()
cifs_init_inodecache is only called by __init init_cifs.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:50 -07:00
Jan Kara
5acda9d12d bdi: avoid oops on device removal
After commit 839a8e8660 ("writeback: replace custom worker pool
implementation with unbound workqueue") when device is removed while we
are writing to it we crash in bdi_writeback_workfn() ->
set_worker_desc() because bdi->dev is NULL.

This can happen because even though bdi_unregister() cancels all pending
flushing work, nothing really prevents new ones from being queued from
balance_dirty_pages() or other places.

Fix the problem by clearing BDI_registered bit in bdi_unregister() and
checking it before scheduling of any flushing work.

Fixes: 839a8e8660

Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Derek Basehore <dbasehore@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:49 -07:00
Derek Basehore
6ca738d60c backing_dev: fix hung task on sync
bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
schedule work to writeback dirty inodes.  The problem with this is that
it can delay work that is scheduled for immediate execution, such as the
work from sync_inodes_sb().  This can happen since mod_delayed_work()
can now steal work from a work_queue.  This fixes the problem by using
queue_delayed_work() instead.  This is a regression caused by commit
839a8e8660 ("writeback: replace custom worker pool implementation with
unbound workqueue").

The reason that this causes a problem is that laptop-mode will change
the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
In the case that bdi_wakeup_thread_delayed() races with
sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
task.  Even if dirty_writeback_centisecs is not long enough to cause a
hung task, we still don't want to delay sync for that long.

We fix the problem by using queue_delayed_work() when we want to
schedule writeback sometime in future.  This function doesn't change the
timer if it is already armed.

For the same reason, we also change bdi_writeback_workfn() to
immediately queue the work again in the case that the work_list is not
empty.  The same problem can happen if the sync work is run on the
rescue worker.

[jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zento.linux.org.uk>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Derek Basehore <dbasehore@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Benson Leung <bleung@chromium.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Cc: Luigi Semenzato <semenzato@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:49 -07:00
Dave Chinner
a6cf33bc56 Merge branch 'xfs-bug-fixes-for-3.15-3' into for-next 2014-04-04 08:07:35 +11:00
Mark Tinguely
c88547a811 xfs: fix directory hash ordering bug
Commit f5ea1100 ("xfs: add CRCs to dir2/da node blocks") introduced
in 3.10 incorrectly converted the btree hash index array pointer in
xfs_da3_fixhashpath(). It resulted in the the current hash always
being compared against the first entry in the btree rather than the
current block index into the btree block's hash entry array. As a
result, it was comparing the wrong hashes, and so could misorder the
entries in the btree.

For most cases, this doesn't cause any problems as it requires hash
collisions to expose the ordering problem. However, when there are
hash collisions within a directory there is a very good probability
that the entries will be ordered incorrectly and that actually
matters when duplicate hashes are placed into or removed from the
btree block hash entry array.

This bug results in an on-disk directory corruption and that results
in directory verifier functions throwing corruption warnings into
the logs. While no data or directory entries are lost, access to
them may be compromised, and attempts to remove entries from a
directory that has suffered from this corruption may result in a
filesystem shutdown.  xfs_repair will fix the directory hash
ordering without data loss occuring.

[dchinner: wrote useful a commit message]

cc: <stable@vger.kernel.org>
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-04 07:10:49 +11:00
Linus Torvalds
32d01dc7be Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "A lot updates for cgroup:

   - The biggest one is cgroup's conversion to kernfs.  cgroup took
     after the long abandoned vfs-entangled sysfs implementation and
     made it even more convoluted over time.  cgroup's internal objects
     were fused with vfs objects which also brought in vfs locking and
     object lifetime rules.  Naturally, there are places where vfs rules
     don't fit and nasty hacks, such as credential switching or lock
     dance interleaving inode mutex and cgroup_mutex with object serial
     number comparison thrown in to decide whether the operation is
     actually necessary, needed to be employed.

     After conversion to kernfs, internal object lifetime and locking
     rules are mostly isolated from vfs interactions allowing shedding
     of several nasty hacks and overall simplification.  This will also
     allow implmentation of operations which may affect multiple cgroups
     which weren't possible before as it would have required nesting
     i_mutexes.

   - Various simplifications including dropping of module support,
     easier cgroup name/path handling, simplified cgroup file type
     handling and task_cg_lists optimization.

   - Prepatory changes for the planned unified hierarchy, which is still
     a patchset away from being actually operational.  The dummy
     hierarchy is updated to serve as the default unified hierarchy.
     Controllers which aren't claimed by other hierarchies are
     associated with it, which BTW was what the dummy hierarchy was for
     anyway.

   - Various fixes from Li and others.  This pull request includes some
     patches to add missing slab.h to various subsystems.  This was
     triggered xattr.h include removal from cgroup.h.  cgroup.h
     indirectly got included a lot of files which brought in xattr.h
     which brought in slab.h.

  There are several merge commits - one to pull in kernfs updates
  necessary for converting cgroup (already in upstream through
  driver-core), others for interfering changes in the fixes branch"

* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
  cgroup: remove useless argument from cgroup_exit()
  cgroup: fix spurious lockdep warning in cgroup_exit()
  cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
  cgroup: break kernfs active_ref protection in cgroup directory operations
  cgroup: fix cgroup_taskset walking order
  cgroup: implement CFTYPE_ONLY_ON_DFL
  cgroup: make cgrp_dfl_root mountable
  cgroup: drop const from @buffer of cftype->write_string()
  cgroup: rename cgroup_dummy_root and related names
  cgroup: move ->subsys_mask from cgroupfs_root to cgroup
  cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
  cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
  cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
  cgroup: reorganize cgroup bootstrapping
  cgroup: relocate setting of CGRP_DEAD
  cpuset: use rcu_read_lock() to protect task_cs()
  cgroup_freezer: document freezer_fork() subtleties
  cgroup: update cgroup_transfer_tasks() to either succeed or fail
  cgroup: drop task_lock() protection around task->cgroups
  cgroup: update how a newly forked task gets associated with css_set
  ...
2014-04-03 13:05:42 -07:00
Dan Carpenter
805eeb8e04 xfs: extra semi-colon breaks a condition
There were some extra semi-colons here which mean that we return true
unintentionally.

Fixes: a49935f200 ('xfs: xfs_check_page_type buffer checks need help')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-04 06:56:30 +11:00
Rashika Kheria
0961f02a37 fs: Mark functions as static in exofs/ore_raid.c
Mark functions as static in exofs/ore_raid.c because they are not used
outside this file.

This also eliminates the following warning in exofs/ore_raid.c:
fs/exofs/ore_raid.c:24:14: warning: no previous prototype for _raid_page_alloc [-Wmissing-prototypes]
fs/exofs/ore_raid.c:29:6: warning: no previous prototype for _raid_page_free [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2014-04-03 11:40:55 +03:00
Rashika Kheria
af8d0cc0d2 fs: Mark function as static in exofs/super.c
Mark function as static in exofs/super.c because it is not used outside
this file.

This also eliminates the following warning in exofs/super.c:
 fs/exofs/super.c:546:5: warning: no previous prototype \
 for __alloc_dev_table[-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2014-04-03 11:36:42 +03:00
Yan, Zheng
c137a32a40 ceph: print inode number for LOOKUPINO request
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:54 +08:00
Yan, Zheng
19913b4eac ceph: add get_name() NFS export callback
Use the newly introduced LOOKUPNAME MDS request to connect child
inode to its parent directory.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:53 +08:00
Yan, Zheng
8996f4f23d ceph: fix ceph_fh_to_parent()
ceph_fh_to_parent() returns dentry that corresponds to the 'ino' field
of struct ceph_nfs_confh. This is wrong, it should return dentry that
corresponds to the 'parent_ino' field.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:53 +08:00
Yan, Zheng
9017c2ec78 ceph: add get_parent() NFS export callback
The callback uses LOOKUPPARENT MDS request to find parent.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:53 +08:00
Yan, Zheng
4f32b42dca ceph: simplify ceph_fh_to_dentry()
MDS handles LOOKUPHASH and LOOKUPINO MDS requests in the same way.
So __cfh_to_dentry() is redundant.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:53 +08:00
Yunchuan Wen
f1fc4fee3b ceph: fscache: Wait for completion of object initialization
The object store limit needs to be updated after writing,
and this can be done provided the corresponding object has already
been initialized. Current object initialization is done asynchrously,
which introduce a race if a file is opened, then immediately followed
by a writing, the initialization may have not completed, the code will
reach the ASSERT in fscache_submit_exclusive_op() to cause kernel
bug.

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
2014-04-03 10:33:53 +08:00
Yunchuan Wen
32d3e148dd ceph: fscache: Update object store limit after file writing
Synchronize object->store_limit[_l] with new inode->i_size after file writing.

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
2014-04-03 10:33:53 +08:00
Yunchuan Wen
020c4bddc0 ceph: fscache: add an interface to synchronize object store limit
Add an interface to explicitly synchronize object->store_limit[_l]
with inode->i_size

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
2014-04-03 10:33:53 +08:00
Sage Weil
4b58c9b19b ceph: do not set r_old_dentry_dir on link()
This is racy--we do not know whather d_parent has changed out from
underneath us because i_mutex is not held on the source inode's directory.

Also, taking this reference is useless.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:53 +08:00
Sage Weil
844d87c332 ceph: do not assume r_old_dentry[_dir] always set together
Do not assume that r_old_dentry implies that r_old_dentry_dir is also
true.  Separate out the ref cleanup and make the debugs dump behave when
it is NULL.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:53 +08:00
Sage Weil
752c8bdcfe ceph: do not chain inode updates to parent fsync
The fsync(dirfd) only covers namespace operations, not inode updates.
We do not need to cover setattr variants or O_TRUNC.

Reported-by: Al Viro <viro@xeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:52 +08:00
Sage Weil
180061a58c ceph: avoid useless ceph_get_dentry_parent_inode() in ceph_rename()
This is just old_dir; no reason to abuse the dcache pointers.

Reported-by: Al Viro <viro.zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:52 +08:00
Yan, Zheng
15289dc85b ceph: let MDS adjust readdir 'frag'
If readdir 'frag' is adjusted, readdir 'offset' should be reset.
Otherwise some dentries may be lost when readdir and fragmenting
directory happen at the some.

Another way to fix this issue is let MDS adjust readdir 'frag'.
The code that handles MDS reply reset the readdir 'offset' if
the readdir reply is different than the requested one.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:52 +08:00
Yan, Zheng
dcd3cc05e5 ceph: fix reset_readdir()
When changing readdir postion, fi->next_offset should be set to 0
if the new postion is not in the first dirfrag.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03 10:33:52 +08:00
Yan, Zheng
f049420607 ceph: fix ceph_dir_llseek()
Comparing offset with inode->i_sb->s_maxbytes doesn't make sense for
directory. For a fragmented directory, offset (frag_t, off) can be
larger than inode->i_sb->s_maxbytes.

At the very beginning of ceph_dir_llseek(), local variable old_offset
is initialized to parameter offset. This doesn't make sense neither.
Old_offset should be ceph_make_fpos(fi->frag, fi->next_offset).

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03 10:33:52 +08:00
Linus Torvalds
159d8133d0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual rocket science -- mostly documentation and comment updates"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  sparse: fix comment
  doc: fix double words
  isdn: capi: fix "CAPI_VERSION" comment
  doc: DocBook: Fix typos in xml and template file
  Bluetooth: add module name for btwilink
  driver core: unexport static function create_syslog_header
  mmc: core: typo fix in printk specifier
  ARM: spear: clean up editing mistake
  net-sysfs: fix comment typo 'CONFIG_SYFS'
  doc: Insert MODULE_ in module-signing macros
  Documentation: update URL to hfsplus Technote 1150
  gpio: update path to documentation
  ixgbe: Fix format string in ixgbe_fcoe.
  Kconfig: Remove useless "default N" lines
  user_namespace.c: Remove duplicated word in comment
  CREDITS: fix formatting
  treewide: Fix typo in Documentation/DocBook
  mm: Fix warning on make htmldocs caused by slab.c
  ata: ata-samsung_cf: cleanup in header file
  idr: remove unused prototype of idr_free()
2014-04-02 16:23:38 -07:00
Jeff Mahoney
01d8885785 reiserfs: fix race in readdir
jdm-20004 reiserfs_delete_xattrs: Couldn't delete all xattrs (-2)

The -ENOENT is due to readdir calling dir_emit on the same entry twice.

If the dir_emit callback sleeps and the tree is changed underneath us,
we won't be able to trust deh_offset(deh) anymore. We need to save
next_pos before we might sleep so we can find the next entry.

CC: stable@vger.kernel.org
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-04-03 00:06:14 +02:00
Linus Torvalds
b9f2b21a32 Devicetree changes for v3.15
Updates to devicetree core code. This branch contains the following notable changes:
 * Add reserved memory binding
 * Make struct device_node a kobject and remove legacy /proc/device-tree
 * ePAPR conformance fixes
 * Update in-kernel DTC copy to version v1.4.0
 * Preparation changes for dynamic device tree overlays
 * minor bug fixes and documentation changes
 
 The most significant change in this branch is the conversion of struct
 device_node to be a kobject that is exposed via sysfs and removal of the
 old /proc/device-tree code. This simplifies the device tree handling
 code and tightens up the lifecycle on device tree nodes.
 
 [updated: added fix for dangling select PROC_DEVICETREE]
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJTOyNwAAoJEMWQL496c2LNZY0QAIreUrpo3/hKRau61EDPXkOA
 UFRyPUHD0k/dNXWWDbTfvKH/nAfzdVwejhePqEWiODiFOFkq7JyQlMKPA+CZuZj0
 ygN4215A1yj/hDf6JRD5Zn4WGpawDt9InlbZSps6P5dd8voV5t5dz6uzz+Y7uqaK
 CAjTDlBSmxEen5vRHiHQgKv74au/+b9yfSURjPQVWg46+wl3WJwjsdzerphm4unW
 tpEr8zkIsm51mqqAx4penIuiovh7+L2J5v4BFeg8o+kaZEuZpVxLHJPOuBd5hdom
 zeqEIj3AqHTh5suYIHe4aAbZ2wMP3kYGgkPGwfWLnwLyULxalcCtGZeaCi9nwTFj
 Fdj+7f17ocrt5mif0f5Deufi1LqJsDjhY6G9p7HuV7Y9hsMILpJIUoGENPji+TWj
 BA4L45eaPmNYdKJytEtFD7F2WnXeHZ6fDtYho/39DWW+Bt16IFX85T199irhxGG4
 byN6LRaahk2UeycSXkQHAlWOQHqzBcJJAkQLN2iahzyYRr9Dy+VI2E9clm53m49O
 YQYcONdUlMYrtfRwJpbB9XHM0HgZUvg0LT5z/iHQs9uJtoo33Oj+zxFixyZLQ9Dq
 qyLqQWEpV9gFLAo9tpf56gffkLiJRsHkX4UJ6oTtj4DY1WWU9H81jjCvv/7flzp/
 8ZyyZzANQf1DZ9kqO2v+
 =lyA5
 -----END PGP SIGNATURE-----

Merge tag 'dt-for-linus' of git://git.secretlab.ca/git/linux

Pull devicetree changes from Grant Likely:
 "Updates to devicetree core code.  This branch contains the following
  notable changes:

   - add reserved memory binding
   - make struct device_node a kobject and remove legacy
     /proc/device-tree
   - ePAPR conformance fixes
   - update in-kernel DTC copy to version v1.4.0
   - preparatory changes for dynamic device tree overlays
   - minor bug fixes and documentation changes

  The most significant change in this branch is the conversion of struct
  device_node to be a kobject that is exposed via sysfs and removal of
  the old /proc/device-tree code.  This simplifies the device tree
  handling code and tightens up the lifecycle on device tree nodes.

  [updated: added fix for dangling select PROC_DEVICETREE]"

* tag 'dt-for-linus' of git://git.secretlab.ca/git/linux: (29 commits)
  dt: Remove dangling "select PROC_DEVICETREE"
  of: Add support for ePAPR "stdout-path" property
  of: device_node kobject lifecycle fixes
  of: only scan for reserved mem when fdt present
  powerpc: add support for reserved memory defined by device tree
  arm64: add support for reserved memory defined by device tree
  of: add missing major vendors
  of: add vendor prefix for SMSC
  of: remove /proc/device-tree
  of/selftest: Add self tests for manipulation of properties
  of: Make device nodes kobjects so they show up in sysfs
  arm: add support for reserved memory defined by device tree
  drivers: of: add support for custom reserved memory drivers
  drivers: of: add initialization code for dynamic reserved memory
  drivers: of: add initialization code for static reserved memory
  of: document bindings for reserved-memory nodes
  Revert "of: fix of_update_property()"
  kbuild: dtbs_install: new make target
  ARM: mvebu: Allows to get the SoC ID even without PCI enabled
  of: Allows to use the PCI translator without the PCI core
  ...
2014-04-02 14:27:15 -07:00
Linus Torvalds
7125764c5d Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull compat time conversion changes from Peter Anvin:
 "Despite the branch name this is really neither an x86 nor an
  x32-specific patchset, although it the implementation of the
  discussions that followed the x32 security hole a few months ago.

  This removes get/put_compat_timespec/val() and replaces them with
  compat_get/put_timespec/val() which are savvy as to the current status
  of COMPAT_USE_64BIT_TIME.

  It removes several unused and/or incorrect/misleading functions (like
  compat_put_timeval_convert which doesn't in fact do any conversion)
  and also replaces several open-coded implementations what is now
  called compat_convert_timespec() with that function"

* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  compat: Fix sparse address space warnings
  compat: Get rid of (get|put)_compat_time(val|spec)
2014-04-02 12:51:41 -07:00
Rajat Jain
f3846266f5 fuse: fix "uninitialized variable" warning
Fix the following warning:

In file included from include/linux/fs.h:16:0,
                 from fs/fuse/fuse_i.h:13,
                 from fs/fuse/file.c:9:
fs/fuse/file.c: In function 'fuse_file_poll':
include/linux/rbtree.h:82:28: warning: 'parent' may be used
uninitialized in this function [-Wmaybe-uninitialized]
fs/fuse/file.c:2592:27: note: 'parent' was declared here

Signed-off-by: Rajat Jain <rajatxjain@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:51 +02:00
Pavel Emelyanov
4d99ff8f12 fuse: Turn writeback cache on
Introduce a bit kernel and userspace exchange between each-other on
the init stage and turn writeback on if the userspace want this and
mount option 'allow_wbcache' is present (controlled by fusermount).

Also add each writable file into per-inode write list and call the
generic_file_aio_write to make use of the Linux page cache engine.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:50 +02:00
Pavel Emelyanov
ea8cd33390 fuse: Fix O_DIRECT operations vs cached writeback misorder
The problem is:

1. write cached data to a file
2. read directly from the same file (via another fd)

The 2nd operation may read stale data, i.e. the one that was in a file
before the 1st op. Problem is in how fuse manages writeback.

When direct op occurs the core kernel code calls filemap_write_and_wait
to flush all the cached ops in flight. But fuse acks the writeback right
after the ->writepages callback exits w/o waiting for the real write to
happen. Thus the subsequent direct op proceeds while the real writeback
is still in flight. This is a problem for backends that reorder operation.

Fix this by making the fuse direct IO callback explicitly wait on the
in-flight writeback to finish.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:50 +02:00
Maxim Patlasov
fe38d7df23 fuse: fuse_flush() should wait on writeback
The aim of .flush fop is to hint file-system that flushing its state or caches
or any other important data to reliable storage would be desirable now.
fuse_flush() passes this hint by sending FUSE_FLUSH request to userspace.
However, dirty pages and pages under writeback may be not visible to userspace
yet if we won't ensure it explicitly.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:50 +02:00
Pavel Emelyanov
6b12c1b37e fuse: Implement write_begin/write_end callbacks
The .write_begin and .write_end are requiered to use generic routines
(generic_file_aio_write --> ... --> generic_perform_write) for buffered
writes.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:49 +02:00
Maxim Patlasov
482fce55d2 fuse: restructure fuse_readpage()
Move the code filling and sending read request to a separate function. Future
patches will use it for .write_begin -- partial modification of a page
requires reading the page from the storage very similarly to what fuse_readpage
does.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:49 +02:00
Pavel Emelyanov
e7cc133c37 fuse: Flush files on wb close
Any write request requires a file handle to report to the userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding dirty pages.

filemap_write_and_wait() is enough because every page under fuse writeback
is accounted in ff->count. This delays actual close until all fuse wb is
completed.

In case of "write cache" turned off, the flush is ensured by fuse_vma_close().

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:49 +02:00
Maxim Patlasov
b0aa760652 fuse: Trust kernel i_mtime only
Let the kernel maintain i_mtime locally:
 - clear S_NOCMTIME
 - implement i_op->update_time()
 - flush mtime on fsync and last close
 - update i_mtime explicitly on truncate and fallocate

Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime
should be flushed to the server eventually.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:48 +02:00
Pavel Emelyanov
8373200b12 fuse: Trust kernel i_size only
Make fuse think that when writeback is on the inode's i_size is always
up-to-date and not update it with the value received from the userspace.
This is done because the page cache code may update i_size without letting
the FS know.

This assumption implies fixing the previously introduced short-read helper --
when a short read occurs the 'hole' is filled with zeroes.

fuse_file_fallocate() is also fixed because now we should keep i_size up to
date, so it must be updated if FUSE_FALLOCATE request succeeded.

Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:48 +02:00
Pavel Emelyanov
d5cd66c58e fuse: Connection bit for enabling writeback
Off (0) by default. Will be used in the next patches and will be turned
on at the very end.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:48 +02:00
Pavel Emelyanov
a92adc824e fuse: Prepare to handle short reads
A helper which gets called when read reports less bytes than was requested.
See patch "trust kernel i_size only" for details.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:47 +02:00
Pavel Emelyanov
650b22b941 fuse: Linking file to inode helper
When writeback is ON every writeable file should be in per-inode write list,
not only mmap-ed ones. Thus introduce a helper for this linkage.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:47 +02:00
Al Viro
58bfab395b ocfs2_file_aio_write(): switch to generic_perform_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:37 -04:00
Al Viro
aec605f429 ceph_aio_write(): switch to generic_perform_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:37 -04:00
Al Viro
0a64bc2c04 xfs_file_buffered_aio_write(): switch to generic_perform_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:36 -04:00
Al Viro
5cb6c6c7eb generic_file_direct_write(): get rid of ppos argument
always equal to &iocb->ki_pos.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:35 -04:00
Al Viro
867c4f9329 btrfs_file_aio_write(): get rid of ppos
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:35 -04:00
Al Viro
fcacafd269 kill the 5th argument of generic_file_buffered_write()
same story - it's &iocb->ki_pos in all cases

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:34 -04:00
Al Viro
41fc56d573 kill the 4th argument of __generic_file_aio_write()
It's always equal to &iocb->ki_pos, where iocb is the value of the 1st
argument.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:34 -04:00
Al Viro
920220c111 ocfs2: don't open-code kernel_recvmsg()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:32 -04:00
Al Viro
86d564c84c constify blk_rq_map_user_iov() and friends
sg_iovec array passed to it can be const

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:31 -04:00
Al Viro
66f5dcef13 ocfs2: don't open-code kernel_sendmsg()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:30 -04:00
Al Viro
ec69557982 read_code(): go through vfs_read() instead of calling the method directly
... and don't skip on sanity checks.  It's *not* a hot path, TYVM
(a couple of calls per a.out execve(), for pity sake) and headers of
random a.out binary are not to be trusted.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:24 -04:00
Al Viro
0165e8100b fold cifs_iovec_read() into its (only) caller
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:24 -04:00
Al Viro
7f25bba819 cifs_iovec_read: keep iov_iter between the calls of cifs_readdata_to_iov()
... we are doing them on adjacent parts of file, so what happens is that
each subsequent call works to rebuild the iov_iter to exact state it
had been abandoned in by previous one.  Just keep it through the entire
cifs_iovec_read().  And use copy_page_to_iter() instead of doing
kmap/copy_to_user/kunmap manually...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:23 -04:00
Al Viro
6130f5315e switch vmsplice_to_user() to copy_page_to_iter()
I've switched the sanity checks on iovec to rw_copy_check_uvector();
we might need to do a local analog, if any behaviour differences are
not actually bugfixes here...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:23 -04:00
Al Viro
637b58c288 switch pipe_read() to copy_page_to_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:22 -04:00
Al Viro
74027f4a18 cifs_iovec_read(): resubmit shouldn't restart the loop
... by that point the request we'd just resent is in the
head of the list anyway.  Just return to the beginning of
the loop body...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:22 -04:00
Al Viro
9e8c2af96e callers of iov_copy_from_user_atomic() don't need pagecache_disable()
... it does that itself (via kmap_atomic())

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:20 -04:00
Al Viro
c186afb4db switch ->is_partially_uptodate() to saner arguments
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:19 -04:00
Al Viro
fbb32750a6 pipe: kill ->map() and ->unmap()
all pipe_buffer_operations have the same instances of those...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:19 -04:00
Al Viro
58bda1da4b fuse/dev: use atomic maps
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:18 -04:00
David Howells
8ffcb32e05 VFS: Make delayed_free() call free_vfsmnt()
Make delayed_free() call free_vfsmnt() so that we don't have two functions
doing the same job.  This requires the calls to mnt_free_id() in free_vfsmnt()
to be moved into the callers of that function.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:18 -04:00
Al Viro
81c5a68478 cifs: ->rename() without ->lookup() makes no sense
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:17 -04:00
Al Viro
627bf81ac6 get rid of pointless checks for NULL ->i_op
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:16 -04:00
Al Viro
05faf3169f ntfs: don't put NULL into ->i_op/->i_fop
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:16 -04:00
Al Viro
5d826c847b new helper: readlink_copy()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:15 -04:00
Al Viro
7f4b36f9bb get rid of files_defer_init()
the only thing it's doing these days is calculation of
upper limit for fs.nr_open sysctl and that can be done
statically

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:14 -04:00
Al Viro
4d35950734 namei.c: move EXPORT_SYMBOL to corresponding definitions
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:14 -04:00
Al Viro
0018d8bfc4 get_write_access() is inlined, exporting it is pointless
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:13 -04:00
Al Viro
3f4d5a0007 tidy do_dentry_open() up a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:13 -04:00
Al Viro
83f936c75e mark struct file that had write access grabbed by open()
new flag in ->f_mode - FMODE_WRITER.  Set by do_dentry_open() in case
when it has grabbed write access, checked by __fput() to decide whether
it wants to drop the sucker.  Allows to stop bothering with mnt_clone_write()
in alloc_file(), along with fewer special_file() checks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:12 -04:00
Al Viro
0ccb286346 fold __get_file_write_access() into its only caller
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:12 -04:00
Al Viro
4597e695b8 get rid of DEBUG_WRITECOUNT
it only makes control flow in __fput() and friends more convoluted.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:12 -04:00
Al Viro
dd20908a8a don't bother with {get,put}_write_access() on non-regular files
it's pointless and actually leads to wrong behaviour in at least one
moderately convoluted case (pipe(), close one end, try to get to
another via /proc/*/fd and run into ETXTBUSY).

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:11 -04:00
Al Viro
44ba8406d0 ncpfs: switch to sockfd_lookup()/sockfd_put()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:11 -04:00
Al Viro
c7999c3627 reduce m_start() cost...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:09 -04:00
Al Viro
f2ebb3a921 smarter propagate_mnt()
The current mainline has copies propagated to *all* nodes, then
tears down the copies we made for nodes that do not contain
counterparts of the desired mountpoint.  That sets the right
propagation graph for the copies (at teardown time we move
the slaves of removed node to a surviving peer or directly
to master), but we end up paying a fairly steep price in
useless allocations.  It's fairly easy to create a situation
where N calls of mount(2) create exactly N bindings, with
O(N^2) vfsmounts allocated and freed in process.

Fortunately, it is possible to avoid those allocations/freeings.
The trick is to create copies in the right order and find which
one would've eventually become a master with the current algorithm.
It turns out to be possible in O(nodes getting propagation) time
and with no extra allocations at all.

One part is that we need to make sure that eventual master will be
created before its slaves, so we need to walk the propagation
tree in a different order - by peer groups.  And iterate through
the peers before dealing with the next group.

Another thing is finding the (earlier) copy that will be a master
of one we are about to create; to do that we are (temporary) marking
the masters of mountpoints we are attaching the copies to.

Either we are in a peer of the last mountpoint we'd dealt with,
or we have the following situation: we are attaching to mountpoint M,
the last copy S_0 had been attached to M_0 and there are sequences
S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
S_{i} mounted on M{i} and we need to create a slave of the first S_{k}
such that M is getting propagation from M_{k}.  It means that the master
of M_{k} will be among the sequence of masters of M.  On the
other hand, the nearest marked node in that sequence will either
be the master of M_{k} or the master of M_{k-1} (the latter -
in the case if M_{k-1} is a slave of something M gets propagation
from, but in a wrong peer group).

So we go through the sequence of masters of M until we find
a marked one (P).  Let N be the one before it.  Then we go through
the sequence of masters of S_0 until we find one (say, S) mounted
on a node D that has P as master and check if D is a peer of N.
If it is, S will be the master of new copy, if not - the master of S
will be.

That's it for the hard part; the rest is fairly simple.  Iterator
is in next_group(), handling of one prospective mountpoint is
propagate_one().

It seems to survive all tests and gives a noticably better performance
than the current mainline for setups that are seriously using shared
subtrees.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:08 -04:00
Linus Torvalds
7a48837732 Merge branch 'for-3.15/core' of git://git.kernel.dk/linux-block
Pull core block layer updates from Jens Axboe:
 "This is the pull request for the core block IO bits for the 3.15
  kernel.  It's a smaller round this time, it contains:

   - Various little blk-mq fixes and additions from Christoph and
     myself.

   - Cleanup of the IPI usage from the block layer, and associated
     helper code.  From Frederic Weisbecker and Jan Kara.

   - Duplicate code cleanup in bio-integrity from Gu Zheng.  This will
     give you a merge conflict, but that should be easy to resolve.

   - blk-mq notify spinlock fix for RT from Mike Galbraith.

   - A blktrace partial accounting bug fix from Roman Pen.

   - Missing REQ_SYNC detection fix for blk-mq from Shaohua Li"

* 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits)
  blk-mq: add REQ_SYNC early
  rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
  blk-mq: support partial I/O completions
  blk-mq: merge blk_mq_insert_request and blk_mq_run_request
  blk-mq: remove blk_mq_alloc_rq
  blk-mq: don't dump CPU -> hw queue map on driver load
  blk-mq: fix wrong usage of hctx->state vs hctx->flags
  blk-mq: allow blk_mq_init_commands() to return failure
  block: remove old blk_iopoll_enabled variable
  blktrace: fix accounting of partially completed requests
  smp: Rename __smp_call_function_single() to smp_call_function_single_async()
  smp: Remove wait argument from __smp_call_function_single()
  watchdog: Simplify a little the IPI call
  smp: Move __smp_call_function_single() below its safe version
  smp: Consolidate the various smp_call_function_single() declensions
  smp: Teach __smp_call_function_single() to check for offline cpus
  smp: Remove unused list_head from csd
  smp: Iterate functions through llist_for_each_entry_safe()
  block: Stop abusing rq->csd.list in blk-softirq
  block: Remove useless IPI struct initialization
  ...
2014-04-01 19:19:15 -07:00
Jaegeuk Kim
ce23447fe5 f2fs: fix to cover io->bio with io_rwsem
In the f2fs_wait_on_page_writeback, io->bio should be covered by io_rwsem.
Otherwise, the bio pointer can become a dangling pointer due to data races.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-02 09:56:27 +09:00
Chao Yu
d54c795b49 f2fs: fix error path when fail to read inline data
We should unlock page in ->readpage() path and also should unlock & release page
in error path of ->write_begin() to avoid deadlock or memory leak.
So let's add release code to fix the problem when we fail to read inline data.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-02 09:56:27 +09:00
Chao Yu
2d7b822ad9 f2fs: use list_for_each_entry{_safe} for simplyfying code
This patch use list_for_each_entry{_safe} instead of list_for_each{_safe} for
simplfying code.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-02 09:56:27 +09:00
Chao Yu
cf0ee0f09b f2fs: avoid free slab cache under spinlock
Move kmem_cache_free out of spinlock protection region for better performance.

Change log from v1:
 o remove spinlock protection for kmem_cache_free in destroy_node_manager
suggested by Jaegeuk Kim.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-02 09:56:12 +09:00
Eric Whitney
ad6599ab3a ext4: fix premature freeing of partial clusters split across leaf blocks
Xfstests generic/311 and shared/298 fail when run on a bigalloc file
system.  Kernel error messages produced during the tests report that
blocks to be freed are already on the to-be-freed list.  When e2fsck
is run at the end of the tests, it typically reports bad i_blocks and
bad free blocks counts.

The bug that causes these failures is located in ext4_ext_rm_leaf().
Code at the end of the function frees a partial cluster if it's not
shared with an extent remaining in the leaf.  However, if all the
extents in the leaf have been removed, the code dereferences an
invalid extent pointer (off the front of the leaf) when the check for
sharing is made.  This generally has the effect of unconditionally
freeing the partial cluster, which leads to the observed failures
when the partial cluster is shared with the last extent in the next
leaf.

Fix this by attempting to free the cluster only if extents remain in
the leaf.  Any remaining partial cluster will be freed if possible
when the next leaf is processed or when leaf removal is complete.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-04-01 19:49:30 -04:00
Linus Torvalds
158e0d3621 Driver core / sysfs patches for 3.15-rc1
Here's the big driver core / sysfs update for 3.15-rc1.
 
 Lots of kernfs updates to make it useful for other subsystems, and a few
 other tiny driver core patches.
 
 All have been in linux-next for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iEYEABECAAYFAlM7A0wACgkQMUfUDdst+ynJNACfZlY+KNKIhNFt1OOW8rQfSZzy
 1PYAnjYuOoly01JlPrpJD5b4TdxaAq71
 =GVUg
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core and sysfs updates from Greg KH:
 "Here's the big driver core / sysfs update for 3.15-rc1.

  Lots of kernfs updates to make it useful for other subsystems, and a
  few other tiny driver core patches.

  All have been in linux-next for a while"

* tag 'driver-core-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (42 commits)
  Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()"
  kernfs: cache atomic_write_len in kernfs_open_file
  numa: fix NULL pointer access and memory leak in unregister_one_node()
  Revert "driver core: synchronize device shutdown"
  kernfs: fix off by one error.
  kernfs: remove duplicate dir.c at the top dir
  x86: align x86 arch with generic CPU modalias handling
  cpu: add generic support for CPU feature based module autoloading
  sysfs: create bin_attributes under the requested group
  driver core: unexport static function create_syslog_header
  firmware: use power efficient workqueue for unloading and aborting fw load
  firmware: give a protection when map page failed
  firmware: google memconsole driver fixes
  firmware: fix google/gsmi duplicate efivars_sysfs_init()
  drivers/base: delete non-required instances of include <linux/init.h>
  kernfs: fix kernfs_node_from_dentry()
  ACPI / platform: drop redundant ACPI_HANDLE check
  kernfs: fix hash calculation in kernfs_rename_ns()
  kernfs: add CONFIG_KERNFS
  sysfs, kobject: add sysfs wrapper for kernfs_enable_ns()
  ...
2014-04-01 16:28:19 -07:00
Jakub Sitnicki
3e0d8a0104 ext2: acl: remove unneeded include of linux/capability.h
acl.c has not been (directly) using the interface defined by
linux/capability.h header since commit 3bd858ab1c
("Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid
check"). Remove it.

Signed-off-by: Jakub Sitnicki <jsitnicki@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-04-01 23:36:43 +02:00
Linus Torvalds
1ead658124 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer changes from Thomas Gleixner:
 "This assorted collection provides:

   - A new timer based timer broadcast feature for systems which do not
     provide a global accessible timer device.  That allows those
     systems to put CPUs into deep idle states where the per cpu timer
     device stops.

   - A few NOHZ_FULL related improvements to the timer wheel

   - The usual updates to timer devices found in ARM SoCs

   - Small improvements and updates all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  tick: Remove code duplication in tick_handle_periodic()
  tick: Fix spelling mistake in tick_handle_periodic()
  x86: hpet: Use proper destructor for delayed work
  workqueue: Provide destroy_delayed_work_on_stack()
  clocksource: CMT, MTU2, TMU and STI should depend on GENERIC_CLOCKEVENTS
  timer: Remove code redundancy while calling get_nohz_timer_target()
  hrtimer: Rearrange comments in the order struct members are declared
  timer: Use variable head instead of &work_list in __run_timers()
  clocksource: exynos_mct: silence a static checker warning
  arm: zynq: Add support for cpufreq
  arm: zynq: Don't use arm_global_timer with cpufreq
  clocksource/cadence_ttc: Overhaul clocksource frequency adjustment
  clocksource/cadence_ttc: Call clockevents_update_freq() with IRQs enabled
  clocksource: Add Kconfig entries for CMT, MTU2, TMU and STI
  sh: Remove Kconfig entries for TMU, CMT and MTU2
  ARM: shmobile: Remove CMT, TMU and STI Kconfig entries
  clocksource: armada-370-xp: Use atomic access for shared registers
  clocksource: orion: Use atomic access for shared registers
  clocksource: timer-keystone: Delete unnecessary variable
  clocksource: timer-keystone: introduce clocksource driver for Keystone
  ...
2014-04-01 11:00:07 -07:00
Linus Torvalds
a21e40877a Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Ingo Molnar:
 "The main purpose is to fix a full dynticks bug related to
  virtualization, where steal time accounting appears to be zero in
  /proc/stat even after a few seconds of competing guests running busy
  loops in a same host CPU.  It's not a regression though as it was
  there since the beginning.

  The other commits are preparatory work to fix the bug and various
  cleanups"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  arch: Remove stub cputime.h headers
  sched: Remove needless round trip nsecs <-> tick conversion of steal time
  cputime: Fix jiffies based cputime assumption on steal accounting
  cputime: Bring cputime -> nsecs conversion
  cputime: Default implementation of nsecs -> cputime conversion
  cputime: Fix nsecs_to_cputime() return type cast
2014-04-01 10:16:10 -07:00
Miklos Szeredi
bd42998a6b ext4: add cross rename support
Implement RENAME_EXCHANGE flag in renameat2 syscall.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-01 17:08:44 +02:00
Miklos Szeredi
bd1af145b9 ext4: rename: split out helper functions
Cross rename (exchange source and dest) will need to call some of these
helpers for both source and dest, while overwriting rename currently only
calls them for one or the other.  This also makes the code easier to
follow.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-01 17:08:44 +02:00
Miklos Szeredi
0d7d5d678b ext4: rename: move EMLINK check up
Move checking i_nlink from after ext4_get_first_dir_block() to before.  The
check doesn't rely on the result of that function and the function only
fails on fs corruption, so the order shouldn't matter.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-01 17:08:44 +02:00
Miklos Szeredi
c0d268c366 ext4: rename: create ext4_renament structure for local vars
Need to split up ext4_rename() into helpers but there are too many local
variables involved, so create a new structure.  This also, apparently,
makes the generated code size slightly smaller.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-01 17:08:43 +02:00
Miklos Szeredi
da1ce0670c vfs: add cross-rename
If flags contain RENAME_EXCHANGE then exchange source and destination files.
There's no restriction on the type of the files; e.g. a directory can be
exchanged with a symlink.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01 17:08:43 +02:00
J. Bruce Fields
4fd699ae3f vfs: lock_two_nondirectories: allow directory args
lock_two_nondirectories warned if either of its args was a directory.
Instead just ignore the directory args.  This is needed for locking in
cross rename.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-01 17:08:43 +02:00
Miklos Szeredi
0b3974eb04 security: add flags to rename hooks
Add flags to security_path_rename() and security_inode_rename() hooks.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01 17:08:43 +02:00
Miklos Szeredi
0a7c3937a1 vfs: add RENAME_NOREPLACE flag
If this flag is specified and the target of the rename exists then the
rename syscall fails with EEXIST.

The VFS does the existence checking, so it is trivial to enable for most
local filesystems.  This patch only enables it in ext4.

For network filesystems the VFS check is not enough as there may be a race
between a remote create and the rename, so these filesystems need to handle
this flag in their ->rename() implementations to ensure atomicity.

Andy writes about why this is useful:

"The trivial answer: to eliminate the race condition from 'mv -i'.

Another answer: there's a common pattern to atomically create a file
with contents: open a temporary file, write to it, optionally fsync
it, close it, then link(2) it to the final name, then unlink the
temporary file.

The reason to use link(2) is because it won't silently clobber the destination.

This is annoying:
 - It requires an extra system call that shouldn't be necessary.
 - It doesn't work on (IMO sensible) filesystems that don't support
hard links (e.g. vfat).
 - It's not atomic -- there's an intermediate state where both files exist.
 - It's ugly.

The new rename flag will make this totally sensible.

To be fair, on new enough kernels, you can also use O_TMPFILE and
linkat to achieve the same thing even more cleanly."

Suggested-by: Andy Lutomirski <luto@amacapital.net> 
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01 17:08:43 +02:00
Miklos Szeredi
520c8b1650 vfs: add renameat2 syscall
Add new renameat2 syscall, which is the same as renameat with an added
flags argument.

Pass flags to vfs_rename() and to i_op->rename() as well.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01 17:08:42 +02:00
Miklos Szeredi
bc27027a73 vfs: rename: use common code for dir and non-dir
There's actually very little difference between vfs_rename_dir() and
vfs_rename_other() so move both inline into vfs_rename() which still stays
reasonably readable.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01 17:08:42 +02:00
Miklos Szeredi
de22a4c372 vfs: rename: move d_move() up
Move the d_move() in vfs_rename_dir() up, similarly to how it's done in
vfs_rename_other().  The next patch will consolidate these two functions
and this is the only structural difference between them.

I'm not sure if doing the d_move() after the dput is even valid.  But there
may be a logical explanation for that.  But moving the d_move() before the
dput() (and the mutex_unlock()) should definitely not hurt.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01 17:08:42 +02:00
Miklos Szeredi
44b1d53043 vfs: add d_is_dir()
Add d_is_dir(dentry) helper which is analogous to S_ISDIR().

To avoid confusion, rename d_is_directory() to d_can_lookup().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01 17:08:41 +02:00
Chao Yu
6e452d69d4 f2fs: avoid unneeded lookup when xattr name length is too long
In f2fs_setxattr we have limit this attribute name length, so we should also
check it in f2fs_getxattr to avoid useless lookup caused by invalid name length.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-01 18:54:24 +09:00
Chao Yu
df0f8dc0e1 f2fs: avoid unnecessary bio submit when wait page writeback
This patch introduce is_merged_page() to check whether current page is merged
in f2fs bio cache. When page is not in cache, we can avoid submitting bio cache,
resulting in having more chance to merge pages.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-01 18:53:41 +09:00
Jaegeuk Kim
3bb5e2c8fe f2fs: return -EIO when node id is not matched
During the cleaing of node segments, F2FS can get errored node blocks due to
data race between node page lock and its valid bitmap operations.
In that case, it needs to return an error to skip such the obsolete block copy.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-01 17:38:26 +09:00
Lukas Czerner
e5b30416f3 ext4: remove unneeded test of ret variable
Currently in ext4_fallocate() and ext4_zero_range() we're testing ret
variable along with new_size. However in ext4_fallocate() we just tested
ret before and in ext4_zero_range() if will always be zero when we get
there so there is no need to test it in both cases.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-01 00:59:21 -04:00
Linus Torvalds
9d919e8d5b Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "PREPARE_[DELAYED_]WORK() were used to change the work function of work
  items without fully reinitializing it; however, this makes workqueue
  consider the work item as a different one from before and allows the
  work item to start executing before the previous instance is finished
  which can lead to extremely subtle issues which are painful to debug.

  The interface has never been popular.  This pull request contains
  patches to remove existing usages and kill the interface.  As one of
  the changes was routed during the last devel cycle and another
  depended on a pending change in nvme, for-3.15 contains a couple merge
  commits.

  In addition, interfaces which were deprecated quite a while ago -
  __cancel_delayed_work() and WQ_NON_REENTRANT - are removed too"

* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: remove deprecated WQ_NON_REENTRANT
  workqueue: Spelling s/instensive/intensive/
  workqueue: remove PREPARE_[DELAYED_]WORK()
  staging/fwserial: don't use PREPARE_WORK
  afs: don't use PREPARE_WORK
  nvme: don't use PREPARE_WORK
  usb: don't use PREPARE_DELAYED_WORK
  floppy: don't use PREPARE_[DELAYED_]WORK
  ps3-vuart: don't use PREPARE_WORK
  wireless/rt2x00: don't use PREPARE_WORK in rt2800usb.c
  workqueue: Remove deprecated __cancel_delayed_work()
2014-03-31 15:08:51 -07:00
Linus Torvalds
1ce235faa8 - KGDB support for arm64
- PCI I/O space extended to 16M (in preparation of PCIe support patches)
 - Dropping ZONE_DMA32 in favour of ZONE_DMA (we only need one for the
   time being), together with swiotlb late initialisation to correctly
   setup the bounce buffer
 - DMA API cache maintenance support (not all ARMv8 platforms have
   hardware cache coherency)
 - Crypto extensions advertising via ELF_HWCAP2 for compat user space
 - Perf support for dwarf unwinding in compat mode
 - asm/tlb.h converted to the generic mmu_gather code
 - asm-generic rwsem implementation
 - Code clean-up
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.9 (GNU/Linux)
 
 iQIcBAABAgAGBQJTOaqsAAoJEGvWsS0AyF7xYNUP/3/IPySIB+/6pyUG6q7kvIpF
 Di93M+VdmnLEOKhhx/tjkiEmEQMp0hFPeOlQRWf/Ugg4ksulP6gRejdDEjIfkmsk
 LrRXLjvH79NDJbN0pTUXqGDvLLZ9Qnib+HEOuKABIYUrwhNKySBk+5omGfXFtwLR
 Mb5JxPX0kbBXOqbOX4RgANQoRlE8GxJR3V245zlGxA4klcN4IiaDy/99kj+kaeaa
 Cl8X9K2I550IZ2YUAWPOut2aee2qRFQtAhIDgVthTYlGRx7Y/rDLM16B8fFY/T0H
 7azIpSO5hk5lp8J3giJHYajlJlXNla5FeHQb8XAVnlyqFBmCUn0vvd2VbPvWREJp
 UD8t1vZZt/s2he6CVAQIfQghwLyzrpPa19KbnyI+3HtsZ+NS/puBJmcVKZ2PBY/L
 28BsRzB7BKAPEVhNmyPwFHNdZTvjaqYUCLhQ0uTp1sSHMcLeSs7+vyMR99f/0u9E
 doSYAeF41ZkxHXL5xEevdj4sFkCEY1XFxER1Y8VM1rqHTeGEoeYbdS/u9tEeBgit
 jBelvHAlNTBgbur2nW4E9fQpAF2CsvWnRq6lSmDRTkyjzcLUQqA8bsQJ3aUyJtZt
 j17kUIzSH1q7x3zAaWQcvMVeawdkv2+HanjuTOdeO2ehvyG71vvxA3RkCv8o5Jhh
 da+jAMhkpYQxk8mSKkWm
 =8+cB
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull ARM64 updates from Catalin Marinas:
 - KGDB support for arm64
 - PCI I/O space extended to 16M (in preparation of PCIe support
   patches)
 - Dropping ZONE_DMA32 in favour of ZONE_DMA (we only need one for the
   time being), together with swiotlb late initialisation to correctly
   setup the bounce buffer
 - DMA API cache maintenance support (not all ARMv8 platforms have
   hardware cache coherency)
 - Crypto extensions advertising via ELF_HWCAP2 for compat user space
 - Perf support for dwarf unwinding in compat mode
 - asm/tlb.h converted to the generic mmu_gather code
 - asm-generic rwsem implementation
 - Code clean-up

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (42 commits)
  arm64: Remove pgprot_dmacoherent()
  arm64: Support DMA_ATTR_WRITE_COMBINE
  arm64: Implement custom mmap functions for dma mapping
  arm64: Fix __range_ok macro
  arm64: Fix duplicated Kconfig entries
  arm64: mm: Route pmd thp functions through pte equivalents
  arm64: rwsem: use asm-generic rwsem implementation
  asm-generic: rwsem: de-PPCify rwsem.h
  arm64: enable generic CPU feature modalias matching for this architecture
  arm64: smp: make local symbol static
  arm64: debug: make local symbols static
  ARM64: perf: support dwarf unwinding in compat mode
  ARM64: perf: add support for frame pointer unwinding in compat mode
  ARM64: perf: add support for perf registers API
  arm64: Add boot time configuration of Intermediate Physical Address size
  arm64: Do not synchronise I and D caches for special ptes
  arm64: Make DMA coherent and strongly ordered mappings not executable
  arm64: barriers: add dmb barrier
  arm64: topology: Implement basic CPU topology support
  arm64: advertise ARMv8 extensions to 32-bit compat ELF binaries
  ...
2014-03-31 15:01:45 -07:00
Linus Torvalds
190f918660 Merge branch 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 compat wrapper rework from Heiko Carstens:
 "S390 compat system call wrapper simplification work.

  The intention of this work is to get rid of all hand written assembly
  compat system call wrappers on s390, which perform proper sign or zero
  extension, or pointer conversion of compat system call parameters.
  Instead all of this should be done with C code eg by using Al's
  COMPAT_SYSCALL_DEFINEx() macro.

  Therefore all common code and s390 specific compat system calls have
  been converted to the COMPAT_SYSCALL_DEFINEx() macro.

  In order to generate correct code all compat system calls may only
  have eg compat_ulong_t parameters, but no unsigned long parameters.
  Those patches which change parameter types from unsigned long to
  compat_ulong_t parameters are separate in this series, but shouldn't
  cause any harm.

  The only compat system calls which intentionally have 64 bit
  parameters (preadv64 and pwritev64) in support of the x86/32 ABI
  haven't been changed, but are now only available if an architecture
  defines __ARCH_WANT_COMPAT_SYS_PREADV64/PWRITEV64.

  System calls which do not have a compat variant but still need proper
  zero extension on s390, like eg "long sys_brk(unsigned long brk)" will
  get a proper wrapper function with the new s390 specific
  COMPAT_SYSCALL_WRAPx() macro:

     COMPAT_SYSCALL_WRAP1(brk, unsigned long, brk);

  which generates the following code (simplified):

     asmlinkage long sys_brk(unsigned long brk);
     asmlinkage long compat_sys_brk(long brk)
     {
         return sys_brk((u32)brk);
     }

  Given that the C file which contains all the COMPAT_SYSCALL_WRAP lines
  includes both linux/syscall.h and linux/compat.h, it will generate
  build errors, if the declaration of sys_brk() doesn't match, or if
  there exists a non-matching compat_sys_brk() declaration.

  In addition this will intentionally result in a link error if
  somewhere else a compat_sys_brk() function exists, which probably
  should have been used instead.  Two more BUILD_BUG_ONs make sure the
  size and type of each compat syscall parameter can be handled
  correctly with the s390 specific macros.

  I converted the compat system calls step by step to verify the
  generated code is correct and matches the previous code.  In fact it
  did not always match, however that was always a bug in the hand
  written asm code.

  In result we get less code, less bugs, and much more sanity checking"

* 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (44 commits)
  s390/compat: add copyright statement
  compat: include linux/unistd.h within linux/compat.h
  s390/compat: get rid of compat wrapper assembly code
  s390/compat: build error for large compat syscall args
  mm/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
  kexec/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
  net/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
  ipc/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
  fs/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
  ipc/compat: convert to COMPAT_SYSCALL_DEFINE
  fs/compat: convert to COMPAT_SYSCALL_DEFINE
  security/compat: convert to COMPAT_SYSCALL_DEFINE
  mm/compat: convert to COMPAT_SYSCALL_DEFINE
  net/compat: convert to COMPAT_SYSCALL_DEFINE
  kernel/compat: convert to COMPAT_SYSCALL_DEFINE
  fs/compat: optional preadv64/pwrite64 compat system calls
  ipc/compat_sys_msgrcv: change msgtyp type from long to compat_long_t
  s390/compat: partial parameter conversion within syscall wrappers
  s390/compat: automatic zero, sign and pointer conversion of syscalls
  s390/compat: add sync_file_range and fallocate compat syscalls
  ...
2014-03-31 14:32:17 -07:00
Yan, Zheng
18df11d0ea nfsd4: fix memory leak in nfsd4_encode_fattr()
fh_put() does not free the temporary file handle.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-31 17:08:23 -04:00
Stanislav Kinsbursky
3064639423 nfsd: check passed socket's net matches NFSd superblock's one
There could be a case, when NFSd file system is mounted in network, different
to socket's one, like below:

"ip netns exec" creates new network and mount namespace, which duplicates NFSd
mount point, created in init_net context. And thus NFS server stop in nested
network context leads to RPCBIND client destruction in init_net.
Then, on NFSd start in nested network context, rpc.nfsd process creates socket
in nested net and passes it into "write_ports", which leads to RPCBIND sockets
creation in init_net context because of the same reason (NFSd monut point was
created in init_net context). An attempt to register passed socket in nested
net leads to panic, because no RPCBIND client present in nexted network
namespace.

This patch add check that passed socket's net matches NFSd superblock's one.
And returns -EINVAL error to user psace otherwise.

v2: Put socket on exit.

Reported-by: Weng Meiling <wengmeiling.weng@huawei.com>
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-31 16:58:16 -04:00
Linus Torvalds
7cc3afdf43 Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 EFI changes from Ingo Molnar:
 "The main changes:

  - Add debug code to the dump EFI pagetable - Borislav Petkov

  - Make 1:1 runtime mapping robust when booting on machines with lots
    of memory - Borislav Petkov

  - Move the EFI facilities bits out of 'x86_efi_facility' and into
    efi.flags which is the standard architecture independent place to
    keep EFI state, by Matt Fleming.

  - Add 'EFI mixed mode' support: this allows 64-bit kernels to be
    booted from 32-bit firmware.  This needs a bootloader that supports
    the 'EFI handover protocol'.  By Matt Fleming"

* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  x86, efi: Abstract x86 efi_early calls
  x86/efi: Restore 'attr' argument to query_variable_info()
  x86/efi: Rip out phys_efi_get_time()
  x86/efi: Preserve segment registers in mixed mode
  x86/boot: Fix non-EFI build
  x86, tools: Fix up compiler warnings
  x86/efi: Re-disable interrupts after calling firmware services
  x86/boot: Don't overwrite cr4 when enabling PAE
  x86/efi: Wire up CONFIG_EFI_MIXED
  x86/efi: Add mixed runtime services support
  x86/efi: Firmware agnostic handover entry points
  x86/efi: Split the boot stub into 32/64 code paths
  x86/efi: Add early thunk code to go from 64-bit to 32-bit
  x86/efi: Build our own EFI services pointer table
  efi: Add separate 32-bit/64-bit definitions
  x86/efi: Delete dead code when checking for non-native
  x86/mm/pageattr: Always dump the right page table in an oops
  x86, tools: Consolidate #ifdef code
  x86/boot: Cleanup header.S by removing some #ifdefs
  efi: Use NULL instead of 0 for pointer
  ...
2014-03-31 12:26:05 -07:00
Linus Torvalds
b3fd4ea9df Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "Main changes:

   - Torture-test changes, including refactoring of rcutorture and
     introduction of a vestigial locktorture.

   - Real-time latency fixes.

   - Documentation updates.

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
  rcu: Provide grace-period piggybacking API
  rcu: Ensure kernel/rcu/rcu.h can be sourced/used stand-alone
  rcu: Fix sparse warning for rcu_expedited from kernel/ksysfs.c
  notifier: Substitute rcu_access_pointer() for rcu_dereference_raw()
  Documentation/memory-barriers.txt: Clarify release/acquire ordering
  rcutorture: Save kvm.sh output to log
  rcutorture: Add a lock_busted to test the test
  rcutorture: Place kvm-test-1-run.sh output into res directory
  rcutorture: Rename TREE_RCU-Kconfig.txt
  locktorture: Add kvm-recheck.sh plug-in for locktorture
  rcutorture: Gracefully handle NULL cleanup hooks
  locktorture: Add vestigial locktorture configuration
  rcutorture: Introduce "rcu" directory level underneath configs
  rcutorture: Rename kvm-test-1-rcu.sh
  rcutorture: Remove RCU dependencies from ver_functions.sh API
  rcutorture: Create CFcommon file for common Kconfig parameters
  rcutorture: Create config files for scripted test-the-test testing
  rcutorture: Add an rcu_busted to test the test
  locktorture: Add a lock-torture kernel module
  rcutorture: Abstract kvm-recheck.sh
  ...
2014-03-31 11:05:24 -07:00
Steven Whitehouse
1b2ad41214 GFS2: Fix address space from page function
Now that rgrps use the address space which is part of the super
block, we need to update gfs2_mapping2sbd() to take account of
that. The only way to do that easily is to use a different set
of address_space_operations for rgrps.

Reported-by: Abhi Das <adas@redhat.com>
Tested-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-03-31 17:48:27 +01:00
Abhi Das
059788039f GFS2: Fix uninitialized VFS inode in gfs2_create_inode
When gfs2_create_inode() fails due to quota violation, the VFS
inode is not completely uninitialized. This can cause a list
corruption error.

This patch correctly uninitializes the VFS inode when a quota
violation occurs in the gfs2_create_inode codepath.

Resolves: rhbz#1059808
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-03-31 16:41:39 +01:00
Jeff Layton
29723adee1 locks: make locks_mandatory_area check for file-private locks
Allow locks_mandatory_area() to handle file-private locks correctly.
If there is a file-private lock set on an open file and we're doing I/O
via the same, then that should not cause anything to block.

Handle this by first doing a non-blocking FL_ACCESS check for a
file-private lock, and then fall back to checking for a classic POSIX
lock (and possibly blocking).

Note that this approach is subject to the same races that have always
plagued mandatory locking on Linux.

Reported-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:43 -04:00
Jeff Layton
d7a06983a0 locks: fix locks_mandatory_locked to respect file-private locks
As Trond pointed out, you can currently deadlock yourself by setting a
file-private lock on a file that requires mandatory locking and then
trying to do I/O on it.

Avoid this problem by plumbing some knowledge of file-private locks into
the mandatory locking code. In order to do this, we must pass down
information about the struct file that's being used to
locks_verify_locked.

Reported-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
2014-03-31 08:24:43 -04:00
Jeff Layton
90478939dc locks: require that flock->l_pid be set to 0 for file-private locks
Neil Brown suggested potentially overloading the l_pid value as a "lock
context" field for file-private locks. While I don't think we will
probably want to do that here, it's probably a good idea to ensure that
in the future we could extend this API without breaking existing
callers.

Typically the l_pid value is ignored for incoming struct flock
arguments, serving mainly as a place to return the pid of the owner if
there is a conflicting lock. For file-private locks, require that it
currently be set to 0 and return EINVAL if it isn't. If we eventually
want to make a non-zero l_pid mean something, then this will help ensure
that we don't break legacy programs that are using file-private locks.

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:43 -04:00
Jeff Layton
5d50ffd7c3 locks: add new fcntl cmd values for handling file private locks
Due to some unfortunate history, POSIX locks have very strange and
unhelpful semantics. The thing that usually catches people by surprise
is that they are dropped whenever the process closes any file descriptor
associated with the inode.

This is extremely problematic for people developing file servers that
need to implement byte-range locks. Developers often need a "lock
management" facility to ensure that file descriptors are not closed
until all of the locks associated with the inode are finished.

Additionally, "classic" POSIX locks are owned by the process. Locks
taken between threads within the same process won't conflict with one
another, which renders them useless for synchronization between threads.

This patchset adds a new type of lock that attempts to address these
issues. These locks conflict with classic POSIX read/write locks, but
have semantics that are more like BSD locks with respect to inheritance
and behavior on close.

This is implemented primarily by changing how fl_owner field is set for
these locks. Instead of having them owned by the files_struct of the
process, they are instead owned by the filp on which they were acquired.
Thus, they are inherited across fork() and are only released when the
last reference to a filp is put.

These new semantics prevent them from being merged with classic POSIX
locks, even if they are acquired by the same process. These locks will
also conflict with classic POSIX locks even if they are acquired by
the same process or on the same file descriptor.

The new locks are managed using a new set of cmd values to the fcntl()
syscall. The initial implementation of this converts these values to
"classic" cmd values at a fairly high level, and the details are not
exposed to the underlying filesystem. We may eventually want to push
this handing out to the lower filesystem code but for now I don't
see any need for it.

Also, note that with this implementation the new cmd values are only
available via fcntl64() on 32-bit arches. There's little need to
add support for legacy apps on a new interface like this.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:43 -04:00
Jeff Layton
57b65325fe locks: skip deadlock detection on FL_FILE_PVT locks
It's not really feasible to do deadlock detection with FL_FILE_PVT
locks since they aren't owned by a single task, per-se. Deadlock
detection also tends to be rather expensive so just skip it for
these sorts of locks.

Also, add a FIXME comment about adding more limited deadlock detection
that just applies to ro -> rw upgrades, per Andy's request.

Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:43 -04:00
Jeff Layton
c1e62b8fc3 locks: pass the cmd value to fcntl_getlk/getlk64
Once we introduce file private locks, we'll need to know what cmd value
was used, as that affects the ownership and whether a conflict would
arise.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:43 -04:00
Jeff Layton
3fd80cddc6 locks: report l_pid as -1 for FL_FILE_PVT locks
FL_FILE_PVT locks are no longer tied to a particular pid, and are
instead inheritable by child processes. Report a l_pid of '-1' for
these sorts of locks since the pid is somewhat meaningless for them.

This precedent comes from FreeBSD. There, POSIX and flock() locks can
conflict with one another. If fcntl(F_GETLK, ...) returns a lock set
with flock() then the l_pid member cannot be a process ID because the
lock is not held by a process as such.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:42 -04:00
Jeff Layton
c918d42a27 locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT"
In a later patch, we'll be adding a new type of lock that's owned by
the struct file instead of the files_struct. Those sorts of locks
will be flagged with a new FL_FILE_PVT flag.

Report these types of locks as "FLPVT" in /proc/locks to distinguish
them from "classic" POSIX locks.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:42 -04:00
Jeff Layton
78ed8a1338 locks: rename locks_remove_flock to locks_remove_file
This function currently removes leases in addition to flock locks and in
a later patch we'll have it deal with file-private locks too. Rename it
to locks_remove_file to indicate that it removes locks that are
associated with a particular struct file, and not just flock locks.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:42 -04:00
Jeff Layton
bce7560d49 locks: consolidate checks for compatible filp->f_mode values in setlk handlers
Move this check into flock64_to_posix_lock instead of duplicating it in
two places. This also fixes a minor wart in the code where we continue
referring to the struct flock after converting it to struct file_lock.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:42 -04:00
J. Bruce Fields
ef12e72a01 locks: fix posix lock range overflow handling
In the 32-bit case fcntl assigns the 64-bit f_pos and i_size to a 32-bit
off_t.

The existing range checks also seem to depend on signed arithmetic
wrapping when it overflows.  In practice maybe that works, but we can be
more careful.  That also allows us to make a more reliable distinction
between -EINVAL and -EOVERFLOW.

Note that in the 32-bit case SEEK_CUR or SEEK_END might allow the caller
to set a lock with starting point no longer representable as a 32-bit
value.  We could return -EOVERFLOW in such cases, but the locks code is
capable of handling such ranges, so we choose to be lenient here.  The
only problem is that subsequent GETLK calls on such a lock will fail
with EOVERFLOW.

While we're here, do some cleanup including consolidating code for the
flock and flock64 cases.

Signed-off-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:42 -04:00
Jeff Layton
8c3cac5e6a locks: eliminate BUG() call when there's an unexpected lock on file close
A leftover lock on the list is surely a sign of a problem of some sort,
but it's not necessarily a reason to panic the box. Instead, just log a
warning with some info about the lock, and then delete it like we would
any other lock.

In the event that the filesystem declares a ->lock f_op, we may end up
leaking something, but that's generally preferable to an immediate
panic.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:42 -04:00
Jeff Layton
b03dfdec03 locks: add __acquires and __releases annotations to locks_start and locks_stop
...to make sparse happy.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:42 -04:00
Jeff Layton
6ca10ed8ed locks: remove "inline" qualifier from fl_link manipulation functions
It's best to let the compiler decide that.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:42 -04:00
Jeff Layton
46dad7603f locks: clean up comment typo
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2014-03-31 08:24:42 -04:00
Jeff Layton
24cbe7845e locks: close potential race between setlease and open
As Al Viro points out, there is an unlikely, but possible race between
opening a file and setting a lease on it. generic_add_lease is done with
the i_lock held, but the inode->i_flock check in break_lease is
lockless. It's possible for another task doing an open to do the entire
pathwalk and call break_lease between the point where generic_add_lease
checks for a conflicting open and adds the lease to the list. If this
occurs, we can end up with a lease set on the file with a conflicting
open.

To guard against that, check again for a conflicting open after adding
the lease to the i_flock list. If the above race occurs, then we can
simply unwind the lease setting and return -EAGAIN.

Because we take dentry references and acquire write access on the file
before calling break_lease, we know that if the i_flock list is empty
when the open caller goes to check it then the necessary refcounts have
already been incremented. Thus the additional check for a conflicting
open will see that there is one and the setlease call will fail.

Cc: Bruce Fields <bfields@fieldses.org>
Cc: David Howells <dhowells@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@fieldses.org>
2014-03-31 08:24:42 -04:00
Abhi Das
e9fb7c73a4 GFS2: Fix return value in slot_get()
ENOSPC was being returned in slot_get inspite of successful
execution of the function. This patch fixes this return
code.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-03-31 10:43:05 +01:00
Daniel Vetter
0654a65f26 Linux 3.14
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTOOOnAAoJEHm+PkMAQRiGsBAH/2PAOL3TbOG6tEedxQrTwsr2
 muRIRTVWawjT8/npbHupxGnAyAVdmdffBHpmCmcftKdKNryT3YZW8/JWoYc+WSlo
 3vTDJHDOYAe6yCBjjhYwcu150THBQdOymOi5mbbclo0XWYG18jd3+abYprRH6SiD
 XqNSzYqoiv91JHBAWKBIpo1cyRDuwoM7+jZ7gX41r2800EL7loY3e08cPDDNU6HA
 CKaLXMwLwYTefE+Wnr+4UUr08NbNBbBUKLUSXVqKKIpd+MtbyhV1SnWzz8VQSkag
 K/uzsnGnE7nrqoepMSx3nXxzOWxUSY2EMbwhEjaKK4xBq9C9pzv3sG/o2/IyopU=
 =Nuom
 -----END PGP SIGNATURE-----

Merge tag 'v3.14' into drm-intel-next-queued

Linux 3.14

The vt-d w/a merged late in 3.14-rc needs a bit of fine-tuning, hence
backmerge.

Conflicts:
	drivers/gpu/drm/i915/i915_gem_gtt.c
	drivers/gpu/drm/i915/intel_ddi.c
	drivers/gpu/drm/i915/intel_dp.c

All trivial adjacent lines changed type conflicts, so trivial git
doesn't even show them in the merg commit.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2014-03-31 10:45:15 +02:00
Grant Likely
d88cf7d7b4 Merge remote-tracking branch 'robh/for-next' into devicetree/next 2014-03-31 08:10:55 +01:00
Linus Torvalds
fedc1ed0f1 Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Switch mnt_hash to hlist, turning the races between __lookup_mnt() and
  hash modifications into false negatives from __lookup_mnt() (instead
  of hangs)"

On the false negatives from __lookup_mnt():
 "The *only* thing we care about is not getting stuck in __lookup_mnt().
  If it misses an entry because something in front of it just got moved
  around, etc, we are fine.  We'll notice that mount_lock mismatch and
  that'll be it"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch mnt_hash to hlist
  don't bother with propagate_mnt() unless the target is shared
  keep shadowed vfsmounts together
  resizable namespace.c hashes
2014-03-30 17:26:08 -07:00
Theodore Ts'o
00a1a053eb ext4: atomically set inode->i_flags in ext4_set_inode_flags()
Use cmpxchg() to atomically set i_flags instead of clearing out the
S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
where an immutable file has the immutable flag cleared for a brief
window of time.

Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-30 17:02:06 -07:00
Al Viro
38129a13e6 switch mnt_hash to hlist
fixes RCU bug - walking through hlist is safe in face of element moves,
since it's self-terminating.  Cyclic lists are not - if we end up jumping
to another hash chain, we'll loop infinitely without ever hitting the
original list head.

[fix for dumb braino folded]

Spotted by: Max Kellermann <mk@cm4all.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-30 19:18:51 -04:00
Al Viro
0b1b901b5a don't bother with propagate_mnt() unless the target is shared
If the dest_mnt is not shared, propagate_mnt() does nothing -
there's no mounts to propagate to and thus no copies to create.
Might as well don't bother calling it in that case.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-30 19:18:50 -04:00
Al Viro
1d6a32acd7 keep shadowed vfsmounts together
preparation to switching mnt_hash to hlist

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-30 19:18:50 -04:00
Al Viro
0818bf27c0 resizable namespace.c hashes
* switch allocation to alloc_large_system_hash()
* make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
* switch mountpoint_hashtable from list_head to hlist_head

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-30 19:18:49 -04:00
Kinglong Mee
d531c008d7 NFSD/SUNRPC: Check rpc_xprt out of xs_setup_bc_tcp
Besides checking rpc_xprt out of xs_setup_bc_tcp,
increase it's reference (it's important).

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-30 10:47:36 -04:00
Kinglong Mee
2336745e87 NFSD: Clear wcc data between compound ops
Testing NFS4.0 by pynfs, I got some messeages as,
"nfsd: inode locked twice during operation."

When one compound RPC contains two or more ops that locks
the filehandle,the second op will cause the message.

As two SETATTR ops, after the first SETATTR, nfsd will not call
fh_put() to release current filehandle, it means filehandle have
unlocked with fh_post_saved = 1.
The second SETATTR find fh_post_saved = 1, and printk the message.

v2: introduce helper fh_clear_wcc().

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-30 10:47:34 -04:00
Trond Myklebust
a8a7c6776f nfsd: Don't return NFS4ERR_STALE_STATEID for NFSv4.1+
RFC5661 obsoletes NFS4ERR_STALE_STATEID in favour of NFS4ERR_BAD_STATEID.

Note that because nfsd encodes the clientid boot time in the stateid, we
can hit this error case in certain scenarios where the Linux client
state management thread exits early, before it has finished recovering
all state.

Reported-by: Idan Kedar <idank@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-30 10:47:33 -04:00
J. Bruce Fields
1bc49d83c3 nfsd4: fix nfs4err_resource in 4.1 case
encode_getattr, for example, can return nfserr_resource to indicate it
ran out of buffer space.  That's not a legal error in the 4.1 case.  And
in the 4.1 case, if we ran out of buffer space, we should have exceeded
a session limit too.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-28 21:24:56 -04:00
J. Bruce Fields
480efaee08 nfsd4: fix setclientid encode size
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-28 21:24:55 -04:00
J. Bruce Fields
1bed92cb3c nfsd4: remove redundant check from nfsd4_check_resp_size
cstate->slot and ->session are each set together in nfsd4_sequence.  If
one is non-NULL, so is the other.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-28 21:24:54 -04:00
J. Bruce Fields
d139977d00 nfsd4: use more generous NFS4_ACL_MAX
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-28 21:24:53 -04:00
J. Bruce Fields
0da7b19cc3 nfsd4: minor nfsd4_replay_cache_entry cleanup
Maybe this is comment true, who cares?  Handle this like any other
error.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-28 21:24:52 -04:00
J. Bruce Fields
3ca2eb9814 nfsd4: nfsd4_replay_cache_entry should be static
This isn't actually used anywhere else.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-28 21:24:51 -04:00
J. Bruce Fields
067e1ace46 nfsd4: update comments with obsolete function name
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-28 21:24:50 -04:00
Kinglong Mee
3f42d2c428 NFSD: Using free_conn free connection
Connection from alloc_conn must be freed through free_conn,
otherwise, the reference of svc_xprt will never be put.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-28 21:23:40 -04:00
Trond Myklebust
e911b8158e NFSv4: Fix a use-after-free problem in open()
If we interrupt the nfs4_wait_for_completion_rpc_task() call in
nfs4_run_open_task(), then we don't prevent the RPC call from
completing. So freeing up the opendata->f_attr.mdsthreshold
in the error path in _nfs4_do_open() leads to a use-after-free
when the XDR decoder tries to decode the mdsthreshold information
from the server.

Fixes: 82be417aa3 (NFSv4.1 cache mdsthreshold values on OPEN)
Tested-by: Steve Dickson <SteveD@redhat.com>
Cc: stable@vger.kernel.org # 3.5+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-03-28 20:12:10 -04:00
J. Bruce Fields
fbb74a34a5 nfsd: typo in nfsd_rename comment
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-28 18:02:11 -04:00
Kinglong Mee
4daeed25ad NFSD: simplify saved/current fh uses in nfsd4_proc_compound
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-28 18:01:40 -04:00
Sasha Levin
d9060742fb ocfs2: check if cluster name exists before deref
Commit c74a3bdd9b ("ocfs2: add clustername to cluster connection") is
trying to strlcpy a string which was explicitly passed as NULL in the
very same patch, triggering a NULL ptr deref.

  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: strlcpy (lib/string.c:388 lib/string.c:151)
  CPU: 19 PID: 19426 Comm: trinity-c19 Tainted: G        W     3.14.0-rc7-next-20140325-sasha-00014-g9476368-dirty #274
  RIP:  strlcpy (lib/string.c:388 lib/string.c:151)
  Call Trace:
   ocfs2_cluster_connect (fs/ocfs2/stackglue.c:350)
   ocfs2_cluster_connect_agnostic (fs/ocfs2/stackglue.c:396)
   user_dlm_register (fs/ocfs2/dlmfs/userdlm.c:679)
   dlmfs_mkdir (fs/ocfs2/dlmfs/dlmfs.c:503)
   vfs_mkdir (fs/namei.c:3467)
   SyS_mkdirat (fs/namei.c:3488 fs/namei.c:3472)
   tracesys (arch/x86/kernel/entry_64.S:749)

akpm: this patch probably disables the feature.  A temporary thing to
avoid triviel oopses.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-28 13:56:58 -07:00
Jeff Layton
679b033df4 lockd: ensure we tear down any live sockets when socket creation fails during lockd_up
We had a Fedora ABRT report with a stack trace like this:

kernel BUG at net/sunrpc/svc.c:550!
invalid opcode: 0000 [#1] SMP
[...]
CPU: 2 PID: 913 Comm: rpc.nfsd Not tainted 3.13.6-200.fc20.x86_64 #1
Hardware name: Hewlett-Packard HP ProBook 4740s/1846, BIOS 68IRR Ver. F.40 01/29/2013
task: ffff880146b00000 ti: ffff88003f9b8000 task.ti: ffff88003f9b8000
RIP: 0010:[<ffffffffa0305fa8>]  [<ffffffffa0305fa8>] svc_destroy+0x128/0x130 [sunrpc]
RSP: 0018:ffff88003f9b9de0  EFLAGS: 00010206
RAX: ffff88003f829628 RBX: ffff88003f829600 RCX: 00000000000041ee
RDX: 0000000000000000 RSI: 0000000000000286 RDI: 0000000000000286
RBP: ffff88003f9b9de8 R08: 0000000000017360 R09: ffff88014fa97360
R10: ffffffff8114ce57 R11: ffffea00051c9c00 R12: ffff88003f829600
R13: 00000000ffffff9e R14: ffffffff81cc7cc0 R15: 0000000000000000
FS:  00007f4fde284840(0000) GS:ffff88014fa80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4fdf5192f8 CR3: 00000000a569a000 CR4: 00000000001407e0
Stack:
 ffff88003f792300 ffff88003f9b9e18 ffffffffa02de02a 0000000000000000
 ffffffff81cc7cc0 ffff88003f9cb000 0000000000000008 ffff88003f9b9e60
 ffffffffa033bb35 ffffffff8131c86c ffff88003f9cb000 ffff8800a5715008
Call Trace:
 [<ffffffffa02de02a>] lockd_up+0xaa/0x330 [lockd]
 [<ffffffffa033bb35>] nfsd_svc+0x1b5/0x2f0 [nfsd]
 [<ffffffff8131c86c>] ? simple_strtoull+0x2c/0x50
 [<ffffffffa033c630>] ? write_pool_threads+0x280/0x280 [nfsd]
 [<ffffffffa033c6bb>] write_threads+0x8b/0xf0 [nfsd]
 [<ffffffff8114efa4>] ? __get_free_pages+0x14/0x50
 [<ffffffff8114eff6>] ? get_zeroed_page+0x16/0x20
 [<ffffffff811dec51>] ? simple_transaction_get+0xb1/0xd0
 [<ffffffffa033c098>] nfsctl_transaction_write+0x48/0x80 [nfsd]
 [<ffffffff811b8b34>] vfs_write+0xb4/0x1f0
 [<ffffffff811c3f99>] ? putname+0x29/0x40
 [<ffffffff811b9569>] SyS_write+0x49/0xa0
 [<ffffffff810fc2a6>] ? __audit_syscall_exit+0x1f6/0x2a0
 [<ffffffff816962e9>] system_call_fastpath+0x16/0x1b
Code: 31 c0 e8 82 db 37 e1 e9 2a ff ff ff 48 8b 07 8b 57 14 48 c7 c7 d5 c6 31 a0 48 8b 70 20 31 c0 e8 65 db 37 e1 e9 f4 fe ff ff 0f 0b <0f> 0b 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 55
RIP  [<ffffffffa0305fa8>] svc_destroy+0x128/0x130 [sunrpc]
 RSP <ffff88003f9b9de0>

Evidently, we created some lockd sockets and then failed to create
others. make_socks then returned an error and we tried to tear down the
svc, but svc->sv_permsocks was not empty so we ended up tripping over
the BUG() in svc_destroy().

Fix this by ensuring that we tear down any live sockets we created when
socket creation is going to return an error.

Fixes: 786185b5f8 (SUNRPC: move per-net operations from...)
Reported-by: Raphos <raphoszap@laposte.net>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-03-28 10:43:08 -04:00