Commit Graph

50070 Commits

Author SHA1 Message Date
David Sterba
6165572c11 btrfs: use GFP_KERNEL in btrfs_init_dev_replace_tgtdev
The function is called from ioctl context and we don't hold any locks
that take part in writeback. Right now it's only fs_info::volume_mutex.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
David Sterba
6a44517d79 btrfs: use GFP_KERNEL in btrfs_calc_avail_data_space
We don't hold any locks here. Inidirectly called from statfs.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
Nikolay Borisov
0eee8a494e btrfs: Use btrfs_space_info_used instead of opencoding it
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
Anand Jain
4fc6441aac btrfs: wait part of the write_dev_flush() can be separated out
Submit and wait parts of write_dev_flush() can be split into two
separate functions for better readability.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
Anand Jain
cea7c8bf77 btrfs: remove redundant null bdev counting during flush submission
There is no extra benefit to count null bdev during the submit loop,
as these null devices will be anyway checked during command
completion device loop just after the submit loop. We are holding the
device_list_mutex, the device->bdev status won't change in between.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
Anand Jain
12b9bf0b94 btrfs: write_dev_flush does not return ENOMEM anymore
Since commit "btrfs: btrfs_io_bio_alloc never fails, skip error handling"
write_dev_flush will not return ENOMEM in the sending part. We do not
need to check for it in the callers.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
Timofey Titovets
170607ebd9 Btrfs: compression must free at least one sector size
We already skip storing data where compression does not make the result
at least one byte less.  Let's make the logic better and check
that compression frees at least one sector size of bytes, otherwise it's
not that useful.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changelog updated ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
David Sterba
c5e4c3d750 btrfs: sink gfp parameter to btrfs_io_bio_alloc
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we
change it back from GFP_KERNEL in scrub. I'd rather save a few stack
bytes from not passing the gfp flags in the remaining, more imporatant,
contexts and the bio allocating API now looks more consistent.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
David Sterba
184f999e12 btrfs: add helper to initialize the non-bio part of btrfs_io_bio
We use btrfs_bioset for bios and ask to allocate the entire size of
btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is
initialized but the bytes from 0 to offset of 'bio' are left
uninitialized. Although we initialize some of the members in our
helpers, we should initialize the whole structures.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
fa1bcbe0a5 btrfs: document mandatory order of bio in btrfs_io_bio
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
Liu Bo
ef7cdac101 Btrfs: skip checksum verification if IO error occurs
Currently dio read also goes to verify checksum if -EIO has been returned,
although it usually fails on checksum, it's not necessary at all, we could
directly check if there is another copy to read.

And with this, the behavior of dio read is now consistent with that of
buffered read.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use bool for uptodate ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
Liu Bo
e3d37faba2 Btrfs: tolerate errors if we have retried successfully
With raid1 profile, dio read isn't tolerating IO errors if read length is
less than the stripe length (64K).

Our bio didn't get split in btrfs_submit_direct_hook() if (dip->flags &
BTRFS_DIO_ORIG_BIO_SUBMITTED) is true and that happens when the read
length is less than 64k.  In this case, if the underlying device returns
error somehow, bio->bi_error has recorded that error.

If we could recover the correct data from another copy in profile raid1/10/5/6,
with btrfs_subio_endio_read() returning 0, bio would have the correct data in
its vector, but bio->bi_error is not updated accordingly so that the following
dio_end_io(dio_bio, bio->bi_error) makes directIO think this read has failed.

This fixes the problem by setting bio's error to 0 if a good copy has been
found.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
c821e7f3da btrfs: pass bytes to btrfs_bio_alloc
Most callers of btrfs_bio_alloc convert from bytes to sectors. Hide that
in the helper and simplify the logic in the callsers.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
9886b17433 btrfs: opencode trivial compressed_bio_alloc, simplify error handling
compressed_bio_alloc is now a trivial wrapper around btrfs_bio_alloc, no
point keeping it. The error handling can be simplified, as we know
btrfs_bio_alloc will never fail.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
9f2179a5e7 btrfs: remove redundant parameters from btrfs_bio_alloc
All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES.

submit_extent_page adds __GFP_HIGH that does not make a difference in
our case as it allows access to memory reserves but otherwise does not
change the constraints.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
8b6c1d56f2 btrfs: sink gfp parameter to btrfs_bio_clone
All callers pass GFP_NOFS.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
e4f5690386 btrfs: btrfs_io_bio_alloc never fails, skip error handling
Update direct callers of btrfs_io_bio_alloc that do error handling, that
we can now remove.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
3aa8e074ab btrfs: btrfs_bio_clone never fails, skip error handling
Update direct callers of btrfs_bio_clone that do error handling, that we
can now remove.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
0c4dd97c5e btrfs: btrfs_bio_alloc never fails, skip error handling
Update direct callers of btrfs_bio_alloc that do error handling, that we
can now remove.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
6e707bcd1f btrfs: bioset allocations will never fail, adapt our helpers
Christoph pointed out that bio allocations backed by a bioset will never
fail.  As we always use a bioset for all bio allocations, we can skip
the error handling.  This patch adjusts our low-level helpers, the
cascaded changes to all callers will come next.

CC: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
6acafd1eff btrfs: switch to kvmalloc and GFP_KERNEL in lzo/zlib alloc_workspace
The compression workspace buffers are larger than a page so we use
vmalloc, unconditionally. This is not always necessary as there might be
contiguous memory available.

Let's use the kvmalloc helpers that will try kmalloc first and fallback
to vmalloc. For that they require GFP_KERNEL flags. As we now have the
alloc_workspace calls protected by memalloc_nofs in the critical
contexts, we can safely use GFP_KERNEL.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
389a6cfc2a btrfs: switch kmallocs to GFP_KERNEL in lzo/zlib alloc_workspace
As alloc_workspace is now protected by memalloc_nofs where needed,
we can switch the kmalloc to use GFP_KERNEL.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
fe30853307 btrfs: add memalloc_nofs protections around alloc_workspace callback
The workspaces are preallocated at the beginning where we can safely use
GFP_KERNEL, but in some cases the find_workspace might reach the
allocation again, now in a more restricted context when the bios or
pages are being compressed.

To avoid potential lockup when alloc_workspace -> vmalloc would silently
use the GFP_KERNEL, add the memalloc_nofs helpers around the critical
call site.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
adf0212396 btrfs: adjust includes after vmalloc removal
As we don't use vmalloc/vzalloc/vfree directly in ctree.c, we can now
use the proper header that defines kvmalloc.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
f54de068dd btrfs: use GFP_KERNEL in init_ipath
Now that init_ipath is called either from a safe context or with
memalloc_nofs protection, we can switch to GFP_KERNEL allocations in
init_path and init_data_container.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
de2491fdef btrfs: scrub: add memalloc_nofs protection around init_ipath
init_ipath is called from a safe ioctl context and from scrub when
printing an error.  The protection is added for three reasons:

* init_data_container calls vmalloc and this does not work as expected
  in the GFP_NOFS context, so this silently does GFP_KERNEL and might
  deadlock in some cases
* keep the context constraint of GFP_NOFS, used by scrub
* we want to use GFP_KERNEL unconditionally inside init_ipath or its
  callees

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
f11f74416a btrfs: send: use kvmalloc in iterate_dir_item
We use a growing buffer for xattrs larger than a page size, at some
point vmalloc is unconditionally used for larger buffers. We can still
try to avoid it using the kvmalloc helper.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba
818e010bf9 btrfs: replace opencoded kvzalloc with the helper
The logic of kmalloc and vmalloc fallback is opencoded in
several places, we can now use the existing helper.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Timofey Titovets
1e9d7291e5 Btrfs: lzo: compressed data size must be less then input size
Logic already skips if compression makes data bigger, let's sync lzo
with zlib and also return error if compressed size is equal to
input size.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Guoqing Jiang
054ec2f626 btrfs: simplify code with bio_io_error
bio_io_error was introduced in the commit 4246a0b63b
("block: add a bi_error field to struct bio"), so use it to simplify
code.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Omar Sandoval
25ff17e82f Btrfs: use memalloc_nofs and kvzalloc() for free space tree bitmaps
First, instead of open-coding the vmalloc() fallback, use the new
kvzalloc() helper. Second, use memalloc_nofs_{save,restore}() instead of
GFP_NOFS, as vmalloc() uses some GFP_KERNEL allocations internally which
could lead to deadlocks.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
David Sterba
4b5faeac46 btrfs: use generic slab for for btrfs_transaction
Observing the number of slab objects of btrfs_transaction, there's just
one active on an almost quiescent filesystem, and the number of objects
goes to about ten when sync is in progress. Then the nubmer goes down to
1.  This matches the expectations of the transaction lifetime.

For such use the separate slab cache is not justified, as we do not
reuse objects frequently. For the shortlived transaction, the generic
slab (size 512) should be ok. We can optimistically expect that the 512
slabs are not all used (fragmentation) and there are free slots to take
when we do the allocation, compared to potentially allocating a whole new
page for the separate slab.

We'll lose the stats about the object use, which could be added later if
we really need them.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
David Sterba
3fb99303c6 btrfs: scrub: embed scrub_wr_ctx into scrub context
The structure scrub_wr_ctx is not used anywhere just the scrub context,
we can move the members there. The tgtdev is renamed so it's more clear
that it belongs to the "wr" part.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
David Sterba
25cc1226c1 btrfs: scrub: use fs_info::sectorsize and drop it from scrub context
As we now have the node/block sizes in fs_info, we can use them and can
drop the local copies.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Yonghong Song
04a87e3472 Btrfs: add statx support
Return enhanced file attributes from the btrfs, including:
  (1). inode creation time as stx_btime, and
  (2). Certain BTRFS_INODE_xxx flags are mapped to stx_attributes flags.

Example output:
	[root@localhost ~]# cat t.sh
	touch t
	chattr +aic t
	~/linux/samples/statx/test-statx t
	chattr -aic t
	touch t
	echo "========================================"
	~/linux/samples/statx/test-statx t
	/bin/rm t
	[root@localhost ~]# ./t.sh
	statx(t) = 0
	results=fff
  	  Size: 0               Blocks: 0          IO Block: 4096    regular file
	Device: 00:1c           Inode: 63962       Links: 1
	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
	Access: 2017-05-11 16:03:13.999856591-0700
	Modify: 2017-05-11 16:03:13.999856591-0700
	Change: 2017-05-11 16:03:14.000856663-0700
 	 Birth: 2017-05-11 16:03:13.999856591-0700
	Attributes: 0000000000000034 (........ ........ ........ ........ ........ ........ ........ .-ai.c..)
	========================================
	statx(t) = 0
	results=fff
	  Size: 0               Blocks: 0          IO Block: 4096    regular file
	Device: 00:1c           Inode: 63962       Links: 1
	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
	Access: 2017-05-11 16:03:14.006857097-0700
	Modify: 2017-05-11 16:03:14.006857097-0700
	Change: 2017-05-11 16:03:14.006857097-0700
 	Birth: 2017-05-11 16:03:13.999856591-0700
	Attributes: 0000000000000000 (........ ........ ........ ........ ........ ........ ........ .---.-..)
	[root@localhost ~]#

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Timofey Titovets
036b0217ad Btrfs: lzo: fix typo in error message after failed deflate
Fix copy paste typo in debug message for lzo.c, lzo is not deflate.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Jeff Layton
3189ff7786 btrfs: btrfs_wait_tree_block_writeback can be void return
Nothing checks its return value.

Is it safe to skip checking return value of btrfs_wait_tree_block_writeback?

Liu Bo: I think yes, it's used in walk_log_tree which is called in two
places, free_log_tree and log replay.  For free_log_tree, it waits for
any running writeback of the extent buffer under freeing to finish in
case we need to access the eb pointer from page->private, and it's OK to
not check the return value, while for log replay, it's doesn't wait
because wc->wait is not set. So neither cares about the writeback error.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
[ added more explanation to changelog, from Liu Bo ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Nikolay Borisov
118c701e20 btrfs: remove __BTRFS_LEAF_DATA_SIZE
__BTRFS_LAF_DATA_SIZE is used only by BTRFS_LEAF_DATA_SIZE. Make the
latter subsume the former.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Nikolay Borisov
3d9ec8c49a btrfs: rename btrfs_leaf_data to BTRFS_LEAF_DATA_OFFSET
Commit 5f39d397df ("Btrfs: Create extent_buffer interface
for large blocksizes") refactored btrfs_leaf_data function to take
extent_buffer rather than struct btrfs_leaf. However, as it turns out the
parameter being passed is never used. Furthermore this function no longer
returns the leaf data but rather the offset to it. So rename the function
to BTRFS_LEAF_DATA_OFFSET to make it consistent with other BTRFS_LEAF_*
helpers and turn it into a macro.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ removed () from the macro ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Anand Jain
e1ddce71d6 btrfs: reduce arguments for decompress_bio ops
struct compressed_bio pointer can be used instead.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Anand Jain
8140dc30a4 btrfs: btrfs_decompress_bio() could accept compressed_bio instead
Instead of sending each argument of struct compressed_bio, send
the compressed_bio itself.

Also by having struct compressed_bio in btrfs_decompress_bio()
it would help tracing.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Nikolay Borisov
d2006e6d28 btrfs: Refactor update_space_info
Following the factoring out of the creation code udpate_space_info can
only be called for already-existing space_info structs. As such it
cannot fail.  Remove superfluous error handling and make the function
return void.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Nikolay Borisov
2be12ef79f btrfs: Separate space_info create/update
Currently the struct space_info creation code is intermixed in the
udpate_space_info function. There are well-defined points at which the
we actually want to create brand-new space_info structs (e.g. during
mount of the filesystem as well as sometimes when adding/initialising
new chunks). In such cases update_space_info is called with 0 as the
bytes parameter. All of this makes for spaghetti code.

Fix it by factoring out the creation code in a separate
create_space_info structure. This also allows to simplify the internals.
Also remove BUG_ON from do_alloc_chunk since the callers handle errors.
Furthermore it will make the update_space_info function not fail,
allowing us to remove error handling in callers. This will come in a
follow up patch.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Liu Bo
555ba411aa Btrfs: let btrfs_print_leaf print more about block group
This adds chunk_objectid and flags, with flags we can recognize whether
the block group is about data or metadata.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Liu Bo
28785f70ef Btrfs: skip commit transaction if we don't have enough pinned bytes
We commit transaction in order to reclaim space from pinned bytes because
it could process delayed refs, and in may_commit_transaction(), we check
first if pinned bytes are enough for the required space, we then check if
that plus bytes reserved for delayed insert are enough for the required
space.

This changes the code to the above logic.

Fixes: b150a4f10d ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reported-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
David Sterba
4e2814ef04 btrfs: scrub: simplify cleanup of wr_ctx in scrub_free_ctx
We don't need to take the mutex and zero out wr_cur_bio, as this is
called after the scrub finished.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
David Sterba
e241ddeb9c btrfs: scrub: inline helper scrub_free_wr_ctx
The helper scrub_free_wr_ctx is used only once and fits into
scrub_free_ctx as it continues sctx shutdown, no need to keep it
separate.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
David Sterba
8fcdac3f20 btrfs: scrub: inline helper scrub_setup_wr_ctx
The helper scrub_setup_wr_ctx is used only once and fits into
scrub_setup_ctx as it continues intialization, no need to keep it
separate.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Jeff Mahoney
c1c4919b11 btrfs: remove root usage from can_overcommit
can_overcommit using the root to determine the allocation profile
is the only use of a root in the call graph below reserve_metadata_bytes.

It turns out that we only need to know whether the allocation is for
the chunk root or not -- and we can pass that around as a bool instead.

This allows us to pull root usage out of the reservation path all the
way up to reserve_metadata_bytes itself, which uses it only to compare
against fs_info->chunk_root to set the bool.  In turn, this eliminates
a bunch of races where we use a particular root too early in the mount
process.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Jeff Mahoney
1b86826d12 btrfs: cleanup root usage by btrfs_get_alloc_profile
There are two places where we don't already know what kind of alloc
profile we need before calling btrfs_get_alloc_profile, but we need
access to a root everywhere we call it.

This patch adds helpers for btrfs_{data,metadata,system}_alloc_profile()
and relegates btrfs_system_alloc_profile to a static for use in those
two cases.  The next patch will eliminate one of those.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
e03733da5a btrfs: fix bool type in btrfs_page_exists_in_range
We use only a simple bool indicator, int is not a problem here.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
c9fed2bb61 btrfs: remove unused member list from btrfs_end_io_wq
The end io work queue items have been tracked by the work queues since
"Btrfs: Add async worker threads for pre and post IO checksumming"
(8b71284292) (2008).

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
ee4ea69852 btrfs: remove unused members dir_path from recorded_ref
The two members do not seem to be used since the initial commit.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
b297c9f68f btrfs: remove unused member list from async_submit_bio
The list used to track checksums in the early version (2.6.29), but I
was able not pinpoint the commit that stopped using it. Everything
apparently works without it for a long time.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
106204f191 btrfs: remove unused member err from reada_extent
Seems to be unused since the initial commit, we ignore readahead errors
anyway, the full read will handle that if necessary.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Sahil Kang
0bef71093d btrfs: Remove unnecessary branching in free-space-tree.c
Both btrfs_create_free_space_tree and btrfs_clear_free_space_tree
contain:

  if (ret)
          return ret;

  return 0;

The if statement is only false when ret equals zero, and since we return
zero in such cases, we can safely remove the branching.

Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
e477094f0d Btrfs: hardcode GFP_NOFS for btrfs_bio_clone_partial
We only pass GFP_NOFS to btrfs_bio_clone_partial, so lets hardcode it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Arnd Bergmann
3c91ee6964 Btrfs: work around maybe-uninitialized warning
A rewrite of btrfs_submit_direct_hook appears to have introduced a warning:

fs/btrfs/inode.c: In function 'btrfs_submit_direct_hook':
fs/btrfs/inode.c:8467:14: error: 'bio' may be used uninitialized in this function [-Werror=maybe-uninitialized]

Where the 'bio' variable was previously initialized unconditionally, it
is now set in the "while (submit_len > 0)" loop that would never execute
if submit_len is zero.

Assuming this cannot happen in practice, we can avoid the warning
by simply replacing the while{} loop with a do{}while() loop so
the compiler knows that it will always be entered at least once.

Fixes changes introduced in "Btrfs: use bio_clone_bioset_partial to
simplify DIO submit".

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
3892ac9086 Btrfs: unify naming of btrfs_io_bio
All dio endio functions are using io_bio for struct btrfs_io_bio, this
makes btrfs_submit_direct to follow this convention.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
11b5616516 Btrfs: check-integrity use bvec_iter
Some check-integrity code depends on bio->bi_vcnt, this changes it to use
bio segments because some bios passing here may not have a reliable
bi_vcnt.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
629ebf4fad Btrfs: record error if one block has failed to retry
In the nocsum case of dio read endio, it returns immediately if an error
gets returned when repairing, which leaves the rest blocks unrepaired.  The
behavior is different from how buffered read endio works in the same case.
This changes it to record error only and go on repairing the rest blocks.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
17347cec15 Btrfs: change how we iterate bios in endio
Since dio submit has used bio_clone_fast, the submitted bio may not have a
reliable bi_vcnt, for the bio vector iterations in checksum related
functions, bio->bi_iter is not modified yet and it's safe to use
bio_for_each_segment, while for those bio vector iterations in dio read's
endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning
bios and use the helper __bio_for_each_segment with the saved bvec_iter to
access each bvec.

Also for dio reads which don't get split, we also need to save a copy of
bio iterator in btrfs_bio_clone to let __bio_for_each_segments to access
each bvec in dio read's endio.  Note that it doesn't affect other calls of
btrfs_bio_clone() because they don't need to use this iterator.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
725130bac5 Btrfs: use bio_clone_bioset_partial to simplify DIO submit
Currently when mapping bio to limit bio to a single stripe length, we
split bio by adding page to bio one by one, but later we don't modify
the vector of bio at all, thus we can use bio_clone_fast to use the
original bio vector directly.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Liu Bo
2f8e914042 Btrfs: new helper btrfs_bio_clone_partial
This adds a new helper btrfs_bio_clone_partial, it'll allocate a cloned
bio that only owns a part of the original bio's data.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Liu Bo
015c1bd9f1 Btrfs: use bio_clone_fast to clone our bio
For raid1 and raid10, we clone the original bio to the bios which are then
sent to different disks.

Right now we use bio_clone_bioset to create a clone bio with iterating
bi_io_vec to initialize it.  This changes it to use bio_clone_fast()
which creates a clone bio but only copies the bi_io_vec pointer
instead of iterating bi_io_vec.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik
7870d0822b Btrfs: don't pass the inode through clean_io_failure
Instead pass around the failure tree and the io tree.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik
6ec656bc0f btrfs: remove inode argument from repair_io_failure
Once we remove the btree_inode we won't have an inode to pass anymore,
just pass the fs_info directly and the inum since we use that to print
out the repair message.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik
c6100a4b4e Btrfs: replace tree->mapping with tree->private_data
For extent_io tree's we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks.  This works fine when everything that has
an extent_io_tree has an inode.  But we are going to remove the
btree_inode, so we need to change this.  Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead.  This had a lot of cascading changes but
should be relatively straightforward.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor reordering of the callback prototypes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Sargun Dhillon
2723480a0f btrfs: Add quota_override knob into sysfs
This patch adds the read-write attribute quota_override into sysfs.
Any process which has CAP_SYS_RESOURCE can set this flag to on, and
once it is set to true, processes with CAP_SYS_RESOURCE can exceed
the quota.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor changelog edits ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Sargun Dhillon
f29efe2921 btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE
This patch introduces the quota override flag to btrfs_fs_info, and a
change to quota limit checking code to temporarily allow for quota to be
overridden for processes with CAP_SYS_RESOURCE.

It's useful for administrative programs, such as log rotation, that may
need to temporarily use more disk space in order to free up a greater
amount of overall disk space without yielding more disk space to the
rest of userland.

Eventually, we may want to add the idea of an operator-specific quota,
operator reserved space, or something else to allow for administrative
override, but this is perhaps the simplest solution.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor changelog edits ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Nikolay Borisov
a5ed45f822 btrfs: Convert fs_info->free_chunk_space to atomic64_t
The ->free_chunk_space variable is used to track the unallocated space
and access to it is protected by a spinlock, which is not used for
anything else.  Make the code a bit self-explanatory by switching the
variable to an atomic64_t type and kill the spinlock.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ not a performance critical code, use of atomic type is ok ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Anand Jain
401b41e5a8 btrfs: add framework to handle device flush error as a volume
This adds comments to the flush error handling part of the code, and
hopes to maintain the same logic with a framework which can be used to
handle the errors at the volume level.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Daichou
6b349dfe80 Btrfs: remove obsolete FIXMEs in qgroup ioctls
These FIXMEs were already addressed in 2013. All functions check for
qgroup existence:

* btrfs_add_qgroup_relation
* btrfs_ioctl_qgroup_create
* btrfs_limit_qgroup
* btrfs_del_qgroup_relation

Signed-off-by: Daichou <tommy0705c@gmail.com>
[ enhance and reformat changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Dan Carpenter
97d038562a Btrfs: remove an unused variable
"item" is never used.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:57 +02:00
Fabian Frederick
977ec79271 btrfs: kmap() can't fail
Remove NULL test on kmap() as it will always return a valid pointer.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:57 +02:00
Brian Foster
3d4b4a3e30 xfs: remove bli from AIL before release on transaction abort
When a buffer is modified, logged and committed, it ultimately ends
up sitting on the AIL with a dirty bli waiting for metadata
writeback. If another transaction locks and invalidates the buffer
(freeing an inode chunk, for example) in the meantime, the bli is
flagged as stale, the dirty state is cleared and the bli remains in
the AIL.

If a shutdown occurs before the transaction that has invalidated the
buffer is committed, the transaction is ultimately aborted. The log
items are flagged as such and ->iop_unlock() handles the aborted
items. Because the bli is clean (due to the invalidation),
->iop_unlock() unconditionally releases it. The log item may still
reside in the AIL, however, which means the I/O completion handler
may still run and attempt to access it. This results in assert
failure due to the release of the bli while still present in the AIL
and a subsequent NULL dereference and panic in the buffer I/O
completion handling. This can be reproduced by running generic/388
in repetition.

To avoid this problem, update xfs_buf_item_unlock() to first check
whether the bli is aborted and if so, remove it from the AIL before
it is released. This ensures that the bli is no longer accessed
during the shutdown sequence after it has been freed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Brian Foster
79e641ce29 xfs: release bli from transaction properly on fs shutdown
If a filesystem shutdown occurs with a buffer log item in the CIL
and a log force occurs, the ->iop_unpin() handler is generally
expected to tear down the bli properly. This entails freeing the bli
memory and releasing the associated hold on the buffer so it can be
released and the filesystem unmounted.

If this sequence occurs while ->bli_refcount is elevated (i.e.,
another transaction is open and attempting to modify the buffer),
however, ->iop_unpin() may not be responsible for releasing the bli.
Instead, the transaction may release the final ->bli_refcount
reference and thus xfs_trans_brelse() is responsible for tearing
down the bli.

While xfs_trans_brelse() does drop the reference count, it only
attempts to release the bli if it is clean (i.e., not in the
CIL/AIL). If the filesystem is shutdown and the bli is sitting dirty
in the CIL as noted above, this ends up skipping the last
opportunity to release the bli. In turn, this leaves the hold on the
buffer and causes an unmount hang. This can be reproduced by running
generic/388 in repetition.

Update xfs_trans_brelse() to handle this shutdown corner case
correctly. If the final bli reference is dropped and the filesystem
is shutdown, remove the bli from the AIL (if necessary) and release
the bli to drop the buffer hold and ensure an unmount does not hang.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Arnd Bergmann
0cbe48cc58 xfs: avoid harmless gcc-7 warnings
gcc-7 flags the use of integer math inside of a condition
as a potential bug:

fs/xfs/xfs_bmap_util.c: In function 'xfs_swap_extents_check_format':
fs/xfs/xfs_bmap_util.c:1619:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context]
fs/xfs/xfs_bmap_util.c:1629:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context]

There is already a helper function for testing the di_forkoff
field for zero, so let's use that instead to shut up the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Shan Hai
f990fc5ad1 xfs: remove lsn relevant fields from xfs_trans structure and its users
The t_lsn is not used anymore and the t_commit_lsn is used as a tmp
storage for the checkpoint sequence number only in the current code.

And the start/commit lsn are tracked as a transaction group tag in
the xfs_cil_ctx instead of a single transaction, so remove them from
the xfs_trans structure and their users to match with the design.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Christoph Hellwig
3398a4005f xfs: remove XFS_HSIZE
XFS_HSIZE is an extremly confusing way to calculate the size of handle_t.
Given that handle_t always only had two sizes, and one of them isn't
even covered by XFS_HSIZE to start with just remove the macro and use
a constant sizeof expression.

Note that XFS_HSIZE isn't used in xfsprogs, xfsdump or xfstests either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Brian Foster
d4ca1d550d xfs: dump transaction usage details on log reservation overrun
If a transaction log reservation overrun occurs, the ticket data
associated with the reservation is dumped in xfs_log_commit_cil().
This occurs long after the transaction items and details have been
removed from the transaction and effectively lost. This limited set
of ticket data provides very little information to support debugging
transaction overruns based on the typical report.

To improve transaction log reservation overrun reporting, create a
helper to dump transaction details such as log items, log vector
data, etc., as well as the underlying ticket data for the
transaction. Move the overrun detection from xfs_log_commit_cil() to
xlog_cil_insert_items() so it occurs prior to migration of the
logged items to the CIL. Call the new helper such that it is able to
dump this transaction data before it is lost.

Also, warn on overrun to provide callstack context for the offending
transaction and include a few additional messages from
xlog_cil_insert_items() to display the reservation consumed locally
for overhead such as log vector headers, split region headers and
the context ticket. This provides a complete general breakdown of
the reservation consumption of a transaction when/if it happens to
overrun the reservation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Brian Foster
e2f2342639 xfs: refactor xlog_cil_insert_items() to facilitate transaction dump
Transaction reservation overrun detection currently occurs too late
to print useful information about the offending transaction.
Ideally, the transaction data is printed before the associated log
items are moved from the transaction to the CIL, which occurs in
xlog_cil_insert_items(), such that details of the items logged by
the transaction are available for analysis.

Refactor xlog_cil_insert_items() to facilitate moving tx overrun
detection to this function. Update the function to track each bit of
extra log reservation stolen from the transaction (i.e., such as for
the CIL context ticket) and perform the log item migration as the
last operation before the CIL lock is released. This creates a
context where the transaction reservation consumption has been fully
calculated when the log items are moved to the CIL. This patch makes
no functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Brian Foster
7d2d565346 xfs: separate shutdown from ticket reservation print helper
xlog_print_tic_res() pre-dates delayed logging and the committed
items list (CIL) and thus retains some factoring warts, such as hard
coded function names in the output and the fact that it induces a
shutdown.

In preparation for more detailed logging of regular transaction
overrun situations, refactor xlog_print_tic_res() to be slightly
more generic. Reword some of the warning messages and pull the
shutdown into the callers.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Brian Foster
1040960efa xfs: define fatal assert build time tunable
While configurable at runtime, the DEBUG mode assert failure
behavior is usually either desired or not for a particular
situation. For example, developers using kernel modules may prefer
for fatal asserts to remain disabled across module reloads while QE
engineers doing broad regression testing may prefer to have fatal
asserts enabled on boot to facilitate data collection for bug
reports.

To provide a compromise/convenience for developers, create a Kconfig
option that sets the default value of the DEBUG mode 'bug_on_assert'
sysfs tunable. The default behavior remains to trigger kernel BUGs
on assert failures to preserve existing behavior across kernel
configuration updates with DEBUG mode enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Brian Foster
ccdab3d6e8 xfs: define bug_on_assert debug mode sysfs tunable
In DEBUG mode, assert failures unconditionally trigger a kernel BUG.
This is useful in diagnostic situations to panic a system and
collect detailed state information at the time of a failure.

This can also cause problems in cases where DEBUG mode code is
desired but it is preferable not trigger kernel BUGs on assert
failure. For example, during development of new code or during
certain xfstests tests that intentionally cause corruption and test
the kernel for survival (but otherwise may expect to trigger assert
failures).

To provide additional flexibility, create the
<sysfs>/fs/xfs/debug/bug_on_assert tunable to configure assert
failure behavior at runtime. This tunable is only available in DEBUG
mode and is enabled by default to preserve existing default
behavior. When disabled, assert failures in DEBUG mode result in
kernel warnings.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Darrick J. Wong
e1a4e37cc7 xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent
In a pathological scenario where we are trying to bunmapi a single
extent in which every other block is shared, it's possible that trying
to unmap the entire large extent in a single transaction can generate so
many EFIs that we overflow the transaction reservation.

Therefore, use a heuristic to guess at the number of blocks we can
safely unmap from a reflink file's data fork in an single transaction.
This should prevent problems such as the log head slamming into the tail
and ASSERTs that trigger because we've exceeded the transaction
reservation.

Note that since bunmapi can fail to unmap the entire range, we must also
teach the deferred unmap code to roll into a new transaction whenever we
get low on reservation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: random edits, all bugs are my fault]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-19 08:59:10 -07:00
Darrick J. Wong
d205a7d0ec xfs: refactor dir2 leaf readahead shadow buffer cleverness
Currently, the dir2 leaf block getdents function uses a complex state
tracking mechanism to create a shadow copy of the block mappings and
then uses the shadow copy to schedule readahead.  Since the read and
readahead functions are perfectly capable of reading the mappings
themselves, we can tear all that out in favor of a simpler function that
simply keeps pushing the readahead window further out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-06-19 08:59:10 -07:00
Brian Foster
7912e7fef2 xfs: push buffer of flush locked dquot to avoid quotacheck deadlock
Reclaim during quotacheck can lead to deadlocks on the dquot flush
lock:

 - Quotacheck populates a local delwri queue with the physical dquot
   buffers.
 - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
   dirties all of the dquots.
 - Reclaim kicks in and attempts to flush a dquot whose buffer is
   already queud on the quotacheck queue. The flush succeeds but
   queueing to the reclaim delwri queue fails as the backing buffer is
   already queued. The flush unlock is now deferred to I/O completion
   of the buffer from the quotacheck queue.
 - The dqadjust bulkstat continues and dirties the recently flushed
   dquot once again.
 - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
   the flush lock to update the backing buffers with the in-core
   recalculated values. It deadlocks on the redirtied dquot as the
   flush lock was already acquired by reclaim, but the buffer resides
   on the local delwri queue which isn't submitted until the end of
   quotacheck.

This is reproduced by running quotacheck on a filesystem with a
couple million inodes in low memory (512MB-1GB) situations. This is
a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
buffer lists"), which removed a trylock and buffer I/O submission
from the quotacheck dquot flush sequence.

Quotacheck first resets and collects the physical dquot buffers in a
delwri queue. Then, it traverses the filesystem inodes via bulkstat,
updates the in-core dquots, flushes the corrected dquots to the
backing buffers and finally submits the delwri queue for I/O. Since
the backing buffers are queued across the entire quotacheck
operation, dquot reclaim cannot possibly complete a dquot flush
before quotacheck completes.

Therefore, quotacheck must submit the buffer for I/O in order to
cycle the flush lock and flush the dirty in-core dquot to the
buffer. Add a delwri queue buffer push mechanism to submit an
individual buffer for I/O without losing the delwri queue status and
use it from quotacheck to avoid the deadlock. This restores
quotacheck behavior to as before the regression was introduced.

Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Hugh Dickins
1be7107fbe mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 21:50:20 +08:00
NeilBrown
011067b056 blk: replace bioset_create_nobvec() with a flags arg to bioset_create()
"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().

To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().

Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 12:40:59 -06:00
Linus Torvalds
6e20350659 A fix for an old ceph ->fh_to_* bug from Luis and two timestamp
fixups from Zheng, prompted by the ongoing y2038 work.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZRTAEAAoJEEp/3jgCEfOLNwcH/jfzGqhOS262fU5FCCowfNVJ
 9ANzXiRpWykaHR7iPXOTRmRUqCJOCzhogmwzjnl7bQUX7cJPsFN+R7l9KR+dCAUX
 300dplDWZ5oQCX2c7A7vzRCIgv4wjQjtS0mo+dY/EBCNYcynoAUVmbr/87Ezrroi
 qfmA6pnI6hI527RLBkwIObljoAiy11MjQ1xFj0zS2bckWxfCSauO1v1qSpMhawkn
 v4fAWEKz3y8oUG3MtT7/Ukx4/GJAOcksxKZf93AW0sNwQozCxvB40D/Dda3NcT4s
 xYVVymUTYGTg1I/CmZHZxSqJwtKUOZJLwMFTXEFyo6bQH0Vj2pw/HaRf8Q5ksOU=
 =ClP/
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.12-rc6' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A fix for an old ceph ->fh_to_* bug from Luis and two timestamp fixups
  from Zheng, prompted by the ongoing y2038 work"

* tag 'ceph-for-4.12-rc6' of git://github.com/ceph/ceph-client:
  ceph: unify inode i_ctime update
  ceph: use current_kernel_time() to get request time stamp
  ceph: check i_nlink while converting a file handle to dentry
2017-06-18 08:23:02 +09:00
Al Viro
77e9ce327d ufs: fix the logics for tail relocation
* original hysteresis loop got broken by typo back in 2002; now
it never switches out of OPTTIME state.  Fixed.
* critical levels for switching from OPTTIME to OPTSPACE and back
ought to be calculated once, at mount time.
* we should use mul_u64_u32_div() for those calculations, now that
->s_dsize is 64bit.
* to quote Kirk McKusick (in 1995 FreeBSD commit message):
    The threshold for switching from time-space and space-time is too small
    when minfree is 5%...so make it stay at space in this case.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-17 17:22:42 -04:00
Al Viro
c0ef65d292 ufs_iget(): fail with -ESTALE on deleted inode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-17 12:25:58 -04:00
Al Viro
23ac7cba73 fix signedness of timestamps on ufs1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-17 12:25:13 -04:00
Linus Torvalds
adc311034c Changes since last update:
- Fix some bogus ASSERT failures on CONFIG_SMP=n and CONFIG_XFS_DEBUG=y.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZQhB9AAoJEPh/dxk0SrTrMWsP/A1gSHmaS2UI3WvXdqwAxldy
 GiDZ3oSAhn4SF7BY+AhIu+Ot9kwCyOVzd/RZ9zVYmRjTTFxYX7P099uua2e/mf70
 Z1o9oSk9H76srqo7L76h7lI2ONsgd28aqh082hmDrRUhhwy7LZThVV15ZPZFfgU4
 8Q17h0uRUVgSn8M8INsuiMpw1sCLJsXw/9Rb9iFMgi3tSJaGZ1Mm//nA1aeS8HFH
 xCKHYw7YUtVKtIVyVV1NGdtXhXZbNznJXelkUZLMnMgOOmAqWiUa8FOGPNbQEezX
 1VidjzGhxRF7uoUJNnnX5mM+26c8/Ip4cLtqvnFQo30bx+HXO3OR8jywQO+6DD9Q
 NxFHY5D/Peud96xsnq5mRzNMN5fqumyVAyUCXSebT3Oa42HMdva+66s9JoFfDehi
 Z7VmRdoryxexDRbhiQVev4xe20beLtOHAP2I5JrnJddrXb0aFWowLk776wa0uxD8
 7cbKgovikqSHk4/mqpyZ5iQeaufg/kOi2cNaTcAFeCbvbXYieeRXVlisMBh1DKec
 lSX5e4kNNS20VVbCYakAttK69lpZzuraYPDDsnb4HRlmt0VX12lYyqQwklY/YZ9S
 jDagtKWKDm/L/jq2j5Nd3uSycM+lMaq97mIMjzPRrnPjOriME1ZGLwEQbKfBLXnW
 3Flzt5C2Hk1Fb/VlNx4S
 =U99d
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Darrick Wong:
 "One more bugfix for you for 4.12-rc6 to fix something that came up in
  an earlier rc:

   - Fix some bogus ASSERT failures on CONFIG_SMP=n and CONFIG_XFS_DEBUG=y"

* tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix spurious spin_is_locked() assert failures on non-smp kernels
2017-06-17 17:34:41 +09:00
Linus Torvalds
c8636b90a0 Merge branch 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ufs fixes from Al Viro:
 "Fix assorted ufs bugs: a couple of deadlocks, fs corruption in
  truncate(), oopsen on tail unpacking and truncate when racing with
  vmscan, mild fs corruption (free blocks stats summary buggered, *BSD
  fsck would complain and fix), several instances of broken logics
  around reserved blocks (starting with "check almost never triggers
  when it should" and then there are issues with sufficiently large
  UFS2)"

[ Note: ufs hasn't gotten any loving in a long time, because nobody
  really seems to use it. These ufs fixes are triggered by people
  actually caring now, not some sudden influx of new bugs.  - Linus ]

* 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ufs_truncate_blocks(): fix the case when size is in the last direct block
  ufs: more deadlock prevention on tail unpacking
  ufs: avoid grabbing ->truncate_mutex if possible
  ufs_get_locked_page(): make sure we have buffer_heads
  ufs: fix s_size/s_dsize users
  ufs: fix reserved blocks check
  ufs: make ufs_freespace() return signed
  ufs: fix logics in "ufs: make fsck -f happy"
2017-06-17 17:30:07 +09:00
Linus Torvalds
ccd3d905f7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "A couple of fixes; a leak in mntns_install() caught by Andrei (this
  cycle regression) + d_invalidate() softlockup fix - that had been
  reported by a bunch of people lately, but the problem is pretty old"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: don't forget to put old mntns in mntns_install
  Hang/soft lockup in d_invalidate with simultaneous calls
2017-06-17 17:26:53 +09:00
Andrea Arcangeli
64c2b20301 userfaultfd: shmem: handle coredumping in handle_userfault()
Anon and hugetlbfs handle FOLL_DUMP set by get_dump_page() internally to
__get_user_pages().

shmem as opposed has no special FOLL_DUMP handling there so
handle_mm_fault() is invoked without mmap_sem and ends up calling
handle_userfault() that isn't expecting to be invoked without mmap_sem
held.

This makes handle_userfault() fail immediately if invoked through
shmem_vm_ops->fault during coredumping and solves the problem.

The side effect is a BUG_ON with no lock held triggered by the
coredumping process which exits.  Only 4.11 is affected, pre-4.11 anon
memory holes are skipped in __get_user_pages by checking FOLL_DUMP
explicitly against empty pagetables (mm/gup.c:no_page_table()).

It's zero cost as we already had a check for current->flags to prevent
futex to trigger userfaults during exit (PF_EXITING).

Link: http://lkml.kernel.org/r/20170615214838.27429-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: <stable@vger.kernel.org>	[4.11+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-17 06:37:05 +09:00
Linus Torvalds
ab2789b72d A fix from Nic for a race seen in production (including a stable tag).
And while I'm sending you this I'm also sneaking in a trivial new helper
 from Bart so that we don't need inter-tree dependencies for the next merge
 window.
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAllDnicLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOkPA//TMmDanqxLjjz12m9TiQoCjo/iFCtv9KpuJH/rdCz
 EnWK1GdGtWhR3Z1uk/Ss3zbBA/CwfUR/urVdc1P/aefLoVmsYOWQi1jsPHCHtFG6
 zkDYHr7qYqu91otaO0HgFrcOpuJe+LdbhwZndvUiTYJN8vNMRnQAnKdiEUEKmArq
 dBUj/H0JTbQwSXHZat2ZS9PwHsm7RGO+0qeixxc/HE730LF0TEwnteoy9jlu5d7U
 v1RZs9/zszmvQpWU34vPHCVH/sNfTMdVGPzc9+WNrOoxjM9vmhEOE0jTiclOcsCK
 sMAYHCG7woxkCPVZmxqgLx6P/9zZav6L2NZFPcT3z4jFq5Um+ugJ691f1oHaTq+L
 Bnn1DJdTl50wtMnb7yS1Uux+Y0OswKAXvDdC6NFPGJWwEnG41K3oL78Pq/vN7bKV
 ynKxRZciIsy/9S/Oyzp0oYV+l/cyScPVe/KfUN4zvIALi/mltMkAXYaZMEZDp7Vo
 w2TeJO7Nr3O75ghw/yCFHTWMAVbrTJg/ma1rkdUeekKYXix+4Bpr2XYqA3HHZCQY
 06pvIH+fZs1XshFlCs3RoWXvjdfjDgIO8zjrvSkTs8WUK4AxVNXIDtPDA6fpzcGz
 yZEehpdbPWPDvdd1C7TzEAi6lgOV/W5AsPUfk5KbLOaFzKWRe+FYtzDykGwamYeP
 Ov8=
 =NGL4
 -----END PGP SIGNATURE-----

Merge tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfs

Pull configfs updates from Christoph Hellwig:
 "A fix from Nic for a race seen in production (including a stable tag).

  And while I'm sending you this I'm also sneaking in a trivial new
  helper from Bart so that we don't need inter-tree dependencies for the
  next merge window"

* tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfs:
  configfs: Introduce config_item_get_unless_zero()
  configfs: Fix race between create_link and configfs_rmdir
2017-06-16 18:45:47 +09:00
Christoph Hellwig
20223f0f39 fs: pass on flags in compat_writev
Fixes: 793b80ef14 ("vfs: pass a flags argument to vfs_readv/vfs_writev")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-16 18:40:51 +09:00
Dan Williams
81f558701a x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush
The clear_pmem() helper simply combines a memset() plus a cache flush.
Now that the flush routine is optionally provided by the dax device
driver we can avoid unnecessary cache management on dax devices fronting
volatile memory.

With clear_pmem() gone we can follow on with a patch to make pmem cache
management completely defined within the pmem driver.

Cc: <x86@kernel.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:35:24 -07:00
Dan Williams
6318770a7d filesystem-dax: convert to dax_flush()
Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so the dax_flush()
helper skips cache management work when the underlying driver does not
specify a flush method.

We still do all the dirty tracking since the radix entry will already be
there for locking purposes. However, the work to clean the entry will be
a nop for some dax drivers.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:35:24 -07:00
Dan Williams
fec53774fd filesystem-dax: convert to dax_copy_from_iter()
Now that all possible providers of the dax_operations copy_from_iter
method are implemented, switch filesytem-dax to call the driver rather
than copy_to_iter_pmem.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:34:59 -07:00
David S. Miller
0ddead90b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The conflicts were two cases of overlapping changes in
batman-adv and the qed driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 11:59:32 -04:00
Andrei Vagin
4068367c9c fs: don't forget to put old mntns in mntns_install
Fixes: 4f757f3cbf ("make sure that mntns_install() doesn't end up with referral for root")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 06:53:05 -04:00
Al Viro
81be24d263 Hang/soft lockup in d_invalidate with simultaneous calls
It's not hard to trigger a bunch of d_invalidate() on the same
dentry in parallel.  They end up fighting each other - any
dentry picked for removal by one will be skipped by the rest
and we'll go for the next iteration through the entire
subtree, even if everything is being skipped.  Morevoer, we
immediately go back to scanning the subtree.  The only thing
we really need is to dissolve all mounts in the subtree and
as soon as we've nothing left to do, we can just unhash the
dentry and bugger off.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 06:52:09 -04:00
Linus Torvalds
54ed0f71f0 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
 "This fixes a bug on sparc where we may dereference freed stack memory"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: Work around deallocated stack frame reference gcc bug on sparc.
2017-06-15 17:54:51 +09:00
Al Viro
a8fad98483 ufs_truncate_blocks(): fix the case when size is in the last direct block
The logics when deciding whether we need to do anything with direct blocks
is broken when new size is within the last direct block.  It's better to
find the path to the last byte _not_ to be removed and use that instead
of the path to the beginning of the first block to be freed...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 03:57:46 -04:00
Al Viro
289dec5b89 ufs: more deadlock prevention on tail unpacking
->s_lock is not needed for ufs_change_blocknr()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 00:42:56 -04:00
Al Viro
09bf4f5b6e ufs: avoid grabbing ->truncate_mutex if possible
tail unpacking is done in a wrong place; the deadlocks galore
is best dealt with by doing that in ->write_iter() (and switching
to iomap, while we are at it), but that's rather painful to
backport.  The trouble comes from grabbing pages that cover
the beginning of tail from inside of ufs_new_fragments(); ongoing
pageout of any of those is going to deadlock on ->truncate_mutex
with process that got around to extending the tail holding that
and waiting for page to get unlocked, while ->writepage() on
that page is waiting on ->truncate_mutex.

The thing is, we don't need ->truncate_mutex when the fragment
we are trying to map is within the tail - the damn thing is
allocated (tail can't contain holes).

Let's do a plain lookup and if the fragment is present, we can
just pretend that we'd won the race in almost all cases.  The
only exception is a fragment between the end of tail and the
end of block containing tail.

Protect ->i_lastfrag with ->meta_lock - read_seqlock_excl() is
sufficient.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 00:41:18 -04:00
Al Viro
267309f394 ufs_get_locked_page(): make sure we have buffer_heads
callers rely upon that, but find_lock_page() racing with attempt of
page eviction by memory pressure might have left us with
	* try_to_free_buffers() successfully done
	* __remove_mapping() failed, leaving the page in our mapping
	* find_lock_page() returning an uptodate page with no
buffer_heads attached.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 23:32:19 -04:00
Al Viro
c596961d1b ufs: fix s_size/s_dsize users
For UFS2 we need 64bit variants; we even store them in uspi, but
use 32bit ones instead.  One wrinkle is in handling of reserved
space - recalculating it every time had been stupid all along, but
now it would become really ugly.  Just calculate it once...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 16:43:03 -04:00
Al Viro
b451cec4bb ufs: fix reserved blocks check
a) honour ->s_minfree; don't just go with default (5)
b) don't bother with capability checks until we know we'll need them

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 15:46:05 -04:00
Al Viro
fffd70f588 ufs: make ufs_freespace() return signed
as it is, checking that its return value is <= 0 is useless and
that's how it's being used.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 15:36:31 -04:00
Al Viro
96ecff1422 ufs: fix logics in "ufs: make fsck -f happy"
Storing stats _only_ at new locations is wrong for UFS1; old
locations should always be kept updated.  The check for "has
been converted to use of new locations" is also wrong - it
should be "->fs_maxbsize is equal to ->fs_bsize".

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 15:17:32 -04:00
Yan, Zheng
4ca2fea6f8 ceph: unify inode i_ctime update
Current __ceph_setattr() can set inode's i_ctime to current_time(),
req->r_stamp or attr->ia_ctime. These time stamps may have minor
differences. It may cause potential problem.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-06-14 19:37:23 +02:00
Yan, Zheng
56199016e8 ceph: use current_kernel_time() to get request time stamp
ceph uses ktime_get_real_ts() to get request time stamp. In most
other cases, current_kernel_time() is used to get time stamp for
filesystem operations (called by current_time()).

There is granularity difference between ktime_get_real_ts() and
current_kernel_time(). The later one can be up to one jiffy behind
the former one. This can causes inode's ctime to go back.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-06-14 19:33:23 +02:00
Luis Henriques
03f219041f ceph: check i_nlink while converting a file handle to dentry
Converting a file handle to a dentry can be done call after the inode
unlink.  This means that __fh_to_dentry() requires an extra check to
verify the number of links is not 0.

The issue can be easily reproduced using xfstest generic/426, which does
something like:

    name_to_handle_at(&fh)
    echo 3 > /proc/sys/vm/drop_caches
    unlink()
    open_by_handle_at(&fh)

The call to open_by_handle_at() should fail, as the file doesn't exist
anymore.

Link: http://tracker.ceph.com/issues/19958
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-06-14 19:32:43 +02:00
Jeff Layton
f73127356f fs/fcntl: return -ESRCH in f_setown when pid/pgid can't be found
The current implementation of F_SETOWN doesn't properly vet the argument
passed in and only returns an error if INT_MIN is passed in. If the
argument doesn't specify a valid pid/pgid, then we just end up cleaning
out the file->f_owner structure.

What we really want is to only clean that out only in the case where
userland passed in an argument of 0. For anything else, we want to
return ESRCH if it doesn't refer to a valid pid.

The relevant POSIX spec page is here:

    http://pubs.opengroup.org/onlinepubs/9699919799/functions/fcntl.html

Cc: Jiri Slaby <jslaby@suse.cz>
Cc: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-06-14 09:11:54 -04:00
Jiri Slaby
fc3dc67471 fs/fcntl: f_setown, avoid undefined behaviour
fcntl(0, F_SETOWN, 0x80000000) triggers:
UBSAN: Undefined behaviour in fs/fcntl.c:118:7
negation of -2147483648 cannot be represented in type 'int':
CPU: 1 PID: 18261 Comm: syz-executor Not tainted 4.8.1-0-syzkaller #1
...
Call Trace:
...
 [<ffffffffad8f0868>] ? f_setown+0x1d8/0x200
 [<ffffffffad8f19a9>] ? SyS_fcntl+0x999/0xf30
 [<ffffffffaed1fb00>] ? entry_SYSCALL_64_fastpath+0x23/0xc1

Fix that by checking the arg parameter properly (against INT_MAX) before
"who = -who". And return immediatelly with -EINVAL in case it is wrong.
Note that according to POSIX we can return EINVAL:
    http://pubs.opengroup.org/onlinepubs/9699919799/functions/fcntl.html

    [EINVAL]
        The cmd argument is F_SETOWN and the value of the argument
        is not valid as a process or process group identifier.

[v2] returns an error, v1 used to fail silently
[v3] implement proper check for the bad value INT_MIN

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-06-14 08:46:45 -04:00
Jiri Slaby
393cc3f511 fs/fcntl: f_setown, allow returning error
Allow f_setown to return an error value. We will fail in the next patch
with EINVAL for bad input to f_setown, so tile the path for the later
patch.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-06-14 08:46:36 -04:00
Jan Kara
fd3cfad374 udf: Convert udf_disk_stamp_to_time() to use mktime64()
Convert udf_disk_stamp_to_time() to use mktime64() to simplify the code.
As a bonus we get working timestamp conversion for dates before epoch
and after 2038 (both of which are allowed by UDF standard).

Signed-off-by: Jan Kara <jack@suse.cz>
2017-06-14 11:21:02 +02:00
Jan Kara
3c399fa40f udf: Use time64_to_tm for timestamp conversion
UDF on-disk time stamp is stored in a form very similar to struct tm.
Use time64_to_tm() for conversion of seconds since epoch to year, month,
... format and then just copy this as necessary to UDF on-disk
structure to simplify the code.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-06-14 11:21:02 +02:00
Jan Kara
f2e9535589 udf: Fix deadlock between writeback and udf_setsize()
udf_setsize() called truncate_setsize() with i_data_sem held. Thus
truncate_pagecache() called from truncate_setsize() could lock a page
under i_data_sem which can deadlock as page lock ranks below
i_data_sem - e. g. writeback can hold page lock and try to acquire
i_data_sem to map a block.

Fix the problem by moving truncate_setsize() calls from under
i_data_sem. It is safe for us to change i_size without holding
i_data_sem as all the places that depend on i_size being stable already
hold inode_lock.

CC: stable@vger.kernel.org
Fixes: 7e49b6f248
Signed-off-by: Jan Kara <jack@suse.cz>
2017-06-14 11:21:01 +02:00
Jan Kara
146c4ad6ec udf: Use i_size_read() in udf_adinicb_writepage()
We don't hold inode_lock in udf_adinicb_writepage() so use i_size_read()
to get i_size. This cannot cause real problems is i_size is guaranteed
to be small but let's be careful.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-06-14 11:21:01 +02:00
Jan Kara
9795e0e8ac udf: Fix races with i_size changes during readpage
__udf_adinicb_readpage() uses i_size several times. When truncate
changes i_size while the function is running, it can observe several
different values and thus e.g. expose uninitialized parts of page to
userspace. Also use i_size_read() in the function since it does not hold
inode_lock. Since i_size is guaranteed to be small, this cannot really
cause any issues even on 32-bit archs but let's be careful.

CC: stable@vger.kernel.org
Fixes: 9c2fc0de1a
Signed-off-by: Jan Kara <jack@suse.cz>
2017-06-14 11:21:01 +02:00
Jan Kara
a247f7236d udf: Remove unused UDF_DEFAULT_BLOCKSIZE
The define is unused. Remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-06-13 14:59:14 +02:00
Christoph Hellwig
fdd050b5b3 Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into nvme-base 2017-06-13 11:45:14 +02:00
Bob Peterson
df68f20f56 GFS2: Remove gl_list from glock structure
The gl_list is no longer used nor needed in the glock structure,
so this patch eliminates it.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-06-12 14:39:12 -05:00
Bob Peterson
d87d62b75d GFS2: Withdraw when directory entry inconsistencies are detected
This patch prints an inode consistency error and withdraws the file
system when directory entry counts are mismatched.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-06-12 14:38:53 -05:00
Jens Axboe
8f66439eec Linux 4.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
 biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
 kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
 4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
 2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
 W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
 =m2Gx
 -----END PGP SIGNATURE-----

Merge tag 'v4.12-rc5' into for-4.13/block

We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.

Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-12 08:30:13 -06:00
Bart Van Assche
19e72d3abb configfs: Introduce config_item_get_unless_zero()
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
[hch: minor style tweak]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-12 13:20:20 +02:00
Nicholas Bellinger
ba80aa909c configfs: Fix race between create_link and configfs_rmdir
This patch closes a long standing race in configfs between
the creation of a new symlink in create_link(), while the
symlink target's config_item is being concurrently removed
via configfs_rmdir().

This can happen because the symlink target's reference
is obtained by config_item_get() in create_link() before
the CONFIGFS_USET_DROPPING bit set by configfs_detach_prep()
during configfs_rmdir() shutdown is actually checked..

This originally manifested itself on ppc64 on v4.8.y under
heavy load using ibmvscsi target ports with Novalink API:

[ 7877.289863] rpadlpar_io: slot U8247.22L.212A91A-V1-C8 added
[ 7879.893760] ------------[ cut here ]------------
[ 7879.893768] WARNING: CPU: 15 PID: 17585 at ./include/linux/kref.h:46 config_item_get+0x7c/0x90 [configfs]
[ 7879.893811] CPU: 15 PID: 17585 Comm: targetcli Tainted: G           O 4.8.17-customv2.22 #12
[ 7879.893812] task: c00000018a0d3400 task.stack: c0000001f3b40000
[ 7879.893813] NIP: d000000002c664ec LR: d000000002c60980 CTR: c000000000b70870
[ 7879.893814] REGS: c0000001f3b43810 TRAP: 0700   Tainted: G O     (4.8.17-customv2.22)
[ 7879.893815] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28222242  XER: 00000000
[ 7879.893820] CFAR: d000000002c664bc SOFTE: 1
                GPR00: d000000002c60980 c0000001f3b43a90 d000000002c70908 c0000000fbc06820
                GPR04: c0000001ef1bd900 0000000000000004 0000000000000001 0000000000000000
                GPR08: 0000000000000000 0000000000000001 d000000002c69560 d000000002c66d80
                GPR12: c000000000b70870 c00000000e798700 c0000001f3b43ca0 c0000001d4949d40
                GPR16: c00000014637e1c0 0000000000000000 0000000000000000 c0000000f2392940
                GPR20: c0000001f3b43b98 0000000000000041 0000000000600000 0000000000000000
                GPR24: fffffffffffff000 0000000000000000 d000000002c60be0 c0000001f1dac490
                GPR28: 0000000000000004 0000000000000000 c0000001ef1bd900 c0000000f2392940
[ 7879.893839] NIP [d000000002c664ec] config_item_get+0x7c/0x90 [configfs]
[ 7879.893841] LR [d000000002c60980] check_perm+0x80/0x2e0 [configfs]
[ 7879.893842] Call Trace:
[ 7879.893844] [c0000001f3b43ac0] [d000000002c60980] check_perm+0x80/0x2e0 [configfs]
[ 7879.893847] [c0000001f3b43b10] [c000000000329770] do_dentry_open+0x2c0/0x460
[ 7879.893849] [c0000001f3b43b70] [c000000000344480] path_openat+0x210/0x1490
[ 7879.893851] [c0000001f3b43c80] [c00000000034708c] do_filp_open+0xfc/0x170
[ 7879.893853] [c0000001f3b43db0] [c00000000032b5bc] do_sys_open+0x1cc/0x390
[ 7879.893856] [c0000001f3b43e30] [c000000000009584] system_call+0x38/0xec
[ 7879.893856] Instruction dump:
[ 7879.893858] 409d0014 38210030 e8010010 7c0803a6 4e800020 3d220000 e94981e0 892a0000
[ 7879.893861] 2f890000 409effe0 39200001 992a0000 <0fe00000> 4bffffd0 60000000 60000000
[ 7879.893866] ---[ end trace 14078f0b3b5ad0aa ]---

To close this race, go ahead and obtain the symlink's target
config_item reference only after the existing CONFIGFS_USET_DROPPING
check succeeds.

This way, if configfs_rmdir() wins create_link() will return -ENONET,
and if create_link() wins configfs_rmdir() will return -EBUSY.

Reported-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Tested-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
2017-06-12 13:20:10 +02:00
Linus Torvalds
5e38b72ac1 Fix various bug fixes in ext4 caused by races and memory allocation
failures.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlk9it4ACgkQ8vlZVpUN
 gaNE2wgAiuo9pHO5cUzRhP5HG8Jz4vE2l6mpuPq2JbhT+xGFbvjX+UhA+DN6Cw35
 EK+MnBC6h2wFoQrEOLbavNWc94nb6HZWw2riheK4sst80hBpeclwInCVCw1DLmu6
 +Nx8fzXVqjSf57Qi8CR09AGovSFBgfAJFi8aJTe8KXaSRPx48bWFxK6glgIG5gRw
 VEOyEg5kPRYoUNA8ewsunC67V32ljYF2IZSzTTlrcqaLsXi+SWeiJ2hceYfgI0qz
 fZB1EFBvmGBsdsFM2SHiJpME6fzMEQqx7oTyNC8DhCxMRVd29WjxeqCwcU4nV0Q5
 jJaCKJftJ/q/eLKn7ksezhoL0A96BA==
 =Z9eH
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix various bug fixes in ext4 caused by races and memory allocation
  failures"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix fdatasync(2) after extent manipulation operations
  ext4: fix data corruption for mmap writes
  ext4: fix data corruption with EXT4_GET_BLOCKS_ZERO
  ext4: fix quota charging for shared xattr blocks
  ext4: remove redundant check for encrypted file on dio write path
  ext4: remove unused d_name argument from ext4_search_dir() et al.
  ext4: fix off-by-one error when writing back pages before dio read
  ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()
  ext4: keep existing extra fields when inode expands
  ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors
  ext4: fix off-by-in in loop termination in ext4_find_unwritten_pgoff()
  ext4: fix SEEK_HOLE
  jbd2: preserve original nofs flag during journal restart
  ext4: clear lockdep subtype for quota files on quota off
2017-06-11 11:57:47 -07:00
Linus Torvalds
5faab9e0f0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull UFS fixes from Al Viro:
 "This is just the obvious backport fodder; I'm pretty sure that there
  will be more - definitely so wrt performance and quite possibly
  correctness as well"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ufs: we need to sync inode before freeing it
  excessive checks in ufs_write_failed() and ufs_evict_inode()
  ufs_getfrag_block(): we only grab ->truncate_mutex on block creation path
  ufs_extend_tail(): fix the braino in calling conventions of ufs_new_fragments()
  ufs: set correct ->s_maxsize
  ufs: restore maintaining ->i_blocks
  fix ufs_isblockset()
  ufs: restore proper tail allocation
2017-06-10 11:09:23 -07:00
Linus Torvalds
66cea28a94 Merge branch 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Some fixes that Dave Sterba collected.

  We've been hitting an early enospc problem on production machines that
  Omar tracked down to an old int->u64 mistake. I waited a bit on this
  pull to make sure it was really the problem from production, but it's
  on ~2100 hosts now and I think we're good.

  Omar also noticed a commit in the queue would make new early ENOSPC
  problems. I pulled that out for now, which is why the top three
  commits are younger than the rest.

  Otherwise these are all fixes, some explaining very old bugs that
  we've been poking at for a while"

* 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix delalloc accounting leak caused by u32 overflow
  Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io
  btrfs: tree-log.c: Wrong printk information about namelen
  btrfs: fix race with relocation recovery and fs_root setup
  btrfs: fix memory leak in update_space_info failure path
  btrfs: use correct types for page indices in btrfs_page_exists_in_range
  btrfs: fix incorrect error return ret being passed to mapping_set_error
  btrfs: Make flush bios explicitely sync
  btrfs: fiemap: Cache and merge fiemap extent before submit it to user
2017-06-10 11:06:05 -07:00
Al Viro
67a70017fa ufs: we need to sync inode before freeing it
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-10 12:02:28 -04:00
Al Viro
464d62421c select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 23:56:19 -04:00
Al Viro
babef37dcc excessive checks in ufs_write_failed() and ufs_evict_inode()
As it is, short copy in write() to append-only file will fail
to truncate the excessive allocated blocks.  As the matter of
fact, all checks in ufs_truncate_blocks() are either redundant
or wrong for that caller.  As for the only other caller
(ufs_evict_inode()), we only need the file type checks there.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
Al Viro
006351ac8e ufs_getfrag_block(): we only grab ->truncate_mutex on block creation path
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
Al Viro
940ef1a0ed ufs_extend_tail(): fix the braino in calling conventions of ufs_new_fragments()
... and it really needs splitting into "new" and "extend" cases, but that's for
later

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
Al Viro
6b0d144fa7 ufs: set correct ->s_maxsize
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
Al Viro
eb315d2ae6 ufs: restore maintaining ->i_blocks
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
Al Viro
414cf7186d fix ufs_isblockset()
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
Al Viro
8785d84d00 ufs: restore proper tail allocation
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
Scott Mayhew
0b4d3452b8 security/selinux: allow security_sb_clone_mnt_opts to enable/disable native labeling behavior
When an NFSv4 client performs a mount operation, it first mounts the
NFSv4 root and then does path walk to the exported path and performs a
submount on that, cloning the security mount options from the root's
superblock to the submount's superblock in the process.

Unless the NFS server has an explicit fsid=0 export with the
"security_label" option, the NFSv4 root superblock will not have
SBLABEL_MNT set, and neither will the submount superblock after cloning
the security mount options.  As a result, setxattr's of security labels
over NFSv4.2 will fail.  In a similar fashion, NFSv4.2 mounts mounted
with the context= mount option will not show the correct labels because
the nfs_server->caps flags of the cloned superblock will still have
NFS_CAP_SECURITY_LABEL set.

Allowing the NFSv4 client to enable or disable SECURITY_LSM_NATIVE_LABELS
behavior will ensure that the SBLABEL_MNT flag has the correct value
when the client traverses from an exported path without the
"security_label" option to one with the "security_label" option and
vice versa.  Similarly, checking to see if SECURITY_LSM_NATIVE_LABELS is
set upon return from security_sb_clone_mnt_opts() and clearing
NFS_CAP_SECURITY_LABEL if necessary will allow the correct labels to
be displayed for NFSv4.2 mounts mounted with the context= mount option.

Resolves: https://github.com/SELinuxProject/selinux-kernel/issues/35

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Stephen Smalley <sds@tycho.nsa.gov>
Tested-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-06-09 16:17:47 -04:00
Omar Sandoval
70e7af244f Btrfs: fix delalloc accounting leak caused by u32 overflow
btrfs_calc_trans_metadata_size() does an unsigned 32-bit multiplication,
which can overflow if num_items >= 4 GB / (nodesize * BTRFS_MAX_LEVEL * 2).
For a nodesize of 16kB, this overflow happens at 16k items. Usually,
num_items is a small constant passed to btrfs_start_transaction(), but
we also use btrfs_calc_trans_metadata_size() for metadata reservations
for extent items in btrfs_delalloc_{reserve,release}_metadata().

In drop_outstanding_extents(), num_items is calculated as
inode->reserved_extents - inode->outstanding_extents. The difference
between these two counters is usually small, but if many delalloc
extents are reserved and then the outstanding extents are merged in
btrfs_merge_extent_hook(), the difference can become large enough to
overflow in btrfs_calc_trans_metadata_size().

The overflow manifests itself as a leak of a multiple of 4 GB in
delalloc_block_rsv and the metadata bytes_may_use counter. This in turn
can cause early ENOSPC errors. Additionally, these WARN_ONs in
extent-tree.c will be hit when unmounting:

    WARN_ON(fs_info->delalloc_block_rsv.size > 0);
    WARN_ON(fs_info->delalloc_block_rsv.reserved > 0);
    WARN_ON(space_info->bytes_pinned > 0 ||
            space_info->bytes_reserved > 0 ||
            space_info->bytes_may_use > 0);

Fix it by casting nodesize to a u64 so that
btrfs_calc_trans_metadata_size() does a full 64-bit multiplication.
While we're here, do the same in btrfs_calc_trunc_metadata_size(); this
can't overflow with any existing uses, but it's better to be safe here
than have another hard-to-debug problem later on.

Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-06-09 12:48:36 -07:00
Liu Bo
452e62b71f Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io
Before this, we use 'filled' mode here, ie. if all range has been
filled with EXTENT_DEFRAG bits, get to clear it, but if the defrag
range joins the adjacent delalloc range, then we'll have EXTENT_DEFRAG
bits in extent_state until releasing this inode's pages, and that
prevents extent_data from being freed.

This clears the bit if any was found within the ordered extent.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-06-09 12:48:29 -07:00
Su Yue
286b92f43c btrfs: tree-log.c: Wrong printk information about namelen
In verify_dir_item, it wants to printk name_len of dir_item but
printk data_len acutally.

Fix it by calling btrfs_dir_name_len instead of btrfs_dir_data_len.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-06-09 12:48:07 -07:00
Christoph Hellwig
4e4cbee93d block: switch bios to blk_status_t
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
36ffc6c1c0 block_dev: propagate bio_iov_iter_get_pages error in __blkdev_direct_IO
Once we move the block layer to its own status code we'll still want to
propagate the bio_iov_iter_get_pages, so restructure __blkdev_direct_IO
to take ret into account when returning the errno.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
d5245d7674 fs: simplify dio_bio_complete
Only read bio->bi_error once in the common path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
4055351cdb fs: remove the unused error argument to dio_end_io()
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
f729b66fca gfs2: remove the unused sd_log_error field
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Aleksa Sarai
5f0f187fd0 tty: add compat_ioctl callbacks
In order to avoid future diversions between fs/compat_ioctl.c and
drivers/tty/pty.c, define .compat_ioctl callbacks for the relevant
tty_operations structs. Since both pty_unix98_ioctl() and
pty_bsd_ioctl() are compatible between 32-bit and 64-bit userspace no
special translation is required.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-09 11:27:20 +02:00
Brian Foster
95989c46d2 xfs: fix spurious spin_is_locked() assert failures on non-smp kernels
The 0-day kernel test robot reports assertion failures on
!CONFIG_SMP kernels due to failed spin_is_locked() checks. As it
turns out, spin_is_locked() is hardcoded to return zero on
!CONFIG_SMP kernels and so this function cannot be relied on to
verify spinlock state in this configuration.

To avoid this problem, replace the associated asserts with lockdep
variants that do the right thing regardless of kernel configuration.
Drop the one assert that checks for an unlocked lock as there is no
suitable lockdep variant for that case. This moves the spinlock
checks from XFS debug code to lockdep, but generally provides the
same level of protection.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-08 08:23:07 -07:00
David Miller
d41519a69b crypto: Work around deallocated stack frame reference gcc bug on sparc.
On sparc, if we have an alloca() like situation, as is the case with
SHASH_DESC_ON_STACK(), we can end up referencing deallocated stack
memory.  The result can be that the value is clobbered if a trap
or interrupt arrives at just the right instruction.

It only occurs if the function ends returning a value from that
alloca() area and that value can be placed into the return value
register using a single instruction.

For example, in lib/libcrc32c.c:crc32c() we end up with a return
sequence like:

        return  %i7+8
         lduw   [%o5+16], %o0   ! MEM[(u32 *)__shash_desc.1_10 + 16B],

%o5 holds the base of the on-stack area allocated for the shash
descriptor.  But the return released the stack frame and the
register window.

So if an intererupt arrives between 'return' and 'lduw', then
the value read at %o5+16 can be corrupted.

Add a data compiler barrier to work around this problem.  This is
exactly what the gcc fix will end up doing as well, and it absolutely
should not change the code generated for other cpus (unless gcc
on them has the same bug :-)

With crucial insight from Eric Sandeen.

Cc: <stable@vger.kernel.org>
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-06-08 17:36:03 +08:00
David Howells
e754eba685 rxrpc: Provide a cmsg to specify the amount of Tx data for a call
Provide a control message that can be specified on the first sendmsg() of a
client call or the first sendmsg() of a service response to indicate the
total length of the data to be transmitted for that call.

Currently, because the length of the payload of an encrypted DATA packet is
encrypted in front of the data, the packet cannot be encrypted until we
know how much data it will hold.

By specifying the length at the beginning of the transmit phase, each DATA
packet length can be set before we start loading data from userspace (where
several sendmsg() calls may contribute to a particular packet).

An error will be returned if too little or too much data is presented in
the Tx phase.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-06-07 17:15:46 +01:00
Benjamin Coddington
501e7a4689 NFSv4.2: Don't send mode again in post-EXCLUSIVE4_1 SETATTR with umask
Now that we have umask support, we shouldn't re-send the mode in a SETATTR
following an exclusive CREATE, or we risk having the same problem fixed in
commit 5334c5bdac ("NFS: Send attributes in OPEN request for
NFS4_CREATE_EXCLUSIVE4_1"), which is that files with S_ISGID will have that
bit stripped away.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: dff25ddb48 ("nfs: add support for the umask attribute")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-06-05 12:23:15 -04:00
Christoph Hellwig
01633fd254 overlayfs: use uuid_t instead of uuid_be
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:13 +02:00
Christoph Hellwig
85787090a2 fs: switch ->s_uuid to uuid_t
For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers.  More to come..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:12 +02:00
Amir Goldstein
d905fdaaa7 xfs: use the common helper uuid_is_null()
Use the common helper uuid_is_null() and remove the xfs specific
helper uuid_is_nil().

The common helper does not check for the NULL pointer value as
xfs helper did, but xfs code never calls the helper with a pointer
that can be NULL.

Conform comments and warning strings to use the term 'null uuid'
instead of 'nil uuid', because this is the terminology used by
lib/uuid.c and its users. It is also the terminology used in
userspace by libuuid and xfsprogs.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
[hch: remove now unused uuid.[ch]]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:08 +02:00
Christoph Hellwig
cb0ba6cc22 xfs: remove uuid_getnodeuniq and xfs_uu_t
Opencode uuid_getnodeuniq in the only caller, and directly decode
the uuid_t representation instead of using a structure cast for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-05 16:59:07 +02:00
Christoph Hellwig
df33767d9f uuid: hoist helpers uuid_equal() and uuid_copy() from xfs
These helper are used to compare and copy two uuid_t type objects.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
[hch: also provide the respective guid_ versions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:04 +02:00
Christoph Hellwig
f9727a17db uuid: rename uuid types
Our "little endian" UUID really is a Wintel GUID, so rename it and its
helpers such (guid_t).  The big endian UUID is the only true one, so
give it the name uuid_t.  The uuid_le and uuid_be names are retained for
now, but will hopefully go away soon.  The exception to that are the _cmp
helpers that will be replaced by better primitives ASAP and thus don't
get the new names.

Also the _to_bin helpers are named to match the better named uuid_parse
routine in userspace.

Also remove the existing typedef in XFS that's now been superceeded by
the generic type name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[andy: also update the UUID_LE/UUID_BE macros including fallout]
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-05 16:58:59 +02:00
Christoph Hellwig
12ce5f8c5c nfsd: namespace-prefix uuid_parse
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-05 16:56:38 +02:00
Christoph Hellwig
b1f359f980 xfs: use uuid_be to implement the uuid_t type
Use the generic Linux definition to implement our UUID type, this will
allow using more generic infrastructure in the future.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-05 16:56:36 +02:00
Amir Goldstein
dfd7487e99 xfs: use uuid_copy() helper to abstract uuid_t
uuid_t definition is about to change.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-05 16:56:35 +02:00
Christoph Hellwig
41bb26f8db uuid,afs: move struct uuid_v1 back into afs
This essentially is a partial revert of commit ff548773
("afs: Move UUID struct to linux/uuid.h") and moves struct uuid_v1 back into
fs/afs as struct afs_uuid.  It however keeps it as big endian structure
so that we can use the normal uuid generation helpers when casting to/from
struct afs_uuid.

The V1 uuid intrepretation in struct form isn't really useful to the
rest of the kernel, and not really compatible to it either, so move it
back to AFS instead of polluting the global uuid.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Howells <dhowells@redhat.com>
2017-06-05 16:56:34 +02:00
Richard Narron
239e250e4a fs/ufs: Set UFS default maximum bytes per file
This fixes a problem with reading files larger than 2GB from a UFS-2
file system:

    https://bugzilla.kernel.org/show_bug.cgi?id=195721

The incorrect UFS s_maxsize limit became a problem as of commit
c2a9737f45 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
which started using s_maxbytes to avoid a page index overflow in
do_generic_file_read().

That caused files to be truncated on UFS-2 file systems because the
default maximum file size is 2GB (MAX_NON_LFS) and UFS didn't update it.

Here I simply increase the default to a common value used by other file
systems.

Signed-off-by: Richard Narron <comet.berkeley@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Will B <will.brokenbourgh2877@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@vger.kernel.org> # v4.9 and backports of c2a9737f45
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-04 16:33:54 -07:00
Linus Torvalds
125f42b0e2 NFS client bugfixes for Linux 4.12
Bugfixes include:
 
 - Fix a typo in commit e092693443 that breaks copy offload
 - Fix the connect error propagation in xs_tcp_setup_socket()
 - Fix a lock leak in nfs40_walk_client_list
 - Verify that pNFS requests lie within the offset range of the layout segment.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZM0YBAAoJEGcL54qWCgDyLUwQALaPEVp00UMdDR0in7MIFKsO
 2mgi7pOyn6po3EjxKbGtjAbL4nSlVxdaFpCIGg47YXrl9/95Zjjmyke+iwRdnMsa
 ZPyXwfhVRa80fxbOogAverNCnCptoHoG7EzdWuCTcOOxMxR3Ixs7wVJXrs+7ig+r
 IdvIAyTsiDYuP5yVp5KkmJCtLGc0Ze20rb7VgdQJfdiLibWvfYCLZ9CgfAQkdAMU
 RIlbT0/BG13XDqwh/C2V1vLge0VfpT5p8qbIb/kFyQ0ZJUUiicGGGjp3u/yj0aG9
 ljldI34WmQpsy+nCNN4dEgsF461ECvWLwRZnnpN9nv7VurUBpJNUqHLnubvDbzhh
 w8QX54ceEWuQAjg96keNuYOhoG53Omle2/Cm+nmiJOmShJbJ0yh4OcB9DYe0gdYa
 5YXbKRjPvf/HfdE7PPpvbPG2E211zfvkLdHnFxswggWyGrh23kqlWrpcHpZomGNW
 GbJLfIfhyEfBjCPdNJT3Tzvewo2LkcTNLb+3mJhkxOegkdops8vGYA9G2mba3Daj
 1HWl1yFAdzlEf2H1Cb8Y2ZrJKHAmaYBKBkKZYUeAcr6EtoxNqnNMP+PEDcVIzPKg
 6Jq7DiYwYksK+XDWK9G4QBguKKGLvYtv0MIA3QDX+bBGLo+eFYxc2iaaJefYNdkK
 +vdLHclg/YpepLg+Ui21
 =P2bm
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Bugfixes include:

   - Fix a typo in commit e092693443 ("NFS append COMMIT after
     synchronous COPY") that breaks copy offload

   - Fix the connect error propagation in xs_tcp_setup_socket()

   - Fix a lock leak in nfs40_walk_client_list

   - Verify that pNFS requests lie within the offset range of the layout
     segment"

* tag 'nfs-for-4.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: Mark unnecessarily extern functions as static
  SUNRPC: ensure correct error is reported by xs_tcp_setup_socket()
  NFSv4.0: Fix a lock leak in nfs40_walk_client_list
  pnfs: Fix the check for requests in range of layout segment
  xprtrdma: Delete an error message for a failed memory allocation in xprt_rdma_bc_setup()
  pNFS/flexfiles: missing error code in ff_layout_alloc_lseg()
  NFS fix COMMIT after COPY
2017-06-04 11:56:53 -07:00
Al Viro
ae2a9762d6 compat statfs: switch to copy_to_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-04 13:51:34 -04:00
Jan Kara
4f253e1eb6 nfs: Mark unnecessarily extern functions as static
nfs_initialise_sb() and nfs_clone_super() are declared as extern even
though they are used only in fs/nfs/super.c. Mark them as static.

Also remove explicit 'inline' directive from nfs_initialise_sb() and
leave it upto compiler to decide whether inlining is worth it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-06-03 16:06:38 -04:00
Linus Torvalds
f219764920 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "15 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  scripts/gdb: make lx-dmesg command work (reliably)
  mm: consider memblock reservations for deferred memory initialization sizing
  mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified
  mlock: fix mlock count can not decrease in race condition
  mm/migrate: fix refcount handling when !hugepage_migration_supported()
  dax: fix race between colliding PMD & PTE entries
  mm: avoid spurious 'bad pmd' warning messages
  mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once
  pcmcia: remove left-over %Z format
  slub/memcg: cure the brainless abuse of sysfs attributes
  initramfs: fix disabling of initramfs (and its compression)
  mm: clarify why we want kmalloc before falling backto vmallock
  frv: declare jiffies to be located in the .data section
  include/linux/gfp.h: fix ___GFP_NOLOCKDEP value
  ksm: prevent crash after write_protect_page fails
2017-06-02 15:49:46 -07:00
Ross Zwisler
e2093926a0 dax: fix race between colliding PMD & PTE entries
We currently have two related PMD vs PTE races in the DAX code.  These
can both be easily triggered by having two threads reading and writing
simultaneously to the same private mapping, with the key being that
private mapping reads can be handled with PMDs but private mapping
writes are always handled with PTEs so that we can COW.

Here is the first race:

  CPU 0					CPU 1

  (private mapping write)
  __handle_mm_fault()
    create_huge_pmd() - FALLBACK
    handle_pte_fault()
      passes check for pmd_devmap()

					(private mapping read)
					__handle_mm_fault()
					  create_huge_pmd()
					    dax_iomap_pmd_fault() inserts PMD

      dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
      			  installed in our page tables at this spot.

Here's the second race:

  CPU 0					CPU 1

  (private mapping read)
  __handle_mm_fault()
    passes check for pmd_none()
    create_huge_pmd()
      dax_iomap_pmd_fault() inserts PMD

  (private mapping write)
  __handle_mm_fault()
    create_huge_pmd() - FALLBACK
					(private mapping read)
					__handle_mm_fault()
					  passes check for pmd_none()
					  create_huge_pmd()

    handle_pte_fault()
      dax_iomap_pte_fault() inserts PTE
					    dax_iomap_pmd_fault() inserts PMD,
					       but we already have a PTE at
					       this spot.

The core of the issue is that while there is isolation between faults to
the same range in the DAX fault handlers via our DAX entry locking,
there is no isolation between faults in the code in mm/memory.c.  This
means for instance that this code in __handle_mm_fault() can run:

	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
		ret = create_huge_pmd(&vmf);

But by the time we actually get to run the fault handler called by
create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
fault has installed a normal PMD here as a parent.  This is the cause of
the 2nd race.  The first race is similar - there is the following check
in handle_pte_fault():

	} else {
		/* See comment in pte_alloc_one_map() */
		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
			return 0;

So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
will bail and retry the fault.  This is correct, but there is nothing
preventing the PMD from being installed after this check but before we
actually get to the DAX PTE fault handlers.

In my testing these races result in the following types of errors:

  BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
  BUG: non-zero nr_ptes on freeing mm: 15

Fix this issue by having the DAX fault handlers verify that it is safe
to continue their fault after they have taken an entry lock to block
other racing faults.

[ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
  Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pawel Lebioda <pawel.lebioda@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Xiong Zhou <xzhou@redhat.com>
Cc: Eryu Guan <eguan@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-02 15:07:37 -07:00
Linus Torvalds
e6e6d07436 Changes since last update:
- Fix an unmount hang due to a race in io buffer accounting.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZMKVEAAoJEPh/dxk0SrTrBYcQAKSpzE8C9wDBw6cyxP3kwrTr
 FSQiSr7flnGBHwy2U0UC/SIFIwYxvW4BTnXJWADyqtnvLWP1+TC7UY1oNpkTsbkK
 KLsWgz3aOcT/8sb346PzFDAuxof2lkv3xFPRBFaoeSkybxWqLz6BWsbmaJNH/wqy
 W3k3H241mAftEiv1i9IUlAZMXE31qywIKzzUJvkOglXS8OdVFfMPQvUz6epU2LWA
 I2tBip936Sl45vLu6ubqoRpk8dWNuPPX+f4YXl8dVeqRKTYhviMwgYD4rlljb6Ti
 kIRG9HYg1GVZo5z/5unAjyEaKzYoRrXnO5Lg+i09NIhezlDhB2HJ+k71NljoeHoe
 YCwqumQIGgnxdFu+FP10tKh2EWvDp80SQxgzIvr+FCCKJdsdNYyftRh4CtsCPJSG
 xWHT1jgovygHsBEEmG2LS9mCXKkyWgMkHNMBu3Yy/F/4HGzrPjcU3F+x90OmOo7J
 S26kEwsAoo+Q5Is8QkmqrnD+CQ7jwXEv9Mw3UqRwQ7UagRdR2nI8CIGEC7W+42Mm
 Gd3TtAyJCbhZWXNq7pLeTnGu7JY3/dhR/8VSW+mIKtvFg7v9O1wZBYId8vTwZN1+
 8jgnW0h6myE10YKU5bc1TZeYYAkWA+JLRKxoexL3QD8jWeffyZgMNWPM2rb+4Jjp
 2wwCHMPvHE8X7a2urTW3
 =wRbJ
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS fix from Darrick Wong:
 "I've one more bugfix for you for 4.12-rc4: Fix an unmount hang due to
  a race in io buffer accounting"

* tag 'xfs-4.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: use ->b_state to fix buffer I/O accounting release race
2017-06-02 12:29:03 -07:00
Linus Torvalds
3b1e342be2 Revert patch accidentally included in the merge window pull request, and
fix a crash that was likely a result of buggy client behavior.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZMHDGAAoJECebzXlCjuG+wVAP/RC2THsrHEfWQSrc+/wkKron
 7PUZo6VRhoasjBInSJB/tdy+Yb82NbfLoXfJ71ddAwRUZlte74aI762HHuMdWtHY
 8mCum5ea1AfRX5N/L/isO6lh4utO0vQEJ8r+P095d3EDwl0DnLYC3JVlKd/1r2VS
 ELy8DZkyaVHZO9xiT+mnRgsq4aMjxG3F7DTHpcKDDFzG5Ts00zBQIXDu/rKmw3fD
 WEuQjjrit1gFrUIUzJbSqwSokDCcf7v9HtGTI5+t+pIZ4Q2SyuKuTZvjtg+hb7Qa
 K+F2SNNQsqfTW65zllhVR3gYpCykoqYPAJDw9MlqLN5tCmXFZLYhFHEUFx5kuobx
 7+Dc3z1o5BgOiXcnKBVe+uONxXxcMYXLbU0e5Gac39GYW5xWzrU1+O6mMi0Q01YS
 QsGRZEqHE2/3j1TAl0Q2SqT8gtG+A7piU4s5VavIHKIzI3/WubZ1GjLQ+RfXjuNa
 DvkcAvSYfHyxzdWlyxjkzM09edt6SN3yEYdIRv9hiJEbUO3itVm9ycXTHLJUQUL0
 sfVeXkm49e8gZZxHn+XuJubkT8HYlDGLQVSzK1zWFgt+zxd9LiP9iY+zs+vL9ryJ
 DM9VmlJxZvNx9T7zSradW7gbIwOgxmBfRHFD05oODS1Tymb029akuU0YACb0sVnQ
 LzDaZejUmURp7vlUffFp
 =wznG
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.12-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Revert patch accidentally included in the merge window pull request,
  and fix a crash that was likely a result of buggy client behavior"

* tag 'nfsd-4.12-1' of git://linux-nfs.org/~bfields/linux:
  nfsd4: fix null dereference on replay
  nfsd: Revert "nfsd: check for oversized NFSv2/v3 arguments"
2017-06-01 16:24:48 -07:00
Linus Torvalds
2f48641cfc Use designated initializers for mtk-vcodec, powerplay, amdgpu, and sgi-xp.
Use ERR_CAST() to avoid cross-structure cast in ocf2, ntfs, and NFS.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJZMHWdAAoJEIly9N/cbcAmWOYP/i45fa6JG7Aw9N59Uz4sqeUQ
 ZUlvAUek6GkaGijCPtDYjy0cVj2Cc3QZLSRq9dDw/rU66Mc0ybYWHtIIwJy4ZjVe
 D4w2Cs7K1oSOnhJnPTjQSKuMD81PF75NLChf3XSfLvtOWVIqW33EzLIu5lJ1rc1x
 wh1fEAsJXGA9xklmW+m8Vn1FoS1a1j+9zuCEmGpveOkk6UKhhp73Ke8PP4uK9ld+
 saApe/iH0JdTP6I7030A8hXwz7ZCYbMicw1kVpnsn4rM24p+k3Y2/OrFT2tY6/Y6
 fzkTuVL7omQmUWph9zX6SYPg2GACEBTLb5V1YJ6zDUUzucu7vjfsvsTHXZb1gq2j
 i8hZ6XsNOMWYJiOkOOSKM0rpjG6WSvF/sGc78ap7NJ4QPZ2/h3BTOXfk/ye/xQmL
 WidEESJ4srInpi5ju8JTWHe27aydwiUUF91Y+gFv4G6CGU6/5vjUzOsgeiMxt0JN
 lPaTjjL4lBHI2yohx2Wqy88yYWulK3LB0Hzt9XcSGMBA58H9d0CV0ZTkH3dJJkpC
 QCM+Kt1DPy5A2RPC2APrPPCJsQycX9PSDeRaWkTxHnNLftpq65h1pAKjMcqsUPgb
 HEEMLIBGqm871dr3+aPJPfG3Qil9ANBscDRbHXugCFTseFQO6M26KAxWGN+6LIQp
 6Z0GUaPgJEua9ejodq4m
 =R3qn
 -----END PGP SIGNATURE-----

Merge tag 'gcc-plugins-v4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull gcc-plugin prepwork from Kees Cook:
 "Use designated initializers for mtk-vcodec, powerplay, amdgpu, and
  sgi-xp. Use ERR_CAST() to avoid cross-structure cast in ocf2, ntfs,
  and NFS.

  Christoph Hellwig recommended that I send these fixes now, rather than
  waiting for the v4.13 merge window. These are all initializer and cast
  fixes needed for the future randstruct plugin that haven't been picked
  up by the respective maintainers"

* tag 'gcc-plugins-v4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  mtk-vcodec: Use designated initializers
  drm/amd/powerplay: Use designated initializers
  drm/amdgpu: Use designated initializers
  sgi-xp: Use designated initializers
  ocfs2: Use ERR_CAST() to avoid cross-structure cast
  ntfs: Use ERR_CAST() to avoid cross-structure cast
  NFS: Use ERR_CAST() to avoid cross-structure cast
2017-06-01 16:17:42 -07:00
Bart Van Assche
30181faae3 nfsd: Check queue type before submitting a SCSI request
Since using scsi_req() is only allowed against request queues for
which struct scsi_request is the first member of their private
request data, refuse to submit SCSI commands against a queue for
which this is not the case.

References: commit 82ed4db499 ("block: split scsi_request out of struct request")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Omar Sandoval <osandov@fb.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-01 13:10:46 -06:00
Linus Torvalds
0bb230399f Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull Reiserfs and GFS2 fixes from Jan Kara:
 "Fixes to GFS2 & Reiserfs for the fallout of the recent WRITE_FUA
  cleanup from Christoph.

  Fixes for other filesystems were already merged by respective
  maintainers."

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  reiserfs: Make flush bios explicitely sync
  gfs2: Make flush bios explicitely sync
2017-06-01 10:45:27 -07:00
Christoph Hellwig
94073ad77f fs/locks: don't mess with the address limit in compat_fcntl64
Instead write a proper compat syscall that calls common helpers.

[ jlayton: fix pointer dereferencing in fixup_compat_flock ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-06-01 11:29:07 -04:00
Jeff Mahoney
a9b3311ef3 btrfs: fix race with relocation recovery and fs_root setup
If we have to recover relocation during mount, we'll ultimately have to
evict the orphan inode.  That goes through the reservation dance, where
priority_reclaim_metadata_space and flush_space expect fs_info->fs_root
to be valid.  That's the next thing to be set up during mount, so we
crash, almost always in flush_space trying to join the transaction
but priority_reclaim_metadata_space is possible as well.  This call
path has been problematic in the past WRT whether ->fs_root is valid
yet.  Commit 957780eb27 (Btrfs: introduce ticketed enospc
infrastructure) added new users that are called in the direct path
instead of the async path that had already been worked around.

The thing is that we don't actually need the fs_root, specifically, for
anything.  We either use it to determine whether the root is the
chunk_root for use in choosing an allocation profile or as a root to pass
btrfs_join_transaction before immediately committing it.  Anything that
isn't the chunk root works in the former case and any root works in
the latter.

A simple fix is to use a root we know will always be there: the
extent_root.

Cc: <stable@vger.kernel.org> # v4.8+
Fixes: 957780eb27 (Btrfs: introduce ticketed enospc infrastructure)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-01 16:56:55 +02:00
Jeff Mahoney
896533a7da btrfs: fix memory leak in update_space_info failure path
If we fail to add the space_info kobject, we'll leak the memory
for the percpu counter.

Fixes: 6ab0a2029c (btrfs: publish allocation data in sysfs)
Cc: <stable@vger.kernel.org> # v3.14+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-01 16:56:31 +02:00
David Sterba
cc2b702c52 btrfs: use correct types for page indices in btrfs_page_exists_in_range
Variables start_idx and end_idx are supposed to hold a page index
derived from the file offsets. The int type is not the right one though,
offsets larger than 1 << 44 will get silently trimmed off the high bits.
(1 << 44 is 16TiB)

What can go wrong, if start is below the boundary and end gets trimmed:
- if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
- the final check "if (page->index <= end_idx)" will unexpectedly fail

The function will return false, ie. "there's no page in the range",
although there is at least one.

btrfs_page_exists_in_range is used to prevent races in:

* in hole punching, where we make sure there are not pages in the
  truncated range, otherwise we'll wait for them to finish and redo
  truncation, but we're going to replace the pages with holes anyway so
  the only problem is the intermediate state

* lock_extent_direct: we want to make sure there are no pages before we
  lock and start DIO, to prevent stale data reads

For practical occurence of the bug, there are several constaints.  The
file must be quite large, the affected range must cross the 16TiB
boundary and the internal state of the file pages and pending operations
must match.  Also, we must not have started any ordered data in the
range, otherwise we don't even reach the buggy function check.

DIO locking tries hard in several places to avoid deadlocks with
buffered IO and avoids waiting for ranges. The worst consequence seems
to be stale data read.

CC: Liu Bo <bo.li.liu@oracle.com>
CC: stable@vger.kernel.org	# 3.16+
Fixes: fc4adbff82 ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-01 16:56:17 +02:00
Kees Cook
d3762358a7 pstore: Fix format string to use %u for record id
The format string for record->id (u64) was using %lld instead of %llu.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-31 10:13:45 -07:00
Kees Cook
c7f3c595f6 pstore: Populate pstore record->time field
The current time will be initially available in the record->time field
for all pstore_read() and pstore_write() calls. Backends can either
update the field during read(), or use the field during write() instead
of fetching time themselves.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-31 10:13:44 -07:00
Kees Cook
e581ca813a pstore: Create common record initializer
In preparation for setting timestamps in the pstore core, create a common
initializer routine, instead of using static initializers.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-31 10:13:44 -07:00
Kees Cook
656de42e83 pstore: Avoid potential infinite loop
If a backend does not correctly iterate through its records, pstore will
get stuck loading entries. Detect this with a large record count, and
announce if we ever hit the limit. This will let future backend reading
bugs less annoying to debug. Additionally adjust the error about
pstore_mkfile() failing.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-31 10:13:42 -07:00
Douglas Anderson
f6525b96dd pstore: Fix leaked pstore_record in pstore_get_backend_records()
When the "if (record->size <= 0)" test is true in
pstore_get_backend_records() it's pretty clear that nobody holds a
reference to the allocated pstore_record, yet we don't free it.

Let's free it.

Fixes: 2a2b0acf76 ("pstore: Allocate records on heap instead of stack")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
2017-05-31 10:10:09 -07:00
Ankit Kumar
4a16d1cb24 pstore: Don't warn if data is uncompressed and type is not PSTORE_TYPE_DMESG
commit 9abdcccc3d ("pstore: Extract common arguments into structure")
moved record decompression to function. decompress_record() gets
called without checking type and compressed flag. Warning will be
reported if data is uncompressed. Pstore type PSTORE_TYPE_PPC_OPAL,
PSTORE_TYPE_PPC_COMMON doesn't contain compressed data and warning get
printed part of dmesg.

Partial dmesg log:
[   35.848914] pstore: ignored compressed record type 6
[   35.848927] pstore: ignored compressed record type 8

Above warning should not get printed as it is known that data won't be
compressed for above type and it is valid condition.

This patch returns if data is not compressed and print warning only if
data is compressed and type is not PSTORE_TYPE_DMESG.

Reported-by: Anton Blanchard <anton@au1.ibm.com>
Signed-off-by: Ankit Kumar <ankit@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Fixes: 9abdcccc3d ("pstore: Extract common arguments into structure")
Cc: stable@vger.kernel.org
2017-05-31 10:09:32 -07:00
Linus Torvalds
d602fb6844 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "Fix regressions:

   - missing CONFIG_EXPORTFS dependency

   - failure if upper fs doesn't support xattr

   - bad error cleanup

  This also adds the concept of "impure" directories complementing the
  "origin" marking introduced in -rc1. Together they enable getting
  consistent st_ino and d_ino for directory listings.

  And there's a bug fix and a cleanup as well"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: filter trusted xattr for non-admin
  ovl: mark upper merge dir with type origin entries "impure"
  ovl: mark upper dir with type origin entries "impure"
  ovl: remove unused arg from ovl_lookup_temp()
  ovl: handle rename when upper doesn't support xattr
  ovl: don't fail copy-up if upper doesn't support xattr
  ovl: check on mount time if upper fs supports setting xattr
  ovl: fix creds leak in copy up error path
  ovl: select EXPORTFS
2017-05-31 08:29:02 -07:00
Brian Foster
63db7c815b xfs: use ->b_state to fix buffer I/O accounting release race
We've had user reports of unmount hangs in xfs_wait_buftarg() that
analysis shows is due to btp->bt_io_count == -1. bt_io_count
represents the count of in-flight asynchronous buffers and thus
should always be >= 0. xfs_wait_buftarg() waits for this value to
stabilize to zero in order to ensure that all untracked (with
respect to the lru) buffers have completed I/O processing before
unmount proceeds to tear down in-core data structures.

The value of -1 implies an I/O accounting decrement race. Indeed,
the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
(where the buffer lock is no longer held) means that bp->b_flags can
be updated from an unsafe context. While a user-level reproducer is
currently not available, some intrusive hacks to run racing buffer
lookups/ioacct/releases from multiple threads was used to
successfully manufacture this problem.

Existing callers do not expect to acquire the buffer lock from
xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
this context. It turns out that we already have separate buffer
state bits and associated serialization for dealing with buffer LRU
state in the form of ->b_state and ->b_lock. Therefore, replace the
_XBF_IN_FLIGHT flag with a ->b_state variant, update the I/O
accounting wrappers appropriately and make sure they are used with
the correct locking. This ensures that buffer in-flight state can be
modified at buffer release time without racing with modifications
from a buffer lock holder.

Fixes: 9c7504aa72 ("xfs: track and serialize in-flight async buffers against unmount")
Cc: <stable@vger.kernel.org> # v4.8+
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Libor Pechacek <lpechacek@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-31 08:22:52 -07:00
Linus Torvalds
f511c0b17b "Yes, people use FOLL_FORCE ;)"
This effectively reverts commit 8ee74a91ac ("proc: try to remove use
of FOLL_FORCE entirely")

It turns out that people do depend on FOLL_FORCE for the /proc/<pid>/mem
case, and we're talking not just debuggers. Talking to the affected people, the use-cases are:

Keno Fischer:
 "We used these semantics as a hardening mechanism in the julia JIT. By
  opening /proc/self/mem and using these semantics, we could avoid
  needing RWX pages, or a dual mapping approach. We do have fallbacks to
  these other methods (though getting EIO here actually causes an assert
  in released versions - we'll updated that to make sure to take the
  fall back in that case).

  Nevertheless the /proc/self/mem approach was our favored approach
  because it a) Required an attacker to be able to execute syscalls
  which is a taller order than getting memory write and b) didn't double
  the virtual address space requirements (as a dual mapping approach
  would).

  I think in general this feature is very useful for anybody who needs
  to precisely control the execution of some other process. Various
  debuggers (gdb/lldb/rr) certainly fall into that category, but there's
  another class of such processes (wine, various emulators) which may
  want to do that kind of thing.

  Now, I suspect most of these will have the other process under ptrace
  control, so maybe allowing (same_mm || ptraced) would be ok, but at
  least for the sandbox/remote-jit use case, it would be perfectly
  reasonable to not have the jit server be a ptracer"

Robert O'Callahan:
 "We write to readonly code and data mappings via /proc/.../mem in lots
  of different situations, particularly when we're adjusting program
  state during replay to match the recorded execution.

  Like Julia, we can add workarounds, but they could be expensive."

so not only do people use FOLL_FORCE for both reads and writes, but they
use it for both the local mm and remote mm.

With these comments in mind, we likely also cannot add the "are we
actively ptracing" check either, so this keeps the new code organization
and does not do a real revert that would add back the original comment
about "Maybe we should limit FOLL_FORCE to actual ptrace users?"

Reported-by: Keno Fischer <keno@juliacomputing.com>
Reported-by: Robert O'Callahan <robert@ocallahan.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-30 12:38:59 -07:00
Jan Kara
67a7d5f561 ext4: fix fdatasync(2) after extent manipulation operations
Currently, extent manipulation operations such as hole punch, range
zeroing, or extent shifting do not record the fact that file data has
changed and thus fdatasync(2) has a work to do. As a result if we crash
e.g. after a punch hole and fdatasync, user can still possibly see the
punched out data after journal replay. Test generic/392 fails due to
these problems.

Fix the problem by properly marking that file data has changed in these
operations.

CC: stable@vger.kernel.org
Fixes: a4bb6b64e3
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-29 13:24:55 -04:00
Miklos Szeredi
a082c6f680 ovl: filter trusted xattr for non-admin
Filesystems filter out extended attributes in the "trusted." domain for
unprivlieged callers.

Overlay calls underlying filesystem's method with elevated privs, so need
to do the filtering in overlayfs too.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-29 15:15:27 +02:00
Amir Goldstein
f3a1568582 ovl: mark upper merge dir with type origin entries "impure"
An upper dir is marked "impure" to let ovl_iterate() know that this
directory may contain non pure upper entries whose d_ino may need to be
read from the origin inode.

We already mark a non-merge dir "impure" when moving a non-pure child
entry inside it, to let ovl_iterate() know not to iterate the non-merge
dir directly.

Mark also a merge dir "impure" when moving a non-pure child entry inside
it and when copying up a child entry inside it.

This can be used to optimize ovl_iterate() to perform a "pure merge" of
upper and lower directories, merging the content of the directories,
without having to read d_ino from origin inodes.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-29 11:48:00 +02:00
Kees Cook
7585d12f65 ocfs2: Use ERR_CAST() to avoid cross-structure cast
When trying to propagate an error result, the error return path attempts
to retain the error, but does this with an open cast across very different
types, which the upcoming structure layout randomization plugin flags as
being potentially dangerous in the face of randomization. This is a false
positive, but what this code actually wants to do is use ERR_CAST() to
retain the error value.

Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-28 10:11:49 -07:00
Kees Cook
fee2aa7538 ntfs: Use ERR_CAST() to avoid cross-structure cast
When trying to propagate an error result, the error return path attempts
to retain the error, but does this with an open cast across very different
types, which the upcoming structure layout randomization plugin flags as
being potentially dangerous in the face of randomization. This is a false
positive, but what this code actually wants to do is use ERR_CAST() to
retain the error value.

Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-28 10:11:48 -07:00
Kees Cook
fe3b81b446 NFS: Use ERR_CAST() to avoid cross-structure cast
When the call to nfs_devname() fails, the error path attempts to retain
the error via the mnt variable, but this requires a cast across very
different types (char * to struct vfsmount *), which the upcoming
structure layout randomization plugin flags as being potentially
dangerous in the face of randomization. This is a false positive, but
what this code actually wants to do is retain the error value, so this
patch explicitly sets it, instead of using what seems to be an
unexpected cast.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-05-28 10:11:47 -07:00
Al Viro
4d7edbc34c nfsd_readlink(): switch to vfs_get_link()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-27 16:11:23 -04:00
Christoph Hellwig
a75d30c772 fs/locks: pass kernel struct flock to fcntl_getlk/setlk
This will make it easier to implement a sane compat fcntl syscall.

[ jlayton: fix undeclared identifiers in 32-bit fcntl64 syscall handler ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-05-27 06:07:19 -04:00
Mauro Carvalho Chehab
80b79dd0e2 fs: locks: Fix some troubles at kernel-doc comments
There are a few syntax violations that cause outputs of
a few comments to not be properly parsed in ReST format.

No functional changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-05-27 06:07:18 -04:00
Jan Kara
a056bdaae7 ext4: fix data corruption for mmap writes
mpage_submit_page() can race with another process growing i_size and
writing data via mmap to the written-back page. As mpage_submit_page()
samples i_size too early, it may happen that ext4_bio_write_page()
zeroes out too large tail of the page and thus corrupts user data.

Fix the problem by sampling i_size only after the page has been
write-protected in page tables by clear_page_dirty_for_io() call.

Reported-by: Michael Zimmer <michael@swarm64.com>
CC: stable@vger.kernel.org
Fixes: cb20d51883
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-26 17:45:45 -04:00
Jan Kara
4f8caa60a5 ext4: fix data corruption with EXT4_GET_BLOCKS_ZERO
When ext4_map_blocks() is called with EXT4_GET_BLOCKS_ZERO to zero-out
allocated blocks and these blocks are actually converted from unwritten
extent the following race can happen:

CPU0					CPU1

page fault				page fault
...					...
ext4_map_blocks()
  ext4_ext_map_blocks()
    ext4_ext_handle_unwritten_extents()
      ext4_ext_convert_to_initialized()
	- zero out converted extent
	ext4_zeroout_es()
	  - inserts extent as initialized in status tree

					ext4_map_blocks()
					  ext4_es_lookup_extent()
					    - finds initialized extent
					write data
  ext4_issue_zeroout()
    - zeroes out new extent overwriting data

This problem can be reproduced by generic/340 for the fallocated case
for the last block in the file.

Fix the problem by avoiding zeroing out the area we are mapping with
ext4_map_blocks() in ext4_ext_convert_to_initialized(). It is pointless
to zero out this area in the first place as the caller asked us to
convert the area to initialized because he is just going to write data
there before the transaction finishes. To achieve this we delete the
special case of zeroing out full extent as that will be handled by the
cases below zeroing only the part of the extent that needs it. We also
instruct ext4_split_extent() that the middle of extent being split
contains data so that ext4_split_extent_at() cannot zero out full extent
in case of ENOSPC.

CC: stable@vger.kernel.org
Fixes: 12735f8819
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-26 17:40:52 -04:00
Linus Torvalds
cdbe020678 Changed since last update:
- Fix indlen block reservation accounting bug when splitting delalloc extent
 - Fix warnings about unused variables that appeared in -rc1.
 - Don't spew errors when bmapping a local format directory
 - Fix an off-by-one error in a delalloc eof assertion
 - Make fsmap only return inode information for CAP_SYS_ADMIN
 - Fix a potential mount time deadlock recovering cow extents
 - Fix unaligned memory access in _btree_visit_blocks
 - Fix various SEEK_HOLE/SEEK_DATA bugs
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZJwnxAAoJEPh/dxk0SrTr/TMQAKP6OMsjYxpro+1Uif+oPTQ6
 vvUfXJMWLKc07QI/czwLDY4A36h2TZjNxpBJypSfVumlD82ZPa8gp6XFWngwIUb4
 3G+A9zq4Fviq8Vzz3G75C8Q49h8IpmU3SimTlhS1BIcxe+upu2qplzM3yc6/T4MB
 WTTqtjL3SaW5D2v0ZdPL9ulQKKAlL1WfbZV9dDJ4UiRw5Jlwj2Udg6HnbRvfrcZF
 IziYlidrTIt64ecA9GqR32soXqFBGPKo6Wp9Pk+iWLlsfM6qcCt1m+yfM1JonRGA
 wycygcrrjfR/lFHMQCGonLs1ajC6isLeMZ804P6OP2q6kfdtersedvY7XSoYsEJ4
 ok4J3fiyqYgMGhPz7x0Y8IH9+gdudn7+fHiC5/RNkolEy8AbPPe21XhFDVxeTkCs
 4GAHNGQfOEK2PT69Ya81taVzT/TpuIGIkUAaDH8vsfxwcVunM08/OffsCiinLMJx
 bt3G7fH3wJ+VuYJS92amj3k6n6EAeHYc0dAVGd5e8dtN25079nBm+EP0Wp+j8uVl
 PwaJjde68wxWUvuYXVK1a8vietRS7xChyta34cYcStd4wWu1knccpN/mjQnK/ucB
 4etZspB1rQQx08KBqHVq8t508PA7nWtFxjE91JYkpvbyYym1WEH8Mz7rbVBI6NjS
 Y/8+uPhFq2BU1b9skj0U
 =pDjl
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.12-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS fixes from Darrick Wong:
 "A few miscellaneous bug fixes & cleanups:

   - Fix indlen block reservation accounting bug when splitting delalloc
     extent

   - Fix warnings about unused variables that appeared in -rc1.

   - Don't spew errors when bmapping a local format directory

   - Fix an off-by-one error in a delalloc eof assertion

   - Make fsmap only return inode information for CAP_SYS_ADMIN

   - Fix a potential mount time deadlock recovering cow extents

   - Fix unaligned memory access in _btree_visit_blocks

   - Fix various SEEK_HOLE/SEEK_DATA bugs"

* tag 'xfs-4.12-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: Move handling of missing page into one place in xfs_find_get_desired_pgoff()
  xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff()
  xfs: Fix missed holes in SEEK_HOLE implementation
  xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()
  xfs: fix unaligned access in xfs_btree_visit_blocks
  xfs: avoid mount-time deadlock in CoW extent recovery
  xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN
  xfs: bad assertion for delalloc an extent that start at i_size
  xfs: fix warnings about unused stack variables
  xfs: BMAPX shouldn't barf on inline-format directories
  xfs: fix indlen accounting error on partial delalloc conversion
2017-05-26 12:13:08 -07:00
Al Viro
8d1a81a852 sanitize do_i2c_smbus_ioctl()
no need to mess with __copy_in_user()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-25 17:52:59 -04:00
Jan Kara
a54fba8f5a xfs: Move handling of missing page into one place in xfs_find_get_desired_pgoff()
Currently several places in xfs_find_get_desired_pgoff() handle the case
of a missing page. Make them all handled in one place after the loop has
terminated.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Jan Kara
d7fd24257a xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff()
There is an off-by-one error in loop termination conditions in
xfs_find_get_desired_pgoff() since 'end' may index a page beyond end of
desired range if 'endoff' is page aligned. It doesn't have any visible
effects but still it is good to fix it.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Jan Kara
5375023ae1 xfs: Fix missed holes in SEEK_HOLE implementation
XFS SEEK_HOLE implementation could miss a hole in an unwritten extent as
can be seen by the following command:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
       -c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (49.312 MiB/sec and 12623.9856 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (70.383 MiB/sec and 18018.0180 ops/sec)
Whence	Result
HOLE	139264

Where we can see that hole at offset 56k was just ignored by SEEK_HOLE
implementation. The bug is in xfs_find_get_desired_pgoff() which does
not properly detect the case when pages are not contiguous.

Fix the problem by properly detecting when found page has larger offset
than expected.

CC: stable@vger.kernel.org
Fixes: d126d43f63
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Eryu Guan
8affebe16d xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()
xfs_find_get_desired_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.

When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size XFS on x86_64 host.

  # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
  	    -c "seek -d 0" /mnt/xfs/testfile
  wrote 1024/1024 bytes at offset 2048
  1 KiB, 1 ops; 0.0000 sec (33.675 MiB/sec and 34482.7586 ops/sec)
  Whence  Result
  DATA    EOF

Data at offset 2k was missed, and lseek(2) returned ENXIO.

This is uncovered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.

Cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Eric Sandeen
a4d768e702 xfs: fix unaligned access in xfs_btree_visit_blocks
This structure copy was throwing unaligned access warnings on sparc64:

Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs]

xfs_btree_copy_ptrs does a memcpy, which avoids it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Tahsin Erdogan
b8cb5a545c ext4: fix quota charging for shared xattr blocks
ext4_xattr_block_set() calls dquot_alloc_block() to charge for an xattr
block when new references are made. However if dquot_initialize() hasn't
been called on an inode, request for charging is effectively ignored
because ext4_inode_info->i_dquot is not initialized yet.

Add dquot_initialize() to call paths that lead to ext4_xattr_block_set().

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:24:07 -04:00
Eric Biggers
c41d342b39 ext4: remove redundant check for encrypted file on dio write path
Currently we don't allow direct I/O on encrypted regular files, so in
such cases we return 0 early in ext4_direct_IO().  There was also an
additional BUG_ON() check in ext4_direct_IO_write(), but it can never be
hit because of the earlier check for the exact same condition in
ext4_direct_IO().  There was also no matching check on the read path,
which made the write path specific check seem very ad-hoc.

Just remove the unnecessary BUG_ON().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: David Gstir <david@sigma-star.at>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:20:31 -04:00
Eric Biggers
d6b975504e ext4: remove unused d_name argument from ext4_search_dir() et al.
Now that we are passing a struct ext4_filename, we do not need to pass
around the original struct qstr too.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:10:49 -04:00
Eric Biggers
e5465795ca ext4: fix off-by-one error when writing back pages before dio read
The 'lend' argument of filemap_write_and_wait_range() is inclusive, so
we need to subtract 1 from pos + count.

Note that 'count' is guaranteed to be nonzero since
ext4_file_read_iter() returns early when given a 0 count.

Fixes: 16c5468859 ("ext4: Allow parallel DIO reads")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:05:29 -04:00
Eryu Guan
624327f879 ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()
ext4_find_unwritten_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.

When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size ext4 on x86_64 host.

  # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
  	    -c "seek -d 0" /mnt/ext4/testfile
  wrote 1024/1024 bytes at offset 2048
  1 KiB, 1 ops; 0.0000 sec (42.459 MiB/sec and 43478.2609 ops/sec)
  Whence  Result
  DATA    EOF

Data at offset 2k was missed, and lseek(2) returned ENXIO.

This is unconvered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.

Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:02:20 -04:00
Luis Henriques
42c99fc4c7 ceph: check that the new inode size is within limits in ceph_fallocate()
Currently the ceph client doesn't respect the rlimit in fallocate.  This
means that a user can allocate a file with size > RLIMIT_FSIZE.  This
patch adds the call to inode_newsize_ok() to verify filesystem limits and
ulimits.  This should make ceph successfully run xfstest generic/228.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-24 18:10:54 +02:00
Trond Myklebust
b49c15f97c NFSv4.0: Fix a lock leak in nfs40_walk_client_list
Xiaolong Ye's kernel test robot detected the following Oops:
[  299.158991] BUG: scheduling while atomic: mount.nfs/9387/0x00000002
[  299.169587] 2 locks held by mount.nfs/9387:
[  299.176165]  #0:  (nfs_clid_init_mutex){......}, at: [<ffffffff8130cc92>] nfs4_discover_server_trunking+0x47/0x1fc
[  299.201802]  #1:  (&(&nn->nfs_client_lock)->rlock){......}, at: [<ffffffff813125fa>] nfs40_walk_client_list+0x2e9/0x338
[  299.221979] CPU: 0 PID: 9387 Comm: mount.nfs Not tainted 4.11.0-rc7-00021-g14d1bbb #45
[  299.235584] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[  299.251176] Call Trace:
[  299.255192]  dump_stack+0x61/0x7e
[  299.260416]  __schedule_bug+0x65/0x74
[  299.266208]  __schedule+0x5d/0x87c
[  299.271883]  schedule+0x89/0x9a
[  299.276937]  schedule_timeout+0x232/0x289
[  299.283223]  ? detach_if_pending+0x10b/0x10b
[  299.289935]  schedule_timeout_uninterruptible+0x2a/0x2c
[  299.298266]  ? put_rpccred+0x3e/0x115
[  299.304327]  ? schedule_timeout_uninterruptible+0x2a/0x2c
[  299.312851]  msleep+0x1e/0x22
[  299.317612]  nfs4_discover_server_trunking+0x102/0x1fc
[  299.325644]  nfs4_init_client+0x13f/0x194

It looks as if we recently added a spin_lock() leak to
nfs40_walk_client_list() when cleaning up the code.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Fixes: 14d1bbb0ca ("NFS: Create a common nfs4_match_client() function")
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-24 08:05:16 -04:00
Benjamin Coddington
08cb5b0f05 pnfs: Fix the check for requests in range of layout segment
It's possible and acceptable for NFS to attempt to add requests beyond the
range of the current pgio->pg_lseg, a case which should be caught and
limited by the pg_test operation.  However, the current handling of this
case replaces pgio->pg_lseg with a new layout segment (after a WARN) within
that pg_test operation.  That will cause all the previously added requests
to be submitted with this new layout segment, which may not be valid for
those requests.

Fix this problem by only returning zero for the number of bytes to coalesce
from pg_test for this case which allows any previously added requests to
complete on the current layout segment.  The check for requests starting
out of range of the layout segment moves to pg_init, so that the
replacement of pgio->pg_lseg will be done when the next request is added.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-24 07:55:02 -04:00
Dan Carpenter
662f9a105b pNFS/flexfiles: missing error code in ff_layout_alloc_lseg()
If xdr_inline_decode() fails then we end up returning ERR_PTR(0).  The
caller treats NULL returns as -ENOMEM so it doesn't really hurt runtime,
but obviously we intended to set an error code here.

Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-24 07:52:54 -04:00
Olga Kornievskaia
6d3b5d8d8d NFS fix COMMIT after COPY
Fix a typo in the commit e092693443
"NFS append COMMIT after synchronous COPY"

Reported-by: Eryu Guan <eguan@redhat.com>
Fixes: e092693443 ("NFS append COMMIT after synchronous COPY")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Tested-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-24 07:52:48 -04:00
Jan Kara
d8747d642e reiserfs: Make flush bios explicitely sync
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

Fixes: b685d3d65a
CC: reiserfs-devel@vger.kernel.org
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2017-05-24 13:35:20 +02:00
Jan Kara
0f0b9b63e1 gfs2: Make flush bios explicitely sync
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

Fixes: b685d3d65a
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: cluster-devel@redhat.com
CC: stable@vger.kernel.org
Acked-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-05-24 13:35:20 +02:00
Eric Biggers
aaebdee8b8 f2fs: don't bother checking for encryption key in ->write_iter()
Since only an open file can be written to, and we only allow open()ing
an encrypted file when its key is available, there is no need to check
for the key again before permitting each ->write_iter().

This code was also broken in that it wouldn't actually have failed if
the key was in fact unavailable.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:11:08 -07:00
Eric Biggers
b82a6ea6ec f2fs: don't bother checking for encryption key in ->mmap()
Since only an open file can be mmap'ed, and we only allow open()ing an
encrypted file when its key is available, there is no need to check for
the key again before permitting each mmap().

This f2fs copy of this code was also broken in that it wouldn't actually
have failed if the key was in fact unavailable.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:10:36 -07:00
Chao Yu
6afae6336a f2fs: wait discard IO completion without cmd_lock held
Wait discard IO completion outside cmd_lock to avoid long latency
of holding cmd_lock in IO busy scenario.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:10:03 -07:00
Chao Yu
e31b982157 f2fs: wake up all waiters in f2fs_submit_discard_endio
There could be more than one waiter waiting discard IO completion, so we
need use complete_all() instead of complete() in f2fs_submit_discard_endio
to avoid hungtask.

Fixes: 	ec9895add2 ("f2fs: don't hold cmd_lock during waiting discard
command")
Cc: <stable@vger.kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:09:54 -07:00
Chao Yu
04dfc23006 f2fs: show more info if fail to issue discard
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:09:45 -07:00
Chao Yu
fb830fc5cf f2fs: introduce io_list for serialize data/node IOs
Serialize data/node IOs by using fifo list instead of mutex lock,
it will help to enhance concurrency of f2fs, meanwhile keeping LFS
IO semantics.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:09:03 -07:00
Chao Yu
e41e6d75e5 f2fs: split wio_mutex
Split wio_mutex to adjust different temperature bio cache.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:23 -07:00
Yunlei He
963932a93c f2fs: combine huge num of discard rb tree consistence checks
Came across a hungtask caused by huge number of rb tree traversing
during adding discard addrs in cp. This patch combine these consistence
checks and move it to discard thread.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:19 -07:00
Yunlei He
dad48e7312 f2fs: fix a bug caused by NULL extent tree
Thread A:					Thread B:

-f2fs_remount
    -sbi->mount_opt.opt = 0;
						<--- -f2fs_iget
						         -do_read_inode
							     -f2fs_init_extent_tree
							         -F2FS_I(inode)->extent_tree is NULL
        -default_options && parse_options
	    -remount return
						<---  -f2fs_map_blocks
						          -f2fs_lookup_extent_tree
                                                              -f2fs_bug_on(sbi, !et);

The same problem with f2fs_new_inode.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:18 -07:00
Jaegeuk Kim
1d7be27082 f2fs: try to freeze in gc and discard threads
This allows to freeze gc and discard threads.

Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:18 -07:00
Yunlei He
b7b7c4cf1c f2fs: add a new function get_ssr_cost
This patch add a new method get_ssr_cost to select
SSR segment more accurately.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:17 -07:00
Hou Pengyang
bd80a4b981 f2fs: declare load_free_nid_bitmap static
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:16 -07:00
Jaegeuk Kim
cc15620bc8 f2fs: avoid f2fs_lock_op for IPU writes
Currently, if we do get_node_of_data before f2fs_lock_op, there may be dead lock
as follows, where process A would be in infinite loop, and B will NOT be awaked.

Process A(cp):            Process B:
f2fs_lock_all(sbi)
                        get_dnode_of_data <---- lock dn.node_page
flush_nodes             f2fs_lock_op

So, this patch adds f2fs_trylock_op to avoid f2fs_lock_op done by IPU.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:15 -07:00
Jaegeuk Kim
a912b54d3a f2fs: split bio cache
Split DATA/NODE type bio cache according to different temperature,
so write IOs with the same temperature can be merged in corresponding
bio cache as much as possible, otherwise, different temperature write
IOs submitting into one bio cache will always cause split of bio.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:39 -07:00
Jaegeuk Kim
81377bd628 f2fs: use fio instead of multiple parameters
This patch just changes using fio instead of parameters.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:38 -07:00
Jaegeuk Kim
b9109b0e49 f2fs: remove unnecessary read cases in merged IO flow
Merged IO flow doesn't need to care about read IOs.

f2fs_submit_merged_bio -> f2fs_submit_merged_write
f2fs_submit_merged_bios -> f2fs_submit_merged_writes
f2fs_submit_merged_bio_cond -> f2fs_submit_merged_write_cond

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:37 -07:00
Jaegeuk Kim
1919ffc0d7 f2fs: use f2fs_submit_page_bio for ra_meta_pages
This patch avoids to use f2fs_submit_merged_bio for read, which was the only
read case.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:36 -07:00
Weichao Guo
e5dbd9563e f2fs: make sure f2fs_gc returns consistent errno
By default, f2fs_gc returns -EINVAL in general error cases, e.g., no victim
was selected. However, the default errno may be overwritten in two cases:
gc_more and BG_GC -> FG_GC. We should return consistent errno in such cases.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:35 -07:00
Chao Yu
1c6d8ee4b8 f2fs: support statx
Last kernel has already support new syscall statx() in commit a528d35e8b
("statx: Add a system call to make enhanced file info available"), with
this interface we can show more file info including file creation and some
attribute flags to user.

This patch tries to support this functionality.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:34 -07:00
Jaegeuk Kim
93607124c5 f2fs: load inode's flag from disk
This patch fixes missing inode flag loaded from disk, reported by Tom.

[tom@localhost ~]$ sudo mount /dev/loop0 /mnt/
[tom@localhost ~]$ sudo chown tom:tom /mnt/
[tom@localhost ~]$ touch /mnt/testfile
[tom@localhost ~]$ sudo chattr +i /mnt/testfile
[tom@localhost ~]$ echo test > /mnt/testfile
bash: /mnt/testfile: Operation not permitted
[tom@localhost ~]$ rm /mnt/testfile
rm: cannot remove '/mnt/testfile': Operation not permitted
[tom@localhost ~]$ sudo umount /mnt/
[tom@localhost ~]$ sudo mount /dev/loop0 /mnt/
[tom@localhost ~]$ lsattr /mnt/testfile
----i-------------- /mnt/testfile
[tom@localhost ~]$ echo test > /mnt/testfile
[tom@localhost ~]$ rm /mnt/testfile
[tom@localhost ~]$ sudo umount /mnt/

Cc: stable@vger.kernel.org
Reported-by: Tom Yan <tom.ty89@outlook.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:31 -07:00
J. Bruce Fields
9a307403d3 nfsd4: fix null dereference on replay
if we receive a compound such that:

	- the sessionid, slot, and sequence number in the SEQUENCE op
	  match a cached succesful reply with N ops, and
	- the Nth operation of the compound is a PUTFH, PUTPUBFH,
	  PUTROOTFH, or RESTOREFH,

then nfsd4_sequence will return 0 and set cstate->status to
nfserr_replay_cache.  The current filehandle will not be set.  This will
cause us to call check_nfsd_access with first argument NULL.

To nfsd4_compound it looks like we just succesfully executed an
operation that set a filehandle, but the current filehandle is not set.

Fix this by moving the nfserr_replay_cache earlier.  There was never any
reason to have it after the encode_op label, since the only case where
he hit that is when opdesc->op_func sets it.

Note that there are two ways we could hit this case:

	- a client is resending a previously sent compound that ended
	  with one of the four PUTFH-like operations, or
	- a client is sending a *new* compound that (incorrectly) shares
	  sessionid, slot, and sequence number with a previously sent
	  compound, and the length of the previously sent compound
	  happens to match the position of a PUTFH-like operation in the
	  new compound.

The second is obviously incorrect client behavior.  The first is also
very strange--the only purpose of a PUTFH-like operation is to set the
current filehandle to be used by the following operation, so there's no
point in having it as the last in a compound.

So it's likely this requires a buggy or malicious client to reproduce.

Reported-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@kernel.vger.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-05-23 14:20:58 -04:00
Eric W. Biederman
296990deb3 mnt: Make propagate_umount less slow for overlapping mount propagation trees
Andrei Vagin pointed out that time to executue propagate_umount can go
non-linear (and take a ludicrious amount of time) when the mount
propogation trees of the mounts to be unmunted by a lazy unmount
overlap.

Make the walk of the mount propagation trees nearly linear by
remembering which mounts have already been visited, allowing
subsequent walks to detect when walking a mount propgation tree or a
subtree of a mount propgation tree would be duplicate work and to skip
them entirely.

Walk the list of mounts whose propgatation trees need to be traversed
from the mount highest in the mount tree to mounts lower in the mount
tree so that odds are higher that the code will walk the largest trees
first, allowing later tree walks to be skipped entirely.

Add cleanup_umount_visitation to remover the code's memory of which
mounts have been visited.

Add the functions last_slave and skip_propagation_subtree to allow
skipping appropriate parts of the mount propagation tree without
needing to change the logic of the rest of the code.

A script to generate overlapping mount propagation trees:

$ cat runs.h
set -e
mount -t tmpfs zdtm /mnt
mkdir -p /mnt/1 /mnt/2
mount -t tmpfs zdtm /mnt/1
mount --make-shared /mnt/1
mkdir /mnt/1/1

iteration=10
if [ -n "$1" ] ; then
	iteration=$1
fi

for i in $(seq $iteration); do
	mount --bind /mnt/1/1 /mnt/1/1
done

mount --rbind /mnt/1 /mnt/2

TIMEFORMAT='%Rs'
nr=$(( ( 2 ** ( $iteration + 1 ) ) + 1 ))
echo -n "umount -l /mnt/1 -> $nr        "
time umount -l /mnt/1

nr=$(cat /proc/self/mountinfo | grep zdtm | wc -l )
time umount -l /mnt/2

$ for i in $(seq 9 19); do echo $i; unshare -Urm bash ./run.sh $i; done

Here are the performance numbers with and without the patch:

     mhash |  8192   |  8192  | 1048576 | 1048576
    mounts | before  | after  |  before | after
    ------------------------------------------------
      1025 |  0.040s | 0.016s |  0.038s | 0.019s
      2049 |  0.094s | 0.017s |  0.080s | 0.018s
      4097 |  0.243s | 0.019s |  0.206s | 0.023s
      8193 |  1.202s | 0.028s |  1.562s | 0.032s
     16385 |  9.635s | 0.036s |  9.952s | 0.041s
     32769 | 60.928s | 0.063s | 44.321s | 0.064s
     65537 |         | 0.097s |         | 0.097s
    131073 |         | 0.233s |         | 0.176s
    262145 |         | 0.653s |         | 0.344s
    524289 |         | 2.305s |         | 0.735s
   1048577 |         | 7.107s |         | 2.603s

Andrei Vagin reports fixing the performance problem is part of the
work to fix CVE-2016-6213.

Cc: stable@vger.kernel.org
Fixes: a05964f391 ("[PATCH] shared mounts handling: umount")
Reported-by: Andrei Vagin <avagin@openvz.org>
Reviewed-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-05-23 08:41:17 -05:00
Eric W. Biederman
99b19d1647 mnt: In propgate_umount handle visiting mounts in any order
While investigating some poor umount performance I realized that in
the case of overlapping mount trees where some of the mounts are locked
the code has been failing to unmount all of the mounts it should
have been unmounting.

This failure to unmount all of the necessary
mounts can be reproduced with:

$ cat locked_mounts_test.sh

mount -t tmpfs test-base /mnt
mount --make-shared /mnt
mkdir -p /mnt/b

mount -t tmpfs test1 /mnt/b
mount --make-shared /mnt/b
mkdir -p /mnt/b/10

mount -t tmpfs test2 /mnt/b/10
mount --make-shared /mnt/b/10
mkdir -p /mnt/b/10/20

mount --rbind /mnt/b /mnt/b/10/20

unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
sleep 1
umount -l /mnt/b
wait %%

$ unshare -Urm ./locked_mounts_test.sh

This failure is corrected by removing the prepass that marks mounts
that may be umounted.

A first pass is added that umounts mounts if possible and if not sets
mount mark if they could be unmounted if they weren't locked and adds
them to a list to umount possibilities.  This first pass reconsiders
the mounts parent if it is on the list of umount possibilities, ensuring
that information of umoutability will pass from child to mount parent.

A second pass then walks through all mounts that are umounted and processes
their children unmounting them or marking them for reparenting.

A last pass cleans up the state on the mounts that could not be umounted
and if applicable reparents them to their first parent that remained
mounted.

While a bit longer than the old code this code is much more robust
as it allows information to flow up from the leaves and down
from the trunk making the order in which mounts are encountered
in the umount propgation tree irrelevant.

Cc: stable@vger.kernel.org
Fixes: 0c56fe3142 ("mnt: Don't propagate unmounts to locked mounts")
Reviewed-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-05-23 08:41:16 -05:00
Eric W. Biederman
570487d3fa mnt: In umount propagation reparent in a separate pass
It was observed that in some pathlogical cases that the current code
does not unmount everything it should.  After investigation it
was determined that the issue is that mnt_change_mntpoint can
can change which mounts are available to be unmounted during mount
propagation which is wrong.

The trivial reproducer is:
$ cat ./pathological.sh

mount -t tmpfs test-base /mnt
cd /mnt
mkdir 1 2 1/1
mount --bind 1 1
mount --make-shared 1
mount --bind 1 2
mount --bind 1/1 1/1
mount --bind 1/1 1/1
echo
grep test-base /proc/self/mountinfo
umount 1/1
echo
grep test-base /proc/self/mountinfo

$ unshare -Urm ./pathological.sh

The expected output looks like:
46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

The output without the fix looks like:
46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

That last mount in the output was in the propgation tree to be unmounted but
was missed because the mnt_change_mountpoint changed it's parent before the walk
through the mount propagation tree observed it.

Cc: stable@vger.kernel.org
Fixes: 1064f874ab ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-05-23 08:40:32 -05:00
Konstantin Khlebnikov
887a973061 ext4: keep existing extra fields when inode expands
ext4_expand_extra_isize() should clear only space between old and new
size.

Fixes: 6dd4ee7cab # v2.6.23
Cc: stable@vger.kernel.org
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:36:23 -04:00
Konstantin Khlebnikov
9651e6b2e2 ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors
I've got another report about breaking ext4 by ENOMEM error returned from
ext4_mb_load_buddy() caused by memory shortage in memory cgroup.
This time inside ext4_discard_preallocations().

This patch replaces ext4_error() with ext4_warning() where errors returned
from ext4_mb_load_buddy() are not fatal and handled by caller:
* ext4_mb_discard_group_preallocations() - called before generating ENOSPC,
  we'll try to discard other group or return ENOSPC into user-space.
* ext4_trim_all_free() - just stop trimming and return ENOMEM from ioctl.

Some callers cannot handle errors, thus __GFP_NOFAIL is used for them:
* ext4_discard_preallocations()
* ext4_mb_discard_lg_preallocations()

Fixes: adb7ef600c ("ext4: use __GFP_NOFAIL in ext4_free_blocks()")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:35:23 -04:00
Jan Kara
3f1d5bad3f ext4: fix off-by-in in loop termination in ext4_find_unwritten_pgoff()
There is an off-by-one error in loop termination conditions in
ext4_find_unwritten_pgoff() since 'end' may index a page beyond end of
desired range if 'endoff' is page aligned. It doesn't have any visible
effects but still it is good to fix it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:34:23 -04:00
Jan Kara
7d95eddf31 ext4: fix SEEK_HOLE
Currently, SEEK_HOLE implementation in ext4 may both return that there's
a hole at some offset although that offset already has data and skip
some holes during a search for the next hole. The first problem is
demostrated by:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (2.054 GiB/sec and 538461.5385 ops/sec)
Whence	Result
HOLE	0

Where we can see that SEEK_HOLE wrongly returned offset 0 as containing
a hole although we have written data there. The second problem can be
demonstrated by:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
       -c "seek -h 0" file

wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (1.978 GiB/sec and 518518.5185 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (2 GiB/sec and 500000.0000 ops/sec)
Whence	Result
HOLE	139264

Where we can see that hole at offsets 56k..128k has been ignored by the
SEEK_HOLE call.

The underlying problem is in the ext4_find_unwritten_pgoff() which is
just buggy. In some cases it fails to update returned offset when it
finds a hole (when no pages are found or when the first found page has
higher index than expected), in some cases conditions for detecting hole
are just missing (we fail to detect a situation where indices of
returned pages are not contiguous).

Fix ext4_find_unwritten_pgoff() to properly detect non-contiguous page
indices and also handle all cases where we got less pages then expected
in one place and handle it properly there.

CC: stable@vger.kernel.org
Fixes: c8c0df241c
CC: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:33:23 -04:00
Tahsin Erdogan
b4709067ac jbd2: preserve original nofs flag during journal restart
When a transaction starts, start_this_handle() saves current
PF_MEMALLOC_NOFS value so that it can be restored at journal stop time.
Journal restart is a special case that calls start_this_handle() without
stopping the transaction. start_this_handle() isn't aware that the
original value is already stored so it overwrites it with current value.

For instance, a call sequence like below leaves PF_MEMALLOC_NOFS flag set
at the end:

  jbd2_journal_start()
  jbd2__journal_restart()
  jbd2_journal_stop()

Make jbd2__journal_restart() restore the original value before calling
start_this_handle().

Fixes: 81378da64d ("jbd2: mark the transaction context with the scope GFP_NOFS context")
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-21 22:32:23 -04:00
Jan Kara
964edf66bf ext4: clear lockdep subtype for quota files on quota off
Quota files have special ranking of i_data_sem lock. We inform lockdep
about it when turning on quotas however when turning quotas off, we
don't clear the lockdep subclass from i_data_sem lock and thus when the
inode gets later reused for a normal file or directory, lockdep gets
confused and complains about possible deadlocks. Fix the problem by
resetting lockdep subclass of i_data_sem on quota off.

Cc: stable@vger.kernel.org
Fixes: daf647d2dd
Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:31:23 -04:00
Linus Torvalds
894e21642d Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A small collection of fixes that should go into this cycle.

   - a pull request from Christoph for NVMe, which ended up being
     manually applied to avoid pulling in newer bits in master. Mostly
     fibre channel fixes from James, but also a few fixes from Jon and
     Vijay

   - a pull request from Konrad, with just a single fix for xen-blkback
     from Gustavo.

   - a fuseblk bdi fix from Jan, fixing a regression in this series with
     the dynamic backing devices.

   - a blktrace fix from Shaohua, replacing sscanf() with kstrtoull().

   - a request leak fix for drbd from Lars, fixing a regression in the
     last series with the kref changes. This will go to stable as well"

* 'for-linus' of git://git.kernel.dk/linux-block:
  nvmet: release the sq ref on rdma read errors
  nvmet-fc: remove target cpu scheduling flag
  nvme-fc: stop queues on error detection
  nvme-fc: require target or discovery role for fc-nvme targets
  nvme-fc: correct port role bits
  nvme: unmap CMB and remove sysfs file in reset path
  blktrace: fix integer parse
  fuseblk: Fix warning in super_setup_bdi_name()
  block: xen-blkback: add null check to avoid null pointer dereference
  drbd: fix request leak introduced by locking/atomic, kref: Kill kref_sub()
2017-05-20 16:12:30 -07:00
Linus Torvalds
8c3fc1643d Merge branch 'libnvdimm-for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "A couple of compile fixes.

  With the removal of the ->direct_access() method from
  block_device_operations in favor of a new dax_device + dax_operations
  we broke two configurations.

  The CONFIG_BLOCK=n case is fixed by compiling out the block+dax
  helpers in the dax core. Configurations with FS_DAX=n EXT4=y / XFS=y
  and DAX=m fail due to the helpers the builtin filesystem needs being
  in a module, so we stub out the helpers in the FS_DAX=n case."

* 'libnvdimm-for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax, xfs, ext4: compile out iomap-dax paths in the FS_DAX=n case
  dax: fix false CONFIG_BLOCK dependency
2017-05-19 17:35:34 -07:00
Darrick J. Wong
3ecb3ac7b9 xfs: avoid mount-time deadlock in CoW extent recovery
If a malicious user corrupts the refcount btree to cause a cycle between
different levels of the tree, the next mount attempt will deadlock in
the CoW recovery routine while grabbing buffer locks.  We can use the
ability to re-grab a buffer that was previous locked to a transaction to
avoid deadlocks, so do that here.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-05-19 08:12:49 -07:00
Amir Goldstein
ee1d6d37b6 ovl: mark upper dir with type origin entries "impure"
When moving a merge dir or non-dir with copy up origin into a non-merge
upper dir (a.k.a pure upper dir), we are marking the target parent dir
"impure". ovl_iterate() iterates pure upper dirs directly, because there is
no need to filter out whiteouts and merge dir content with lower dir. But
for the case of an "impure" upper dir, ovl_iterate() will not be able to
iterate the real upper dir directly, because it will need to lookup the
origin inode and use it to fill d_ino.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-19 09:33:49 +02:00
Miklos Szeredi
3d27573ce3 ovl: remove unused arg from ovl_lookup_temp()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-19 09:33:49 +02:00
Amir Goldstein
21a2287811 ovl: handle rename when upper doesn't support xattr
On failure to set opaque/redirect xattr on rename, skip setting xattr and
return -EXDEV.

On failure to set opaque xattr when creating a new directory, -EIO is
returned instead of -EOPNOTSUPP.

Any failure to set those xattr will be recorded in super block and
then setting any xattr on upper won't be attempted again.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-19 09:33:49 +02:00
Jonathan Corbet
6312811be2 Merge remote-tracking branch 'mauro-exp/docbook3' into death-to-docbook
Mauro says:

This patch series convert the remaining DocBooks to ReST.

The first version was originally
send as 3 patch series:

   [PATCH 00/36] Convert DocBook documents to ReST
   [PATCH 0/5] Convert more books to ReST
   [PATCH 00/13] Get rid of DocBook

The lsm book was added as if it were a text file under
Documentation. The plan is to merge it with another file
under Documentation/security, after both this series and
a security Documentation patch series gets merged.

It also adjusts some Sphinx-pedantic errors/warnings on
some kernel-doc markups.

I also added some patches here to add PDF output for all
existing ReST books.
2017-05-18 11:03:08 -06:00
Miklos Szeredi
6266d465bd ovl: don't fail copy-up if upper doesn't support xattr
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-18 16:11:24 +02:00
Amir Goldstein
82b749b2c6 ovl: check on mount time if upper fs supports setting xattr
xattr are needed by overlayfs for setting opaque dir, redirect dir
and copy up origin.

Check at mount time by trying to set the overlay.opaque xattr on the
workdir and if that fails issue a warning message.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-18 16:11:24 +02:00
Amir Goldstein
8137ae26d2 ovl: fix creds leak in copy up error path
Fixes: 42f269b925 ("ovl: rearrange code in ovl_copy_up_locked()")
Cc: <stable@vger.kernel.org> # v4.11
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-18 16:11:24 +02:00
Jan Kara
69c8ebf832 fuseblk: Fix warning in super_setup_bdi_name()
Commit 5f7f7543f5 "fuse: Convert to separately allocated bdi" didn't
properly handle fuseblk filesystem. When fuse_bdi_init() is called for
that filesystem type, sb->s_bdi is already initialized (by
set_bdev_super()) to point to block device's bdi and consequently
super_setup_bdi_name() complains about this fact when reseting bdi to
the private one.

Fix the problem by properly dropping bdi reference in fuse_bdi_init()
before creating a private bdi in super_setup_bdi_name().

Fixes: 5f7f7543f5 ("fuse: Convert to separately allocated bdi")
Reported-by: Rakesh Pandit <rakesh@tuxera.com>
Tested-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-17 08:10:57 -06:00
Jin Qian
15d3042a93 f2fs: sanity check checkpoint segno and blkoff
Make sure segno and blkoff read from raw image are valid.

Cc: stable@vger.kernel.org
Signed-off-by: Jin Qian <jinqian@google.com>
[Jaegeuk Kim: adjust minor coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-16 13:29:39 -07:00
J. Bruce Fields
9512a16b0e nfsd: Revert "nfsd: check for oversized NFSv2/v3 arguments"
This reverts commit 51f5677777 "nfsd: check for oversized NFSv2/v3
arguments", which breaks support for NFSv3 ACLs.

That patch was actually an earlier draft of a fix for the problem that
was eventually fixed by e6838a29ec "nfsd: check for oversized NFSv2/v3
arguments".  But somehow I accidentally left this earlier draft in the
branch that was part of my 2.12 pull request.

Reported-by: Eryu Guan <eguan@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-05-16 16:16:30 -04:00
Darrick J. Wong
ea9a46e1c4 xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN
There were a number of handwaving complaints that one could "possibly"
use inode numbers and extent maps to fingerprint a filesystem hosting
multiple containers and somehow use the information to guess at the
contents of other containers and attack them.  Despite the total lack of
any demonstration that this is actually possible, it's easier to
restrict access now and broaden it later, so use the rmapbt fsmap
backends only if the caller has CAP_SYS_ADMIN.  Unprivileged users will
just have to make do with only getting the free space and static
metadata placement information.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-05-16 12:26:16 -07:00
Zorro Lang
892d2a5f70 xfs: bad assertion for delalloc an extent that start at i_size
By run fsstress long enough time enough in RHEL-7, I find an
assertion failure (harder to reproduce on linux-4.11, but problem
is still there):

  XFS: Assertion failed: (iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c

The assertion is in xfs_getbmap() funciton:

  if (map[i].br_startblock == DELAYSTARTBLOCK &&
-->   map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip)))
          ASSERT((iflags & BMV_IF_DELALLOC) != 0);

When map[i].br_startoff == XFS_B_TO_FSB(mp, XFS_ISIZE(ip)), the
startoff is just at EOF. But we only need to make sure delalloc
extents that are within EOF, not include EOF.

Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-16 09:24:36 -07:00
Darrick J. Wong
6e747506dd xfs: fix warnings about unused stack variables
Reduce stack usage and get rid of compiler warnings by eliminating
unused variables.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-05-16 09:24:36 -07:00
Darrick J. Wong
6eadbf4c8b xfs: BMAPX shouldn't barf on inline-format directories
When we're fulfilling a BMAPX request, jump out early if the data fork
is in local format.  This prevents us from hitting a debugging check in
bmapi_read and barfing errors back to userspace.  The on-disk extent
count check later isn't sufficient for IF_DELALLOC mode because da
extents are in memory and not on disk.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-05-16 09:24:36 -07:00
Brian Foster
0daaecacb8 xfs: fix indlen accounting error on partial delalloc conversion
The delalloc -> real block conversion path uses an incorrect
calculation in the case where the middle part of a delalloc extent
is being converted. This is documented as a rare situation because
XFS generally attempts to maximize contiguity by converting as much
of a delalloc extent as possible.

If this situation does occur, the indlen reservation for the two new
delalloc extents left behind by the conversion of the middle range
is calculated and compared with the original reservation. If more
blocks are required, the delta is allocated from the global block
pool. This delta value can be characterized as the difference
between the new total requirement (temp + temp2) and the currently
available reservation minus those blocks that have already been
allocated (startblockval(PREV.br_startblock) - allocated).

The problem is that the current code does not account for previously
allocated blocks correctly. It subtracts the current allocation
count from the (new - old) delta rather than the old indlen
reservation. This means that more indlen blocks than have been
allocated end up stashed in the remaining extents and free space
accounting is broken as a result.

Fix up the calculation to subtract the allocated block count from
the original extent indlen and thus correctly allocate the
reservation delta based on the difference between the new total
requirement and the unused blocks from the original reservation.
Also remove a bogus assert that contradicts the fact that the new
indlen reservation can be larger than the original indlen
reservation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-16 09:24:35 -07:00
Colin Ian King
bff5baf8aa btrfs: fix incorrect error return ret being passed to mapping_set_error
The setting of return code ret should be based on the error code
passed into function end_extent_writepage and not on ret. Thanks
to Liu Bo for spotting this mistake in the original fix I submitted.

Detected by CoverityScan, CID#1414312 ("Logically dead code")

Fixes: 5dca6eea91 ("Btrfs: mark mapping with error flag to report errors to userspace")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-05-16 15:42:10 +02:00
Jan Kara
8d91012528 btrfs: Make flush bios explicitely sync
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Fixes: b685d3d65a
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-05-16 15:42:01 +02:00
Qu Wenruo
4751832da9 btrfs: fiemap: Cache and merge fiemap extent before submit it to user
[BUG]
Cycle mount btrfs can cause fiemap to return different result.
Like:
 # mount /dev/vdb5 /mnt/btrfs
 # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        25088..25215       128   0x1
 # umount /mnt/btrfs
 # mount /dev/vdb5 /mnt/btrfs
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..31]:         25088..25119        32   0x0
   1: [32..63]:        25120..25151        32   0x0
   2: [64..95]:        25152..25183        32   0x0
   3: [96..127]:       25184..25215        32   0x1
But after above fiemap, we get correct merged result if we call fiemap
again.
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        25088..25215       128   0x1

[REASON]
Btrfs will try to merge extent map when inserting new extent map.

btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   |     |- btrfs_get_extent(start=0 len=64k)
   |        |  Found on-disk (ino, EXTENT_DATA, 0)
   |        |- add_extent_mapping()
   |        |- Return (em->start=0, len=16k)
   |
   |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
   |
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   |     |- btrfs_get_extent(start=16k len=48k)
   |        |  Found on-disk (ino, EXTENT_DATA, 16k)
   |        |- add_extent_mapping()
   |        |  |- try_merge_map()
   |        |     Merge with previous em start=0 len=16k
   |        |     resulting em start=0 len=32k
   |        |- Return (em->start=0, len=32K)    << Merged result
   |- Stripe off the unrelated range (0~16K) of return em
   |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
      ^^^ Causing split fiemap extent.

And since in add_extent_mapping(), em is already merged, in next
fiemap() call, we will get merged result.

[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.

And will always try to merge current fiemap_cache result before calling
fiemap_fill_next_extent().
Only when we failed to merge current fiemap extent with cached one, we
will call fiemap_fill_next_extent() to submit cached one.

So by this method, we can merge all fiemap extents.

It can also be done in fs/ioctl.c, however the problem is if
fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
extent.
So I choose to merge it in btrfs.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-05-16 15:41:53 +02:00
Mauro Carvalho Chehab
e1511a840a fs: fix the location of the kernel-api book
The kernel-api book is now part of the core-api. Update its
location.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:23 -03:00
Mauro Carvalho Chehab
e1b4fc7add fs: update location of filesystems documentation
The filesystem documentation was moved from DocBook to
Documentation/filesystems/. Update it at the sources.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:22 -03:00
Mauro Carvalho Chehab
df1b560a4a fs: jbd2: escape a string with special chars on a kernel-doc
kernel-doc will try to interpret a foo() string, except if
properly escaped.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:11 -03:00
Mauro Carvalho Chehab
f16df9f765 fs: eventfd: fix identation on kernel-doc
Sphinx require explicit tags in order to use a list of possible
values, otherwise it produces this error:

	./fs/eventfd.c:219: WARNING: Option list ends without a blank line; unexpected unindent.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:10 -03:00
Mauro Carvalho Chehab
0117d4272b fs: add a blank lines on some kernel-doc comments
Sphinx gets confused when it finds identation without a
good reason for it and without a preceding blank line:

	./fs/mpage.c:347: ERROR: Unexpected indentation.
	./fs/namei.c:4303: ERROR: Unexpected indentation.
	./fs/fs-writeback.c:2060: ERROR: Unexpected indentation.

No functional changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:10 -03:00
Mauro Carvalho Chehab
91e4775d0f fs: jbd2: make jbd2_journal_start() kernel-doc parseable
kernel-doc script expects that a function documentation to
be just before the function, otherwise it will be ignored.

So, move the kernel-doc markup to the right place.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:09 -03:00
Linus Torvalds
1319a2856d Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "A set of minor cifs fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  [CIFS] Minor cleanup of xattr query function
  fs: cifs: transport: Use time_after for time comparison
  SMB2: Fix share type handling
  cifs: cifsacl: Use a temporary ops variable to reduce code length
  Don't delay freeing mids when blocked on slow socket write of request
  CIFS: silence lockdep splat in cifs_relock_file()
2017-05-15 15:27:02 -07:00
Christoph Hellwig
bb2a8b0cd1 nfsd4: const-ify nfsd4_ops
nfsd4_ops contains function pointers, and marking it as constant avoids
it being able to be used as an attach vector for code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:32 +02:00
Christoph Hellwig
e9679189e3 sunrpc: mark all struct svc_version instances as const
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:31 +02:00
Christoph Hellwig
860bda29b9 sunrpc: mark all struct svc_procinfo instances as const
struct svc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:31 +02:00
Christoph Hellwig
7fd38af9ca sunrpc: move pc_count out of struct svc_procinfo
pc_count is the only writeable memeber of struct svc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.

This patch moves it into out out struct svc_procinfo, and into a
separate writable array that is pointed to by struct svc_version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:30 +02:00
Christoph Hellwig
eb69853da9 nfsd4: properly type op_func callbacks
Pass union nfsd4_op_u to the op_func callbacks instead of using unsafe
function pointer casts.

It also adds two missing structures to struct nfsd4_op.u to facilitate
this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:29 +02:00
Christoph Hellwig
1c1226385b nfsd4: remove nfsd4op_rsize
Except for a lot of unnecessary casts this typedef only has one user,
so remove the casts and expand it in struct nfsd4_operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:28 +02:00
Christoph Hellwig
57832e7bd8 nfsd4: properly type op_get_currentstateid callbacks
Pass union nfsd4_op_u to the op_set_currentstateid callbacks instead of
using unsafe function pointer casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:27 +02:00
Christoph Hellwig
b60e985980 nfsd4: properly type op_set_currentstateid callbacks
Given the args union in struct nfsd4_op a name, and pass it to the
op_set_currentstateid callbacks instead of using unsafe function
pointer casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:27 +02:00
Christoph Hellwig
63f8de3795 sunrpc: properly type pc_encode callbacks
Drop the resp argument as it can trivially be derived from the rqstp
argument.  With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:25 +02:00
Christoph Hellwig
026fec7e7c sunrpc: properly type pc_decode callbacks
Drop the argp argument as it can trivially be derived from the rqstp
argument.  With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:24 +02:00
Christoph Hellwig
8537488b5a sunrpc: properly type pc_release callbacks
Drop the p and resp arguments as they are always NULL or can trivially
be derived from the rqstp argument.  With that all functions now have the
same prototype, and we can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:23 +02:00
Christoph Hellwig
a6beb73272 sunrpc: properly type pc_func callbacks
Drop the argp and resp arguments as they can trivially be derived from
the rqstp argument.  With that all functions now have the same prototype,
and we can remove the unsafe casting to svc_procfunc as well as the
svc_procfunc typedef itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:23 +02:00
Christoph Hellwig
9482c9c15c nfsd: remove the unused PROC() macro in nfs3proc.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:22 +02:00
Christoph Hellwig
f7235b6bc5 nfsd: use named initializers in PROC()
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:21 +02:00
Christoph Hellwig
02be49f6b7 nfsd4: const-ify nfs_cb_version4
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:20 +02:00
Christoph Hellwig
499b498810 sunrpc: mark all struct rpc_procinfo instances as const
struct rpc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:20 +02:00
Christoph Hellwig
f700c72dd2 nfs: use ARRAY_SIZE() in the nfsacl_version3 declaration
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:19 +02:00
Christoph Hellwig
1c5876ddbd sunrpc: move p_count out of struct rpc_procinfo
p_count is the only writeable memeber of struct rpc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.

This patch moves it into out out struct rpc_procinfo, and into a
separate writable array that is pointed to by struct rpc_version and
indexed by p_statidx.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:18 +02:00
Christoph Hellwig
cdfa31e93f lockd: fix some weird indentation
Remove double indentation of a few struct rpc_version and
struct rpc_program instance.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:17 +02:00
Christoph Hellwig
f4dac4ade5 nfs: don't cast callback decode/proc/encode routines
Instead declare all functions with the proper methods signature.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:16 +02:00
Christoph Hellwig
18d9cff400 nfs: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:16 +02:00
Christoph Hellwig
1fa2339123 lockd: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:15 +02:00
Christoph Hellwig
d39916c487 nfsd: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-05-15 17:42:14 +02:00
Christoph Hellwig
1502c81b44 nfsd: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-05-15 17:42:10 +02:00
Christoph Hellwig
0096d39b96 nfs: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:09 +02:00
Christoph Hellwig
bf96391e7b lockd: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:09 +02:00
Arnd Bergmann
72d42504bd ovl: select EXPORTFS
We get a link error when EXPORTFS is not enabled:

ERROR: "exportfs_encode_fh" [fs/overlayfs/overlay.ko] undefined!
ERROR: "exportfs_decode_fh" [fs/overlayfs/overlay.ko] undefined!

This adds a Kconfig 'select' statement for overlayfs, the same way that
it is done for the other users of exportfs.

Fixes: 3a1e819b4e ("ovl: store file handle of lower inode on copy up")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-15 10:53:07 +02:00
Dan Williams
f5705aa8cf dax, xfs, ext4: compile out iomap-dax paths in the FS_DAX=n case
Tetsuo reports:

  fs/built-in.o: In function `xfs_file_iomap_end':
  xfs_iomap.c:(.text+0xe0ef9): undefined reference to `put_dax'
  fs/built-in.o: In function `xfs_file_iomap_begin':
  xfs_iomap.c:(.text+0xe1a7f): undefined reference to `dax_get_by_host'
  make: *** [vmlinux] Error 1
  $ grep DAX .config
  CONFIG_DAX=m
  # CONFIG_DEV_DAX is not set
  # CONFIG_FS_DAX is not set

When FS_DAX=n we can/must throw away the dax code in filesystems.
Implement 'fs_' versions of dax_get_by_host() and put_dax() that are
nops in the FS_DAX=n case.

Cc: <linux-xfs@vger.kernel.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: Jan Kara <jack@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Fixes: ef51042472 ("block, dax: move 'select DAX' from BLOCK to FS_DAX")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-13 17:52:16 -07:00
Linus Torvalds
b53c4d5eb7 This pull request contains updates for both UBI and UBIFS:
- New config option CONFIG_UBIFS_FS_SECURITY
 - Minor improvements
 - Random fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJZFuwKAAoJEEtJtSqsAOnWYrUP/0/y7PEh0ZGdi4kkQy/CnuJr
 pmybsQ0TbLoljahuDXqShKkMNvuXIvKSKcHROIsXreG+DCfC3v/srZlvRt7UCPOE
 QVvjh0sQTaMUrfcaTTM9g3Im/BZX9MueTaSF2Rgx1lF+R2t3InW1bv9hvmQxfoEA
 N75tJgH69mii5pDWuGgLLjmmxhbSkMGpM31QeO5DUaLRqdXcc5L5iK5Hnd+Wtj81
 oSB5RsergCfk17jaWH2e7G03LB2tm6AhM5oksTOpZ9+OIW9GOiUMfjYFC2ZYRwzx
 zHhnh0rGPfFv0jO5u4CXtWaQDfxyw6Z7XLK+Xo1RemkhM/7AQl2xetfIVDXErgoA
 NxxN/a8MWcEpJ2x6y/Z8740HXjyQjt9h3nHzlVPNP8hz68J796E7UzRjCQtf7Iyh
 xqhfjMabxfBqcLkTESvgmjcuwo1IkqOaFBjIw2Cd2nfBEkCKzoaINjRHitgUGj/z
 Mm1CJNWvaK6QTdZ3iCETCyPQI02A+4ZXhDf/QZS3wRAMc1v45pS/dVeBn+0F8Nrc
 ASiQwcd7u1IfJa3A6d6DgMECUWBXjc1GGMfMyhS/ta56pOfe1RyR3bg9WuISqUMe
 86id9tiSs7cP2UVFTrFFFWAO3rATj+9cOO9f2LTujPzcd88cJhKSykaLPmELfyE9
 YUPw9lpExwyXLn7S46LQ
 =9ZJe
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS updates from Richard Weinberger:

 - new config option CONFIG_UBIFS_FS_SECURITY

 - minor improvements

 - random fixes

* tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs:
  ubi: Add debugfs file for tracking PEB state
  ubifs: Fix a typo in comment of ioctl2ubifs & ubifs2ioctl
  ubifs: Remove unnecessary assignment
  ubifs: Fix cut and paste error on sb type comparisons
  ubi: fastmap: Fix slab corruption
  ubifs: Add CONFIG_UBIFS_FS_SECURITY to disable/enable security labels
  ubi: Make mtd parameter readable
  ubi: Fix section mismatch
2017-05-13 10:23:12 -07:00
Linus Torvalds
1251704a63 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "15 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, docs: update memory.stat description with workingset* entries
  mm: vmscan: scan until it finds eligible pages
  mm, thp: copying user pages must schedule on collapse
  dax: fix PMD data corruption when fault races with write
  dax: fix data corruption when fault races with write
  ext4: return to starting transaction in ext4_dax_huge_fault()
  mm: fix data corruption due to stale mmap reads
  dax: prevent invalidation of mapped DAX entries
  Tigran has moved
  mm, vmalloc: fix vmalloc users tracking properly
  mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin
  gcov: support GCC 7.1
  mm, vmstat: Remove spurious WARN() during zoneinfo print
  time: delete current_fs_time()
  hwpoison, memcg: forcibly uncharge LRU pages
2017-05-13 09:49:35 -07:00
Steve French
67b4c889cc [CIFS] Minor cleanup of xattr query function
Some minor cleanup of cifs query xattr functions (will also make
SMB3 xattr implementation cleaner as well).

Signed-off-by: Steve French <steve.french@primarydata.com>
2017-05-12 20:59:10 -05:00
Karim Eshapa
4328fea77c fs: cifs: transport: Use time_after for time comparison
Use time_after kernel macro for time comparison
that has safety check.

Signed-off-by: Karim Eshapa <karim.eshapa@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-12 19:56:44 -05:00
Christophe JAILLET
cd1230070a SMB2: Fix share type handling
In fs/cifs/smb2pdu.h, we have:
#define SMB2_SHARE_TYPE_DISK    0x01
#define SMB2_SHARE_TYPE_PIPE    0x02
#define SMB2_SHARE_TYPE_PRINT   0x03

Knowing that, with the current code, the SMB2_SHARE_TYPE_PRINT case can
never trigger and printer share would be interpreted as disk share.

So, test the ShareType value for equality instead.

Fixes: faaf946a7d ("CIFS: Add tree connect/disconnect capability for SMB2")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-12 19:55:56 -05:00
Joe Perches via samba-technical
ecdcf622eb cifs: cifsacl: Use a temporary ops variable to reduce code length
Create an ops variable to store tcon->ses->server->ops and cache
indirections and reduce code size a trivial bit.

$ size fs/cifs/cifsacl.o*
   text	   data	    bss	    dec	    hex	filename
   5338	    136	      8	   5482	   156a	fs/cifs/cifsacl.o.new
   5371	    136	      8	   5515	   158b	fs/cifs/cifsacl.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-12 19:45:18 -05:00
Ross Zwisler
876f29460c dax: fix PMD data corruption when fault races with write
This is based on a patch from Jan Kara that fixed the equivalent race in
the DAX PTE fault path.

Currently DAX PMD read fault can race with write(2) in the following
way:

CPU1 - write(2)                 CPU2 - read fault
                                dax_iomap_pmd_fault()
                                  ->iomap_begin() - sees hole

dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate

                                  grab_mapping_entry()
				  - we add huge zero page to the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault.  That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6 ("dax: Call ->iomap_begin without entry lock during dax fault")
Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:16 -07:00
Jan Kara
13e451fdc1 dax: fix data corruption when fault races with write
Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault.  That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6
Link: http://lkml.kernel.org/r/20170510085419.27601-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:16 -07:00
Jan Kara
fb26a1cbed ext4: return to starting transaction in ext4_dax_huge_fault()
DAX will return to locking exceptional entry before mapping blocks for a
page fault to fix possible races with concurrent writes.  To avoid lock
inversion between exceptional entry lock and transaction start, start
the transaction already in ext4_dax_huge_fault().

Fixes: 9f141d6ef6
Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:16 -07:00
Jan Kara
cd656375f9 mm: fix data corruption due to stale mmap reads
Currently, we didn't invalidate page tables during invalidate_inode_pages2()
for DAX.  That could result in e.g. 2MiB zero page being mapped into
page tables while there were already underlying blocks allocated and
thus data seen through mmap were different from data seen by read(2).
The following sequence reproduces the problem:

 - open an mmap over a 2MiB hole

 - read from a 2MiB hole, faulting in a 2MiB zero page

 - write to the hole with write(3p). The write succeeds but we
   incorrectly leave the 2MiB zero page mapping intact.

 - via the mmap, read the data that was just written. Since the zero
   page mapping is still intact we read back zeroes instead of the new
   data.

Fix the problem by unconditionally calling invalidate_inode_pages2_range()
in dax_iomap_actor() for new block allocations and by properly
invalidating page tables in invalidate_inode_pages2_range() for DAX
mappings.

Fixes: c6dcf52c23
Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:15 -07:00
Ross Zwisler
4636e70bb0 dax: prevent invalidation of mapped DAX entries
Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
v4.

This series fixes data corruption that can happen for DAX mounts when
page faults race with write(2) and as a result page tables get out of
sync with block mappings in the filesystem and thus data seen through
mmap is different from data seen through read(2).

The series passes testing with t_mmap_stale test program from Ross and
also other mmap related tests on DAX filesystem.

This patch (of 4):

dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked.  This is done via:

  invalidate_mapping_pages()
    invalidate_exceptional_entry()
      dax_invalidate_mapping_entry()

However, for page cache pages removed in invalidate_mapping_pages()
there is an additional criteria which is that the page must not be
mapped.  This is noted in the comments above invalidate_mapping_pages()
and is checked in invalidate_inode_page().

For DAX entries this means that we can can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry.  This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.

We aren't able to unmap the DAX exceptional entry because according to
its comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().

Fixes: c6dcf52c23 ("mm: Invalidate DAX radix tree entries only if appropriate")
Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>    [4.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:15 -07:00
Andrew Morton
cea582247a Tigran has moved
Cc: Tigran Aivazian <aivazian.tigran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:15 -07:00
Linus Torvalds
0fcc3ab23d Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "Incremental fixes and a small feature addition on top of the main
  libnvdimm 4.12 pull request:

   - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
     The size regression is fixed by moving all dax helpers into the
     dax-core and only specifying "select DAX" for FS_DAX and
     dax-capable drivers. He also asked for clarification of the
     NR_DEV_DAX config option which, on closer look, does not need to be
     a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
     for good measure.

   - Ben's attention to detail on -stable patch submissions caught a
     case where the recent fixes to arch_copy_from_iter_pmem() missed a
     condition where we strand dirty data in the cache. This is tagged
     for -stable and will also be included in the rework of the pmem api
     to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

   - Vishal adds a feature that missed the initial pull due to pending
     review feedback. It allows the kernel to clear media errors when
     initializing a BTT (atomic sector update driver) instance on a pmem
     namespace.

   - Ross noticed that the dax_device + dax_operations conversion broke
     __dax_zero_page_range(). The nvdimm unit tests fail to check this
     path, but xfstests immediately trips over it. No excuse for missing
     this before submitting the 4.12 pull request.

  These all pass the nvdimm unit tests and an xfstests spot check. The
  set has received a build success notification from the kbuild robot"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  filesystem-dax: fix broken __dax_zero_page_range() conversion
  libnvdimm, btt: ensure that initializing metadata clears poison
  libnvdimm: add an atomic vs process context flag to rw_bytes
  x86, pmem: Fix cache flushing for iovec write < 8 bytes
  device-dax: kill NR_DEV_DAX
  block, dax: move "select DAX" from BLOCK to FS_DAX
  device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX
2017-05-12 15:43:10 -07:00
Linus Torvalds
050453295f Merge branch 'work.sane_pwd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Making sure that something like a referral point won't end up as pwd
  or root.

  The main part is the last commit (fixing mntns_install()); that one
  fixes a hard-to-hit race. The fchdir() commit is making fchdir(2) a
  bit more robust - it should be impossible to get opened files (even
  O_PATH ones) for referral points in the first place, so the existing
  checks are OK, but checking the same thing as in chdir(2) is just as
  cheap.

  The path_init() commit removes a redundant check that shouldn't have
  been there in the first place"

* 'work.sane_pwd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make sure that mntns_install() doesn't end up with referral for root
  path_init(): don't bother with checking MAY_EXEC for LOOKUP_ROOT
  make sure that fchdir() won't accept referral points, etc.
2017-05-12 11:39:59 -07:00
Linus Torvalds
9786e34e0a MTD updates for 4.12-rc1:
NAND, from Boris:
 """
  - some minor fixes/improvements on existing drivers (fsmc, gpio, ifc,
    davinci, brcmnand, omap)
  - a huge cleanup/rework of the denali driver accompanied with core
    fixes/improvements to simplify the driver code
  - a complete rewrite of the atmel driver to support new DT bindings
    make future evolution easier
  - the addition of per-vendor detection/initialization steps to avoid
    extending the nand_ids table with more extended-id entries
 """
 
 SPI NOR, from Cyrille:
 """
 - fixes in the hisi SPI controller driver.
 - fixes in the intel SPI controller driver.
 - fixes in the Mediatek SPI controller driver.
 - fixes to some SPI flash memories not supported the Chip Erase command.
 - add support to some new memory parts (Winbond, Macronix, Micron, ESMT).
 - add new driver for the STM32 QSPI controller.
 """
 
 And a few fixes for Gemini and Versatile platforms on physmap-of
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZE86yAAoJEFySrpd9RFgtlOoP/1o1s8dlKdd4TazdoxBTL2wy
 C4wPkqPWyfREcD5ZUYJgr6ENI2OnEwcAxAt2CXnqegx+ZIPToBW4/WK9gj/TNLRx
 AfSOz+EPPzo5uZwJPnfocgIFYuhsspymvmISwv66kPbjfkrSjo1l/K9nem3gh7an
 IkQdVVq8brvxNeDZOAzbsT2Y5DZNfs00g1jLXkcQrpfM0sWKcbHIUa0BTWy4WKGV
 ElTr+xh7QHh/Pd9/A5znd3xX54w5+YR/xe38jSBfTb0vEgw/RIfhIcnvxQ8G/7Se
 jE0+8GR5ZJGKwA9Xk5nFzS2G3uECMFNS75KfxkZ0LlEE6ivUvpDbokCbIU4bDOCt
 /8bWQf9AGA3gLHGgNUQTSt5HrkBXTGp917jtAZbI/y2MzTkLw3aAZ/m/j37vv9ON
 ezeGRO6VWK3bcimLFrt6KO5emYstmm4Tp4rRe3jakH7eyTlINDsecKtuMo2xVzyZ
 kK3tnDMdEntECAiKh3ndRdAUL3fs+/IdzWTAxnF9VQFQs1YxiZ1K8kY/zcN+rzbn
 CVkEhdm+tdDBx8XgOdfnOTGRAJ07dGOoDhLPR4/egC/ta6GIRkHQjFSwsW7bD9p9
 phHH6nQX9Bpza1JV/xvljezoHjvZkny4UhRpLgYMowb41DXv7os7ZV+g7kf5sd0i
 mGzCH46j0DmWQ1u5/Q6j
 =dxj5
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20170510' of git://git.infradead.org/linux-mtd

Pull MTD updates from Brian Norris:
 "NAND, from Boris:
   - some minor fixes/improvements on existing drivers (fsmc, gpio, ifc,
     davinci, brcmnand, omap)
   - a huge cleanup/rework of the denali driver accompanied with core
     fixes/improvements to simplify the driver code
   - a complete rewrite of the atmel driver to support new DT bindings
     make future evolution easier
   - the addition of per-vendor detection/initialization steps to avoid
     extending the nand_ids table with more extended-id entries

  SPI NOR, from Cyrille:
   - fixes in the hisi, intel and Mediatek SPI controller drivers
   - fixes to some SPI flash memories not supporting the Chip Erase
     command.
   - add support to some new memory parts (Winbond, Macronix, Micron,
     ESMT).
   - add new driver for the STM32 QSPI controller

  And a few fixes for Gemini and Versatile platforms on physmap-of"

* tag 'for-linus-20170510' of git://git.infradead.org/linux-mtd: (100 commits)
  MAINTAINERS: Update NAND subsystem git repositories
  mtd: nand: gpio: update binding
  mtd: nand: add ooblayout for old hamming layout
  mtd: oxnas_nand: Allocating more than necessary in probe()
  dt-bindings: mtd: Document the STM32 QSPI bindings
  mtd: mtk-nor: set controller's address width according to nor flash
  mtd: spi-nor: add driver for STM32 quad spi flash controller
  mtd: nand: brcmnand: Check flash #WP pin status before nand erase/program
  mtd: nand: davinci: add comment on NAND subpage write status on keystone
  mtd: nand: omap2: Fix partition creation via cmdline mtdparts
  mtd: nand: NULL terminate a of_device_id table
  mtd: nand: Fix a couple error codes
  mtd: nand: allow drivers to request minimum alignment for passed buffer
  mtd: nand: allocate aligned buffers if NAND_OWN_BUFFERS is unset
  mtd: nand: denali: allow to override revision number
  mtd: nand: denali_dt: use pdev instead of ofdev for platform_device
  mtd: nand: denali_dt: remove dma-mask DT property
  mtd: nand: denali: support 64bit capable DMA engine
  mtd: nand: denali_dt: enable HW_ECC_FIXUP for Altera SOCFPGA variant
  mtd: nand: denali: support HW_ECC_FIXUP capability
  ...
2017-05-11 10:44:22 -07:00
Dan Williams
e84b83b9ee filesystem-dax: fix broken __dax_zero_page_range() conversion
The conversion of __dax_zero_page_range() to 'struct dax_operations'
caused it to frequently fail. The mistake was treating the @size
parameter as a dax mapping length rather than just a length of the
clear_pmem() operation. The dax mapping length is assumed to be hard
coded as PAGE_SIZE.

Without this fix any page unaligned zeroing request will trigger a
-EINVAL return from bdev_dax_pgoff().

Cc: Jan Kara <jack@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: cccbce6715 ("filesystem-dax: convert to dax_direct_access()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-10 21:46:55 -07:00
Linus Torvalds
291b38a756 Annotation of module parameters that specify device settings
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAWPiW6vSw1s6N8H32AQLOrw/+NTqGf7bjq+64YKS6NfR0XDgE+wNJltGO
 ck7zJW3NHIg76RNu8s0I9xg5aVmwizz3Z5DGROZquaolnezux4tQihZ3AFyxIzLc
 +Y3WHYagcML7yFfjl/WznCLRD5EW3yPln4lCvQO0nW/xICRYeRI057JaIbi2Dtek
 BhcXt3c4AjXDLdYJkgtHV3p2R2mt8hcdFdWqqx6s7JaIThZNRGNzxAgtbcB9k5IW
 HVG9ZEIL73VBYWHrYivzjHYF5rBnNCPt87eOwDQeTOSkhv8te+u9k+bH8vxZw1T0
 XUtDrLBndKiuVo2GUfLkkF8LItx3Q9eLCJYy0joaIliyPqTEsPx9KjQ+Af0cxS9s
 ZPCZ5SYf96stKmDeL5xaMfrAmeyVHJ4lc4JTOqdzbIT8blsOSfYO/03p0ALShSDv
 /RQLaKGlf8Bjoy8PwKFcXb4sIDufcd/U1Av/EMFXxOfgN/u2JUkGKq6EaIM5B68L
 fHPje+aR9VNELPmPjwNOWtmN4I79EH3EItQf7zv0KG+UeKhcHLx/EAcSJ3ZRKEkH
 Lathg7pPOEJGArPiVO79TZzBG01ADn1aiwv65XObMzNZ+54xI/mN/Y1DNF/kL5jU
 XzvNzEjFt8mwMIZGVNdAt4+pDyMfIZGZSyUkSRKFnaQZMIvQrfQIU9RLBYLX5eOx
 +/p0VkIwDpg=
 =lbS7
 -----END PGP SIGNATURE-----

Merge tag 'hwparam-20170420' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull hw lockdown support from David Howells:
 "Annotation of module parameters that configure hardware resources
  including ioports, iomem addresses, irq lines and dma channels.

  This allows a future patch to prohibit the use of such module
  parameters to prevent that hardware from being abused to gain access
  to the running kernel image as part of locking the kernel down under
  UEFI secure boot conditions.

  Annotations are made by changing:

        module_param(n, t, p)
        module_param_named(n, v, t, p)
        module_param_array(n, t, m, p)

  to:

        module_param_hw(n, t, hwtype, p)
        module_param_hw_named(n, v, t, hwtype, p)
        module_param_hw_array(n, t, hwtype, m, p)

  where the module parameter refers to a hardware setting

  hwtype specifies the type of the resource being configured. This can
  be one of:

        ioport          Module parameter configures an I/O port
        iomem           Module parameter configures an I/O mem address
        ioport_or_iomem Module parameter could be either (runtime set)
        irq             Module parameter configures an I/O port
        dma             Module parameter configures a DMA channel
        dma_addr        Module parameter configures a DMA buffer address
        other           Module parameter configures some other value

  Note that the hwtype is compile checked, but not currently stored (the
  lockdown code probably won't require it). It is, however, there for
  future use.

  A bonus is that the hwtype can also be used for grepping.

  The intention is for the kernel to ignore or reject attempts to set
  annotated module parameters if lockdown is enabled. This applies to
  options passed on the boot command line, passed to insmod/modprobe or
  direct twiddling in /sys/module/ parameter files.

  The module initialisation then needs to handle the parameter not being
  set, by (1) giving an error, (2) probing for a value or (3) using a
  reasonable default.

  What I can't do is just reject a module out of hand because it may
  take a hardware setting in the module parameters. Some important
  modules, some ipmi stuff for instance, both probe for hardware and
  allow hardware to be manually specified; if the driver is aborts with
  any error, you don't get any ipmi hardware.

  Further, trying to do this entirely in the module initialisation code
  doesn't protect against sysfs twiddling.

  [!] Note that in and of itself, this series of patches should have no
      effect on the the size of the kernel or code execution - that is
      left to a patch in the next series to effect. It does mark
      annotated kernel parameters with a KERNEL_PARAM_FL_HWPARAM flag in
      an already existing field"

* tag 'hwparam-20170420' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (38 commits)
  Annotate hardware config module parameters in sound/pci/
  Annotate hardware config module parameters in sound/oss/
  Annotate hardware config module parameters in sound/isa/
  Annotate hardware config module parameters in sound/drivers/
  Annotate hardware config module parameters in fs/pstore/
  Annotate hardware config module parameters in drivers/watchdog/
  Annotate hardware config module parameters in drivers/video/
  Annotate hardware config module parameters in drivers/tty/
  Annotate hardware config module parameters in drivers/staging/vme/
  Annotate hardware config module parameters in drivers/staging/speakup/
  Annotate hardware config module parameters in drivers/staging/media/
  Annotate hardware config module parameters in drivers/scsi/
  Annotate hardware config module parameters in drivers/pcmcia/
  Annotate hardware config module parameters in drivers/pci/hotplug/
  Annotate hardware config module parameters in drivers/parport/
  Annotate hardware config module parameters in drivers/net/wireless/
  Annotate hardware config module parameters in drivers/net/wan/
  Annotate hardware config module parameters in drivers/net/irda/
  Annotate hardware config module parameters in drivers/net/hamradio/
  Annotate hardware config module parameters in drivers/net/ethernet/
  ...
2017-05-10 19:13:03 -07:00
Linus Torvalds
c70422f760 Another RDMA update from Chuck Lever, and a bunch of miscellaneous
bugfixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZE2UeAAoJECebzXlCjuG+St8P/0vG+ps9sY012E6Wh9gy4Ev4
 BtxG/c3CtcxrbNzW+cFhdEloBGtC0VvcrKNCozJTK4LdaPYErkyRBpjgXvIggT9I
 GWY4ftpH3eJ6uByN9Okgc3/1la2poDflJO/nYhdRed3YHOnXTtx/746tu1xAnVCV
 tFtDGrbJZTprt5c3zETtdtquCUSy2aMT5ZbrdU3yWBCwQMNSIufN3an8epfB++xx
 Ct+G0HTRffcWAdYuLT0N1HKqm8pkncdNMFpm7mVw0hMCRy552G3fuj8LtkhVTvKE
 1KN3zXY4jhaUYWD5Yt6AJcpLEro65b8swYk4e9FP2TNUpCmuRdXT9cb9vE8YztxC
 8s4N23RHaEx9I6pC3OU64a2HfhiQM/oOIvjlhTBsjojXsQcqZFD1vsoSYA8Byl0w
 m9EQWqPqge4m6yEYl7uAyL6xSthbrhcU1Ks5jvNXGcWzEQj7BATnynJANsfZ+y6r
 ZoVcsRNX49m1BG+p9br+9DFffPiNFUMqxbfr73L9HRep3OsPeFKazFG0bKd3hOqA
 E6L/AnBd9soSqTuTvbisWrGWbomhtd5G/fAa1uHrWTPHMXUWCmkguiau51FNfcHu
 xcJlBBVCvUmmd5u3wF6QeiyjPs4KEBzQzsOUsWKHRxDBp6s+5PX/lHuXRBlDP+fN
 TQq0KbvBtea1OyMaRtoV
 =Rtl/
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Another RDMA update from Chuck Lever, and a bunch of miscellaneous
  bugfixes"

* tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linux: (26 commits)
  nfsd: Fix up the "supattr_exclcreat" attributes
  nfsd: encoders mustn't use unitialized values in error cases
  nfsd: fix undefined behavior in nfsd4_layout_verify
  lockd: fix lockd shutdown race
  NFSv4: Fix callback server shutdown
  SUNRPC: Refactor svc_set_num_threads()
  NFSv4.x/callback: Create the callback service through svc_create_pooled
  lockd: remove redundant check on block
  svcrdma: Clean out old XDR encoders
  svcrdma: Remove the req_map cache
  svcrdma: Remove unused RDMA Write completion handler
  svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt
  svcrdma: Clean up RPC-over-RDMA backchannel reply processing
  svcrdma: Report Write/Reply chunk overruns
  svcrdma: Clean up RDMA_ERROR path
  svcrdma: Use rdma_rw API in RPC reply path
  svcrdma: Introduce local rdma_rw API helpers
  svcrdma: Clean up svc_rdma_get_inv_rkey()
  svcrdma: Add helper to save pages under I/O
  svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
  ...
2017-05-10 13:29:23 -07:00
Linus Torvalds
73ccb023a2 NFS client updates for Linux 4.12
Highlights include:
 
 Stable bugfixes:
 - Fix use after free in write error path
 - Use GFP_NOIO for two allocations in writeback
 - Fix a hang in OPEN related to server reboot
 - Check the result of nfs4_pnfs_ds_connect
 - Fix an rcu lock leak
 
 Features:
 - Removal of the unmaintained and unused OSD pNFS layout
 - Cleanup and removal of lots of unnecessary dprintk()s
 - Cleanup and removal of some memory failure paths now that
   GFP_NOFS is guaranteed to never fail.
 - Remove the v3-only data server limitation on pNFS/flexfiles
 
 Bugfixes:
 - RPC/RDMA connection handling bugfixes
 - Copy offload: fixes to ensure the copied data is COMMITed to disk.
 - Readdir: switch back to using the ->iterate VFS interface
 - File locking fixes from Ben Coddington
 - Various use-after-free and deadlock issues in pNFS
 - Write path bugfixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIbBAABAgAGBQJZE0KiAAoJEGcL54qWCgDy/moP93wZ+cGnN5sC+nsirqj4eUiM
 BQKKonweNQIoYRwp5B9jLTxsUMIxRasV5W3BbqEm4PUtBYXfqQ7SfLv7RboKbd4M
 RJB9PS+sjx3Fxf65mhveKziwUFLvQCQ3+we0TpUga6+7SBiGlgPKBfisk7frC0nt
 BbYBuGaWXMPxO0BnR8adNwqiGINPDSzB+8sgjiT8zkZLm4lrew2eV7TDvwVOguD+
 S2vLPGhg1F9wu8aG731MgiSNaeCgsBP6I5D29fTTD7z1DCNMQXOoHcX8k4KwwIDB
 sHRR0tVBsg+1B7WdH4y41GQ03rn3o2DHeJB5cdYGaEu4lx7CecCzt0o0dfAkNizT
 5LxbQxIHPNYMeZmP2T0oD41zQyfjKqrdRSPnXi3dPD98NwaM1Lqv+Kzb/eXzupXp
 vJ7859PQCa3KjQ1IFhwdXTmh53J1c8SzEDpzz7WX0R0saRyxeIJsm30MmdPqKu7Z
 notjsXxrTmjIhC+0vFLey1kejFDh+b0gT6UIwoMdx39VL9AM6DVL7HsrU1kEwCdf
 f8otaLcm0WoUaseF+cMtfRNGEqCMxPywwz7mEKlGiVZgyAM8VfzH+s5j6/u6ncwS
 ASwRclwwPAZN97rzl0exZxuaRwFZd7oFT1zrviPWvv+0SUPuy258J6QpolUSavgi
 Qh7f3QR65K+QX9QbO1g=
 =7Nm2
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable bugfixes:
   - Fix use after free in write error path
   - Use GFP_NOIO for two allocations in writeback
   - Fix a hang in OPEN related to server reboot
   - Check the result of nfs4_pnfs_ds_connect
   - Fix an rcu lock leak

  Features:
   - Removal of the unmaintained and unused OSD pNFS layout
   - Cleanup and removal of lots of unnecessary dprintk()s
   - Cleanup and removal of some memory failure paths now that GFP_NOFS
     is guaranteed to never fail.
   - Remove the v3-only data server limitation on pNFS/flexfiles

  Bugfixes:
   - RPC/RDMA connection handling bugfixes
   - Copy offload: fixes to ensure the copied data is COMMITed to disk.
   - Readdir: switch back to using the ->iterate VFS interface
   - File locking fixes from Ben Coddington
   - Various use-after-free and deadlock issues in pNFS
   - Write path bugfixes"

* tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (89 commits)
  pNFS/flexfiles: Always attempt to call layoutstats when flexfiles is enabled
  NFSv4.1: Work around a Linux server bug...
  NFS append COMMIT after synchronous COPY
  NFSv4: Fix exclusive create attributes encoding
  NFSv4: Fix an rcu lock leak
  nfs: use kmap/kunmap directly
  NFS: always treat the invocation of nfs_getattr as cache hit when noac is on
  Fix nfs_client refcounting if kmalloc fails in nfs4_proc_exchange_id and nfs4_proc_async_renew
  NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION
  pNFS: Fix NULL dereference in pnfs_generic_alloc_ds_commits
  pNFS: Fix a typo in pnfs_generic_alloc_ds_commits
  pNFS: Fix a deadlock when coalescing writes and returning the layout
  pNFS: Don't clear the layout return info if there are segments to return
  pNFS: Ensure we commit the layout if it has been invalidated
  pNFS: Don't send COMMITs to the DSes if the server invalidated our layout
  pNFS/flexfiles: Fix up the ff_layout_write_pagelist failure path
  pNFS: Ensure we check layout validity before marking it for return
  NFS4.1 handle interrupted slot reuse from ERR_DELAY
  NFSv4: check return value of xdr_inline_decode
  nfs/filelayout: fix NULL pointer dereference in fl_pnfs_update_layout()
  ...
2017-05-10 13:03:38 -07:00
Trond Myklebust
b26b78cb72 nfsd: Fix up the "supattr_exclcreat" attributes
If an NFSv4 client asks us for the supattr_exclcreat, then we must
not return attributes that are unsupported by this minor version.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: 75976de655 ("NFSD: Return word2 bitmask if setting security..,")
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-05-10 14:30:10 -04:00
J. Bruce Fields
f961e3f2ac nfsd: encoders mustn't use unitialized values in error cases
In error cases, lgp->lg_layout_type may be out of bounds; so we
shouldn't be using it until after the check of nfserr.

This was seen to crash nfsd threads when the server receives a LAYOUTGET
request with a large layout type.

GETDEVICEINFO has the same problem.

Reported-by: Ari Kauppi <Ari.Kauppi@synopsys.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-05-10 14:25:19 -04:00
Linus Torvalds
de4d195308 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes are:

   - Debloat RCU headers

   - Parallelize SRCU callback handling (plus overlapping patches)

   - Improve the performance of Tree SRCU on a CPU-hotplug stress test

   - Documentation updates

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
  rcu: Open-code the rcu_cblist_n_lazy_cbs() function
  rcu: Open-code the rcu_cblist_n_cbs() function
  rcu: Open-code the rcu_cblist_empty() function
  rcu: Separately compile large rcu_segcblist functions
  srcu: Debloat the <linux/rcu_segcblist.h> header
  srcu: Adjust default auto-expediting holdoff
  srcu: Specify auto-expedite holdoff time
  srcu: Expedite first synchronize_srcu() when idle
  srcu: Expedited grace periods with reduced memory contention
  srcu: Make rcutorture writer stalls print SRCU GP state
  srcu: Exact tracking of srcu_data structures containing callbacks
  srcu: Make SRCU be built by default
  srcu: Fix Kconfig botch when SRCU not selected
  rcu: Make non-preemptive schedule be Tasks RCU quiescent state
  srcu: Expedite srcu_schedule_cbs_snp() callback invocation
  srcu: Parallelize callback handling
  kvm: Move srcu_struct fields to end of struct kvm
  rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
  rcu: Use true/false in assignment to bool
  rcu: Use bool value directly
  ...
2017-05-10 10:30:46 -07:00
Linus Torvalds
b948abf53a Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs update from Miklos Szeredi:
 "The biggest part of this is making st_dev/st_ino on the overlay behave
  like a normal filesystem (i.e. st_ino doesn't change on copy up,
  st_dev is the same for all files and directories). Currently this only
  works if all layers are on the same filesystem, but future work will
  move the general case towards more sane behavior.

  There are also miscellaneous fixes, including fixes to handling
  append-only files. There's a small change in the VFS, but that only
  has an effect on overlayfs, since otherwise file->f_path.dentry->inode
  and file_inode(file) are always the same"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: update documentation w.r.t. constant inode numbers
  ovl: persistent inode numbers for upper hardlinks
  ovl: merge getattr for dir and nondir
  ovl: constant st_ino/st_dev across copy up
  ovl: persistent inode number for directories
  ovl: set the ORIGIN type flag
  ovl: lookup non-dir copy-up-origin by file handle
  ovl: use an auxiliary var for overlay root entry
  ovl: store file handle of lower inode on copy up
  ovl: check if all layers are on the same fs
  ovl: do not set overlay.opaque on non-dir create
  ovl: check IS_APPEND() on real upper inode
  vfs: ftruncate check IS_APPEND() on real upper inode
  ovl: Use designated initializers
  ovl: lockdep annotate of nested stacked overlayfs inode lock
2017-05-10 09:03:48 -07:00
Linus Torvalds
a2e5ad45a9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
 "Support for pid namespaces from Seth and refcount_t work from Elena"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: Add support for pid namespaces
  fuse: convert fuse_conn.count from atomic_t to refcount_t
  fuse: convert fuse_req.count from atomic_t to refcount_t
  fuse: convert fuse_file.count from atomic_t to refcount_t
2017-05-10 08:45:30 -07:00
Linus Torvalds
26c5eaa132 The two main items are support for disabling automatic rbd exclusive
lock transfers from myself and the long awaited -ENOSPC handling series
 from Jeff.  The former will allow rbd users to take advantage of
 exclusive lock's built-in blacklist/break-lock functionality while
 staying in control of who owns the lock.  With the latter in place, we
 will abort filesystem writes on -ENOSPC instead of having them block
 indefinitely.
 
 Beyond that we've got the usual pile of filesystem fixes from Zheng,
 some refcount_t conversion patches from Elena and a patch for an
 ancient open() flags handling bug from Alexander.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZEt/kAAoJEEp/3jgCEfOLpzAIAIld0N06DuHKG2F9mHEnLeGl
 Y60BZ3Ajo32i9qPT/u9ntI99ZMlkuHcNWg6WpCCh8umbwk2eiAKRP/KcfGcWmmp9
 EHj9COCmBR9TRM1pNS1lSMzljDnxf9sQmbIO9cwMQBUya5g19O0OpApzxF1YQhCR
 V9B/FYV5IXELC3b/NH45oeDAD9oy/WgwbhQ2feTBQJmzIVJx+Je9hdhR1PH1rI06
 ysyg3VujnUi/hoDhvPTBznNOxnHx/HQEecHH8b01MkbaCgxPH88jsUK/h7PYF3Gh
 DE/sCN69HXeu1D/al3zKoZdahsJ5GWkj9Q+vvBoQJm+ZPsndC+qpgSj761n9v38=
 =vamy
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The two main items are support for disabling automatic rbd exclusive
  lock transfers from myself and the long awaited -ENOSPC handling
  series from Jeff.

  The former will allow rbd users to take advantage of exclusive lock's
  built-in blacklist/break-lock functionality while staying in control
  of who owns the lock. With the latter in place, we will abort
  filesystem writes on -ENOSPC instead of having them block
  indefinitely.

  Beyond that we've got the usual pile of filesystem fixes from Zheng,
  some refcount_t conversion patches from Elena and a patch for an
  ancient open() flags handling bug from Alexander"

* tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client: (31 commits)
  ceph: fix memory leak in __ceph_setxattr()
  ceph: fix file open flags on ppc64
  ceph: choose readdir frag based on previous readdir reply
  rbd: exclusive map option
  rbd: return ResponseMessage result from rbd_handle_request_lock()
  rbd: kill rbd_is_lock_supported()
  rbd: support updating the lock cookie without releasing the lock
  rbd: store lock cookie
  rbd: ignore unlock errors
  rbd: fix error handling around rbd_init_disk()
  rbd: move rbd_unregister_watch() call into rbd_dev_image_release()
  rbd: move rbd_dev_destroy() call out of rbd_dev_image_release()
  ceph: when seeing write errors on an inode, switch to sync writes
  Revert "ceph: SetPageError() for writeback pages if writepages fails"
  ceph: handle epoch barriers in cap messages
  libceph: add an epoch_barrier field to struct ceph_osd_client
  libceph: abort already submitted but abortable requests when map or pool goes full
  libceph: allow requests to return immediately on full conditions if caller wishes
  libceph: remove req->r_replay_version
  ceph: make seeky readdir more efficient
  ...
2017-05-10 08:42:33 -07:00
Linus Torvalds
1176032cb1 Merge branch 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has fixes and cleanups Dave Sterba collected for the merge
  window.

  The biggest functional fixes are between btrfs raid5/6 and scrub, and
  raid5/6 and device replacement. Some of our pending qgroup fixes are
  included as well while I bash on the rest in testing.

  We also have the usual set of cleanups, including one that makes
  __btrfs_map_block() much more maintainable, and conversions from
  atomic_t to refcount_t"

* 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (71 commits)
  btrfs: fix the gfp_mask for the reada_zones radix tree
  Btrfs: fix reported number of inode blocks
  Btrfs: send, fix file hole not being preserved due to inline extent
  Btrfs: fix extent map leak during fallocate error path
  Btrfs: fix incorrect space accounting after failure to insert inline extent
  Btrfs: fix invalid attempt to free reserved space on failure to cow range
  btrfs: Handle delalloc error correctly to avoid ordered extent hang
  btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  btrfs: check if the device is flush capable
  btrfs: delete unused member nobarriers
  btrfs: scrub: Fix RAID56 recovery race condition
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: Use ktime_get_real_ts for root ctime
  Btrfs: handle only applicable errors returned by btrfs_get_extent
  btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option
  btrfs: use q which is already obtained from bdev_get_queue
  Btrfs: switch to div64_u64 if with a u64 divisor
  Btrfs: update scrub_parity to use u64 stripe_len
  Btrfs: enable repair during read for raid56 profile
  btrfs: use clear_page where appropriate
  ...
2017-05-10 08:33:17 -07:00
Steve French
de1892b887 Don't delay freeing mids when blocked on slow socket write of request
When processing responses, and in particular freeing mids (DeleteMidQEntry),
which is very important since it also frees the associated buffers (cifs_buf_release),
we can block a long time if (writes to) socket is slow due to low memory or networking
issues.

We can block in send (smb request) waiting for memory, and be blocked in processing
responess (which could free memory if we let it) - since they both grab the
server->srv_mutex.

In practice, in the DeleteMidQEntry case - there is no reason we need to
grab the srv_mutex so remove these around DeleteMidQEntry, and it allows
us to free memory faster.

Signed-off-by: Steve French <steve.french@primarydata.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-05-09 20:37:32 -05:00
Rabin Vincent
560d388950 CIFS: silence lockdep splat in cifs_relock_file()
cifs_relock_file() can perform a down_write() on the inode's lock_sem even
though it was already performed in cifs_strict_readv().  Lockdep complains
about this.  AFAICS, there is no problem here, and lockdep just needs to be
told that this nesting is OK.

 =============================================
 [ INFO: possible recursive locking detected ]
 4.11.0+ #20 Not tainted
 ---------------------------------------------
 cat/701 is trying to acquire lock:
  (&cifsi->lock_sem){++++.+}, at: cifs_reopen_file+0x7a7/0xc00

 but task is already holding lock:
  (&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&cifsi->lock_sem);
   lock(&cifsi->lock_sem);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 1 lock held by cat/701:
  #0:  (&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310

 stack backtrace:
 CPU: 0 PID: 701 Comm: cat Not tainted 4.11.0+ #20
 Call Trace:
  dump_stack+0x85/0xc2
  __lock_acquire+0x17dd/0x2260
  ? trace_hardirqs_on_thunk+0x1a/0x1c
  ? preempt_schedule_irq+0x6b/0x80
  lock_acquire+0xcc/0x260
  ? lock_acquire+0xcc/0x260
  ? cifs_reopen_file+0x7a7/0xc00
  down_read+0x2d/0x70
  ? cifs_reopen_file+0x7a7/0xc00
  cifs_reopen_file+0x7a7/0xc00
  ? printk+0x43/0x4b
  cifs_readpage_worker+0x327/0x8a0
  cifs_readpage+0x8c/0x2a0
  generic_file_read_iter+0x692/0xd00
  cifs_strict_readv+0x29f/0x310
  generic_file_splice_read+0x11c/0x1c0
  do_splice_to+0xa5/0xc0
  splice_direct_to_actor+0xfa/0x350
  ? generic_pipe_buf_nosteal+0x10/0x10
  do_splice_direct+0xb5/0xe0
  do_sendfile+0x278/0x3a0
  SyS_sendfile64+0xc4/0xe0
  entry_SYSCALL_64_fastpath+0x1f/0xbe

Signed-off-by: Rabin Vincent <rabinv@axis.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-09 20:36:02 -05:00
Ari Kauppi
b550a32e60 nfsd: fix undefined behavior in nfsd4_layout_verify
UBSAN: Undefined behaviour in fs/nfsd/nfs4proc.c:1262:34
  shift exponent 128 is too large for 32-bit type 'int'

Depending on compiler+architecture, this may cause the check for
layout_type to succeed for overly large values (which seems to be the
case with amd64). The large value will be later used in de-referencing
nfsd4_layout_ops for function pointers.

Reported-by: Jani Tuovila <tuovila@synopsys.com>
Signed-off-by: Ari Kauppi <ari@synopsys.com>
[colin.king@canonical.com: use LAYOUT_TYPE_MAX instead of 32]
Cc: stable@vger.kernel.org
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-05-09 17:09:18 -04:00
Trond Myklebust
76b2a30338 pNFS/flexfiles: Always attempt to call layoutstats when flexfiles is enabled
Layoutstats is always desirable when using the flexfiles driver, so
we should enable it if that driver is being loaded. It is safe to do
so, because even when the mount specifies NFSv4.1, we will turn it
off if the server tells us it is unsupported.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-09 16:02:57 -04:00
Trond Myklebust
f4b23de3dd NFSv4.1: Work around a Linux server bug...
It turns out the Linux server has a bug in its implementation of
supattr_exclcreat; it returns the set of all attributes, whether
or not they are supported by minor version 1.
In order to avoid a regression, we therefore apply the supported_attrs
as a mask on top of whatever the server sent us.

Reported-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-09 15:52:15 -04:00
Linus Torvalds
11fbf53d66 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted bits and pieces from various people. No common topic in this
  pile, sorry"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/affs: add rename exchange
  fs/affs: add rename2 to prepare multiple methods
  Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()
  fs: don't set *REFERENCED on single use objects
  fs: compat: Remove warning from COMPATIBLE_IOCTL
  remove pointless extern of atime_need_update_rcu()
  fs: completely ignore unknown open flags
  fs: add a VALID_OPEN_FLAGS
  fs: remove _submit_bh()
  fs: constify tree_descr arrays passed to simple_fill_super()
  fs: drop duplicate header percpu-rwsem.h
  fs/affs: bugfix: Write files greater than page size on OFS
  fs/affs: bugfix: enable writes on OFS disks
  fs/affs: remove node generation check
  fs/affs: import amigaffs.h
  fs/affs: bugfix: make symbolic links work again
2017-05-09 09:12:53 -07:00
Linus Torvalds
8ee74a91ac proc: try to remove use of FOLL_FORCE entirely
We fixed the bugs in it, but it's still an ugly interface, so let's see
if anybody actually depends on it.  It's entirely possible that nothing
actually requires the whole "punch through read-only mappings"
semantics.

For example, gdb definitely uses the /proc/<pid>/mem interface, but it
looks like it mainly does it for regular reads of the target (that don't
need FOLL_FORCE), and looking at the gdb source code seems to fall back
on the traditional ptrace(PTRACE_POKEDATA) interface if it needs to.

If this breaks something, I do have a (more complex) version that only
enables FOLL_FORCE when somebody has PTRACE_ATTACH'ed to the target,
like the comment here used to say ("Maybe we should limit FOLL_FORCE to
actual ptrace users?").

Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-09 08:45:16 -07:00
Linus Torvalds
bf5f89463f Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - the rest of MM

 - various misc things

 - procfs updates

 - lib/ updates

 - checkpatch updates

 - kdump/kexec updates

 - add kvmalloc helpers, use them

 - time helper updates for Y2038 issues. We're almost ready to remove
   current_fs_time() but that awaits a btrfs merge.

 - add tracepoints to DAX

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
  drivers/staging/ccree/ssi_hash.c: fix build with gcc-4.4.4
  selftests/vm: add a test for virtual address range mapping
  dax: add tracepoint to dax_insert_mapping()
  dax: add tracepoint to dax_writeback_one()
  dax: add tracepoints to dax_writeback_mapping_range()
  dax: add tracepoints to dax_load_hole()
  dax: add tracepoints to dax_pfn_mkwrite()
  dax: add tracepoints to dax_iomap_pte_fault()
  mtd: nand: nandsim: convert to memalloc_noreclaim_*()
  treewide: convert PF_MEMALLOC manipulations to new helpers
  mm: introduce memalloc_noreclaim_{save,restore}
  mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
  mm/huge_memory.c: deposit a pgtable for DAX PMD faults when required
  mm/huge_memory.c: use zap_deposited_table() more
  time: delete CURRENT_TIME_SEC and CURRENT_TIME
  gfs2: replace CURRENT_TIME with current_time
  apparmorfs: replace CURRENT_TIME with current_time()
  lustre: replace CURRENT_TIME macro
  fs: ubifs: replace CURRENT_TIME_SEC with current_time
  fs: ufs: use ktime_get_real_ts64() for birthtime
  ...
2017-05-08 18:17:56 -07:00
Ross Zwisler
b444073458 dax: add tracepoint to dax_insert_mapping()
Add a tracepoint to dax_insert_mapping(), following the same logging
conventions as the rest of DAX.  This tracepoint, along with the one in
dax_load_hole(), lets us know how a DAX PTE fault was serviced.

Here is an example DAX fault that inserts a PTE mapping:

  small-1126  [007] ....
   145.451604: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220

  small-1126  [007] ....
   145.452317: dax_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10420000 radix_entry 0x100006

  small-1126  [007] ....
   145.452399: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE

Link: http://lkml.kernel.org/r/20170221195116.13278-7-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:16 -07:00
Ross Zwisler
f9bc3a0753 dax: add tracepoint to dax_writeback_one()
Add a tracepoint to dax_writeback_one(), following the same logging
conventions as the rest of DAX.

Here is an example range writeback which ends up flushing one PMD and
one PTE:

  test-1265  [003] ....
   496.615250: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff

  test-1265  [003] ....
   496.616263: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x0 pglen 0x200

  test-1265  [003] ....
   496.616270: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x305 pglen 0x1

  test-1265  [003] ....
   496.616272: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff

[akpm@linux-foundation.org: struct blk_dax_ctl has disappeared]
Link: http://lkml.kernel.org/r/20170221195116.13278-6-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:16 -07:00
Ross Zwisler
d14a3f48a1 dax: add tracepoints to dax_writeback_mapping_range()
Add tracepoints to dax_writeback_mapping_range(), following the same
logging conventions as the rest of DAX.

Here is an example writeback call:

  msync-1085  [006] ....
   200.902565: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff

  msync-1085  [006] ....
   200.902579: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff

[ross.zwisler@linux.intel.com: fix regression in dax_writeback_mapping_range()]
  Link: http://lkml.kernel.org/r/20170314215358.31451-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20170221195116.13278-5-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:16 -07:00
Ross Zwisler
678c9fd043 dax: add tracepoints to dax_load_hole()
Add tracepoints to dax_load_hole(), following the same logging conventions
as the rest of DAX.

Here is the logging generated by a PTE read from a hole:

  read-1075  [002] ....
    62.362108: dax_pte_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280

  read-1075  [002] ....
    62.362140: dax_load_hole: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE

  read-1075  [002] ....
    62.362141: dax_pte_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE

Link: http://lkml.kernel.org/r/20170221195116.13278-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:16 -07:00
Ross Zwisler
c3ff68d7d1 dax: add tracepoints to dax_pfn_mkwrite()
Add tracepoints to dax_pfn_mkwrite(), following the same logging
conventions as the rest of DAX.

Here is an example PTE fault followed by a pfn_mkwrite:

  small_aligned-1094  [002] ....
   374.084998: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200

  small_aligned-1094  [002] ....
   374.085145: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 MAJOR|NOPAGE

  small_aligned-1094  [002] ....
   374.085165: dax_pfn_mkwrite: dev 259:0 ino 0x1003 shared WRITE|MKWRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 NOPAGE

Link: http://lkml.kernel.org/r/20170221195116.13278-3-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Ross Zwisler
a9c42b33ed dax: add tracepoints to dax_iomap_pte_fault()
Patch series "second round of tracepoints for DAX".

This second round of DAX tracepoint patches adds tracing to the PTE
fault path (dax_iomap_pte_fault(), dax_pfn_mkwrite(), dax_load_hole(),
dax_insert_mapping()) and to the writeback path
(dax_writeback_mapping_range(), dax_writeback_one()).

The purpose of this tracing is to give us a high level view of what DAX
is doing, whether faults are being serviced by PMDs or PTEs, and by real
storage or by zero pages covering holes.

I do have some patches nearly ready which also add tracing to
grab_mapping_entry() and dax_insert_mapping_entry().  These are more
targeted at logging how we are interacting with the radix tree, how we
use empty entries for locking, whether we "downgrade" huge zero pages to
4k PTE sized allocations, etc.  In the end it seemed to me that this
might be too detailed to have as constantly present tracepoints, but if
anyone sees value in having tracepoints like this in the DAX code
permanently (Jan?), please let me know and I'll add those last two
patches.

All these tracepoints were done to be consistent with the style of the
XFS tracepoints and with the existing DAX PMD tracepoints.

This patch (of 6):

Add tracepoints to dax_iomap_pte_fault(), following the same logging
conventions as the rest of DAX.

Here is an example fault that initially tries to be serviced by the PMD
fault handler but which falls back to PTEs because the VMA isn't large
enough to hold a PMD:

  small-1086  [005] ....
   71.140014: xfs_filemap_huge_fault: dev 259:0 ino 0x1003

  small-1086  [005] ....
    71.140027: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400

  small-1086  [005] ....
    71.140028: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400 FALLBACK

  small-1086  [005] ....
    71.140035: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220

  small-1086  [005] ....
    71.140396: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE

Link: http://lkml.kernel.org/r/20170221195116.13278-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Stephen Rothwell
b32c8c7648 gfs2: replace CURRENT_TIME with current_time
Link: http://lkml.kernel.org/r/20170420161852.0492bc3f@canb.auug.org.au
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Deepa Dinamani
607a11ad94 fs: ubifs: replace CURRENT_TIME_SEC with current_time
CURRENT_TIME_SEC is not y2038 safe.  current_time() will be transitioned
to use 64 bit time along with vfs in a separate patch.  There is no plan
to transition CURRENT_TIME_SEC to use y2038 safe time interfaces.

current_time() returns timestamps according to the granularities set in
the inode's super_block.  The granularity check to call
current_fs_time() or CURRENT_TIME_SEC is not required.

Use current_time() directly to update inode timestamp.  Use
timespec_trunc during file system creation, before the first inode is
created.

Link: http://lkml.kernel.org/r/1491613030-11599-9-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Deepa Dinamani
a88e99e976 fs: ufs: use ktime_get_real_ts64() for birthtime
CURRENT_TIME is not y2038 safe.  Replace it with ktime_get_real_ts64().
Inode time formats are already 64 bit long and accommodates time64_t.

Link: http://lkml.kernel.org/r/1491613030-11599-6-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Deepa Dinamani
1134e09100 fs: ceph: CURRENT_TIME with ktime_get_real_ts()
CURRENT_TIME is not y2038 safe.  The macro will be deleted and all the
references to it will be replaced by ktime_get_* apis.

struct timespec is also not y2038 safe.  Retain timespec for timestamp
representation here as ceph uses it internally everywhere.  These
references will be changed to use struct timespec64 in a separate patch.

The current_fs_time() api is being changed to use vfs struct inode* as
an argument instead of struct super_block*.

Set the new mds client request r_stamp field using ktime_get_real_ts()
instead of using current_fs_time().

Also, since r_stamp is used as mtime on the server, use timespec_trunc()
to truncate the timestamp, using the right granularity from the
superblock.

This api will be transitioned to be y2038 safe along with vfs.

Link: http://lkml.kernel.org/r/1491613030-11599-5-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
M:	Ilya Dryomov <idryomov@gmail.com>
M:	"Yan, Zheng" <zyan@redhat.com>
M:	Sage Weil <sage@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Deepa Dinamani
e37fea58f7 fs: cifs: replace CURRENT_TIME by other appropriate apis
CURRENT_TIME macro is not y2038 safe on 32 bit systems.

The patch replaces all the uses of CURRENT_TIME by current_time() for
filesystem times, and ktime_get_* functions for authentication
timestamps and timezone calculations.

This is also in preparation for the patch that transitions vfs
timestamps to use 64 bit time and hence make them y2038 safe.

CURRENT_TIME macro will be deleted before merging the aforementioned
change.

The inode timestamps read from the server are assumed to have correct
granularity and range.

The patch also assumes that the difference between server and client
times lie in the range INT_MIN..INT_MAX.  This is valid because this is
the difference between current times between server and client, and the
largest timezone difference is in the range of one day.

All cifs timestamps currently use timespec representation internally.
Authentication and timezone timestamps can also be transitioned into
using timespec64 when all other timestamps for cifs is transitioned to
use timespec64.

Link: http://lkml.kernel.org/r/1491613030-11599-4-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Deepa Dinamani
48fbfe50f1 fs: f2fs: use ktime_get_real_seconds for sit_info times
CURRENT_TIME_SEC is not y2038 safe.

Replace use of CURRENT_TIME_SEC with ktime_get_real_seconds in segment
timestamps used by GC algorithm including the segment mtime timestamps.

Link: http://lkml.kernel.org/r/1491613030-11599-2-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Tetsuo Handa
c718a97514 fs: semove set but not checked AOP_FLAG_UNINTERRUPTIBLE flag
Commit afddba49d1 ("fs: introduce write_begin, write_end, and
perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was
checked in pagecache_write_begin(), but that check was removed by
4e02ed4b4a ("fs: remove prepare_write/commit_write").

Between these two commits, commit d9414774dc ("cifs: Convert cifs to
new aops.") added a check in cifs_write_begin(), but that check was soon
removed by commit a98ee8c1c7 ("[CIFS] fix regression in
cifs_write_begin/cifs_write_end").

Therefore, AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere.  Let's
remove this flag.  This patch has no functionality changes.

Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:14 -07:00
Masahiro Yamada
6e7c2b4dd3 scripts/spelling.txt: add "intialise(d)" pattern and fix typo instances
Fix typos and add the following to the scripts/spelling.txt:

  intialisation||initialisation
  intialised||initialised
  intialise||initialise

This commit does not intend to change the British spelling itself.

Link: http://lkml.kernel.org/r/1481573103-11329-18-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Michal Hocko
19809c2da2 mm, vmalloc: use __GFP_HIGHMEM implicitly
__vmalloc* allows users to provide gfp flags for the underlying
allocation.  This API is quite popular

  $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
  77

The only problem is that many people are not aware that they really want
to give __GFP_HIGHMEM along with other flags because there is really no
reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
which are mapped to the kernel vmalloc space.  About half of users don't
use this flag, though.  This signals that we make the API unnecessarily
too complex.

This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
are simplified and drop the flag.

Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Cristopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Michal Hocko
752ade68cb treewide: use kv[mz]alloc* rather than opencoded variants
There are many code paths opencoding kvmalloc.  Let's use the helper
instead.  The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator.  E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation.  This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fallback to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously.  There is no guarantee something like that happens
though.

This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.

Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Michal Hocko
81be3dee96 fs/xattr.c: zero out memory copied to userspace in getxattr
getxattr uses vmalloc to allocate memory if kzalloc fails.  This is
filled by vfs_getxattr and then copied to the userspace.  vmalloc,
however, doesn't zero out the memory so if the specific implementation
of the xattr handler is sloppy we can theoretically expose a kernel
memory.  There is no real sign this is really the case but let's make
sure this will not happen and use vzalloc instead.

Fixes: 779302e678 ("fs/xattr.c:getxattr(): improve handling of allocation failures")
Link: http://lkml.kernel.org/r/20170306103327.2766-1-mhocko@kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>	[3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Michal Hocko
a7c3e901a4 mm: introduce kv[mz]alloc helpers
Patch series "kvmalloc", v5.

There are many open coded kmalloc with vmalloc fallback instances in the
tree.  Most of them are not careful enough or simply do not care about
the underlying semantic of the kmalloc/page allocator which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds b) the page allocator can
invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers, I could find, have been converted to use the helper
instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.

[1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

This patch (of 9):

Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code.  Yet we do not have any common helper
for that and so users have invented their own helpers.  Some of them are
really creative when doing so.  Let's just add kv[mz]alloc and make sure
it is implemented properly.  This implementation makes sure to not make
a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
to not warn about allocation failures.  This also rules out the OOM
killer as the vmalloc is a more approapriate fallback than a disruptive
user visible action.

This patch also changes some existing users and removes helpers which
are specific for them.  In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL).  Those need to be
fixed separately.

While we are at it, document that __vmalloc{_node} about unsupported gfp
mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers.  Existing ones would have to die
slowly.

[sfr@canb.auug.org.au: f2fs fixup]
  Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>	[ext4 part]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Kirill Tkhai
eaa0d190bf pidns: expose task pid_ns_for_children to userspace
pid_ns_for_children set by a task is known only to the task itself, and
it's impossible to identify it from outside.

It's a big problem for checkpoint/restore software like CRIU, because it
can't correctly handle tasks, that do setns(CLONE_NEWPID) in proccess of
their work.

This patch solves the problem, and it exposes pid_ns_for_children to ns
directory in standard way with the name "pid_for_children":

  ~# ls /proc/5531/ns -l | grep pid
  lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid -> pid:[4026531836]
  lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid_for_children -> pid:[4026532286]

Link: http://lkml.kernel.org/r/149201123914.6007.2187327078064239572.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Kirill Tkhai
25b14e92af ns: allow ns_entries to have custom symlink content
Patch series "Expose task pid_ns_for_children to userspace".

pid_ns_for_children set by a task is known only to the task itself, and
it's impossible to identify it from outside.

It's a big problem for checkpoint/restore software like CRIU, because it
can't correctly handle tasks, that do setns(CLONE_NEWPID) in proccess of
their work.  If they have a custom pid_ns_for_children before dump, they
must have the same ns after restore.  Otherwise, restored task bumped
into enviroment it does not expect.

This patchset solves the problem.  It exposes pid_ns_for_children to ns
directory in standard way with the name "pid_for_children":

  ~# ls /proc/5531/ns -l | grep pid
  lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid -> pid:[4026531836]
  lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid_for_children -> pid:[4026532286]

This patch (of 2):

Make possible to have link content prefix yyy different from the link
name xxx:

  $ readlink /proc/[pid]/ns/xxx
  yyy:[4026531838]

This will be used in next patch.

Link: http://lkml.kernel.org/r/149201120318.6007.7362655181033883000.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Kees Cook
7fe6a42e87 reiserfs: use designated initializers
Prepare to mark sensitive kernel structures for randomization by making
sure they're using designated initializers.  These were identified
during allyesconfig builds of x86, arm, and arm64, with most initializer
fixes extracted from grsecurity.

Link: http://lkml.kernel.org/r/20170329210419.GA40066@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:11 -07:00
Tobin C. Harding
f245e1c17a fs/proc/inode.c: remove cast from memory allocation
Coccinelle emits this warning:

  WARNING: casting value returned by memory allocation function to (struct proc_inode *) is useless.

Remove unnecessary cast.

Link: http://lkml.kernel.org/r/1487745720-16967-1-git-send-email-me@tobin.cc
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:10 -07:00
Olga Kornievskaia
e092693443 NFS append COMMIT after synchronous COPY
Instead of messing with the commit path which has been causing issues,
add a COMMIT op after the COPY and ask for stable copies in the first
space.

It saves a round trip, since after the COPY, the client sends a COMMIT
anyway.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-08 19:01:06 -04:00
J. Bruce Fields
efda760fe9 lockd: fix lockd shutdown race
As reported by David Jeffery: "a signal was sent to lockd while lockd
was shutting down from a request to stop nfs.  The signal causes lockd
to call restart_grace() which puts the lockd_net structure on the grace
list.  If this signal is received at the wrong time, it will occur after
lockd_down_net() has called locks_end_grace() but before
lockd_down_net() stops the lockd thread.  This leads to lockd putting
the lockd_net structure back on the grace list, then exiting without
anything removing it from the list."

So, perform the final locks_end_grace() from the the lockd thread; this
ensures it's serialized with respect to restart_grace().

Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-05-08 18:06:18 -04:00
Linus Torvalds
70ef8f0d37 for-f2fs-4.12
In this round, we've focused on enhancing performance with regards to block
 allocation, GC, and discard/in-place-update IO controls. There are a bunch
 of clean-ups as well as minor bug fixes.
 
 = Enhancement
 - disable heap-based allocation by default
 - issue small-sized discard commands by default
 - change the policy of data hotness for logging
 - distinguish IOs in terms of size and wbc type
 - start SSR earlier to avoid foreground GC
 - enhance data structures managing discard commands
 - enhance in-place update flow
 - add some more fault injection routines
 - secure one more xattr entry
 
 = Bug fix
 - calculate victim cost for GC correctly
 - remain correct victim segment number for GC
 - race condition in nid allocator and initializer
 - stale pointer produced by atomic_writes
 - fix missing REQ_SYNC for flush commands
 - handle missing errors in more corner cases
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJZEKXrAAoJEEAUqH6CSFDSJJ8P/1Zy0NS9TM/PFtT7Sevb6vgC
 LcKLtX1bVhUuX9wAt5Q6BZ9927tCQPt5vLEYUxtniqEQaC0fsJAMbRYot+gR/dvN
 4bGgv1TeVST5pKbmctzhAL30PvZ1w4QS6dLvPMm2sPQSrPKGUGt0J8wPiHHZuvH4
 pygKzDxbrIJTeMhLm9tgFg7dWTJXV3VDb57WpA1AM1LAFVsIPF4vZnryLv3GsRmY
 eGRxgZEtt/90hCRbEcPirPZrtpv/O5f12K4Vp/NPw+4XGMEk+nTYndq6rlUWVNjg
 iPEDuxONyk/yb274SqB6sbNDuxHOqn7stGJepdUpSbprIsLZ0RmMaYWjSNsLU3Vh
 p4fAzRqvfSqAHCt0FEL/vT8M9ST5xQRVr9P/l0kDK5Ww95RROd05bEaGm/sKc7NB
 PHiWUoMIFFmuVsoCi6sM0AKps53ZGON8GEUyVKyM7NWTw1oWLPWifGMthEkysmwm
 08SdU5+XqbCeyMPAA2GURqMA5A8ssuA8+F0Citf4JPckQHPPj5pAydmx2wVlfBlc
 /bneR7T/8OsUbxgG8JSbdHUiPcjb20F0GTxSOTXiV/AaZAMCtyETnw64K2V6E0n7
 uraKcYYhypyphCj/IYc4vnQ3dCu3U2/NvTYEVX8DBvboN38/JVqmNWgQx9g+tLzj
 +r5s7PqTDuXv5Cfzc5NC
 =SBUb
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've focused on enhancing performance with regards to
  block allocation, GC, and discard/in-place-update IO controls. There
  are a bunch of clean-ups as well as minor bug fixes.

  Enhancements:
   - disable heap-based allocation by default
   - issue small-sized discard commands by default
   - change the policy of data hotness for logging
   - distinguish IOs in terms of size and wbc type
   - start SSR earlier to avoid foreground GC
   - enhance data structures managing discard commands
   - enhance in-place update flow
   - add some more fault injection routines
   - secure one more xattr entry

  Bug fixes:
   - calculate victim cost for GC correctly
   - remain correct victim segment number for GC
   - race condition in nid allocator and initializer
   - stale pointer produced by atomic_writes
   - fix missing REQ_SYNC for flush commands
   - handle missing errors in more corner cases"

* tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (111 commits)
  f2fs: fix a mount fail for wrong next_scan_nid
  f2fs: enhance scalability of trace macro
  f2fs: relocate inode_{,un}lock in F2FS_IOC_SETFLAGS
  f2fs: Make flush bios explicitely sync
  f2fs: show available_nids in f2fs/status
  f2fs: flush dirty nats periodically
  f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard
  f2fs: allow cpc->reason to indicate more than one reason
  f2fs: release cp and dnode lock before IPU
  f2fs: shrink size of struct discard_cmd
  f2fs: don't hold cmd_lock during waiting discard command
  f2fs: nullify fio->encrypted_page for each writes
  f2fs: sanity check segment count
  f2fs: introduce valid_ipu_blkaddr to clean up
  f2fs: lookup extent cache first under IPU scenario
  f2fs: reconstruct code to write a data page
  f2fs: introduce __wait_discard_cmd
  f2fs: introduce __issue_discard_cmd
  f2fs: enable small discard by default
  f2fs: delay awaking discard thread
  ...
2017-05-08 12:24:17 -07:00
Rock Lee
798868c021 ubifs: Fix a typo in comment of ioctl2ubifs & ubifs2ioctl
Change 'convert' to 'converts'
Change 'UBIFS' to 'UBIFS inode flags'

Signed-off-by: Rock Lee <rockdotlee@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-05-08 20:48:55 +02:00
Stefan Agner
2a068daf57 ubifs: Remove unnecessary assignment
Assigning a value of a variable to itself is not useful.

Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-05-08 20:48:47 +02:00
Colin Ian King
6a258f7d0f ubifs: Fix cut and paste error on sb type comparisons
The check for the bad node type of sb->type is checking sa->type
and not sb-type. This looks like a cut and paste error. Fix this.

Detected by PVS-Studio, warning: V581

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-05-08 20:48:41 +02:00
Hyunchul Lee
8326c1eec2 ubifs: Add CONFIG_UBIFS_FS_SECURITY to disable/enable security labels
When write syscall is called, every time security label is searched to
determine that file's privileges should be changed.
If LSM(Linux Security Model) is not used, this is useless.

So introduce CONFIG_UBIFS_SECURITY to disable security labels. it's default
value is "y".

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-05-08 20:48:23 +02:00
Linus Torvalds
677375cef8 Only bug fixes and cleanups for this merge window.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlkPYHkACgkQ8vlZVpUN
 gaM97ggAlOm8n/tlbcdonX/+HHjlnqcy5uYD7A9AH/JordpRzy4eqcMbxMG39p1R
 DBtjo9Y0i3iFEGajRc0h7KXDLeTBUQ/JZpR8H60MFfAQHnTowuI91eb3/6QeZiHh
 CN/2KKzpYitPIEUfEHnVeYKOfvrzR7je5hrEiAwEkPeKv7XyrNVM0LHQ/jKpbQwg
 ntIzHvxjQyo8plx/m5S4Yew7tqjYpNiq4plmyk/Vxtw2FmB/FC76UxYeadoB3EI5
 etw+bCORB0tFZO27o56kXywg+mDcp7HEtVvq9LG28oEuBDAVKNoeKEvV7SiOBlZp
 +HnqIz5Hx1UTxOlTAc10IjvEhriEuw==
 =qCDl
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Only bug fixes and cleanups for this merge window"

* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
  fscrypt: correct collision claim for digested names
  MAINTAINERS: fscrypt: update mailing list, patchwork, and git
  ext4: clean up ext4_match() and callers
  f2fs: switch to using fscrypt_match_name()
  ext4: switch to using fscrypt_match_name()
  fscrypt: introduce helper function for filename matching
  fscrypt: avoid collisions when presenting long encrypted filenames
  f2fs: check entire encrypted bigname when finding a dentry
  ubifs: check for consistent encryption contexts in ubifs_lookup()
  f2fs: sync f2fs_lookup() with ext4_lookup()
  ext4: remove "nokey" check from ext4_lookup()
  fscrypt: fix context consistency check when key(s) unavailable
  fscrypt: Remove __packed from fscrypt_policy
  fscrypt: Move key structure and constants to uapi
  fscrypt: remove fscrypt_symlink_data_len()
  fscrypt: remove unnecessary checks for NULL operations
2017-05-08 11:40:34 -07:00
Linus Torvalds
dd727dad37 Add GETFSMAP support; some performance improvements for very large
file systems and for random write workloads into a preallocated file;
 bug fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlkPYB8ACgkQ8vlZVpUN
 gaP1HwgApoMQGegtRIbCZKUzKBJ2S6vwIoPAMz62JuwngOyWygJ1T1TliKTitG04
 XvijKpUHtEggMO/ZsUOCoyr2LzJlpVvvrJZsavEubO12LKreYMpvNraZF1GACYTb
 lIZpdWkpcEz5WnPV/PXW/dEMcSMhnKe8tbmHXMyAouSC6a55F5Wp456KF/plqkHU
 zkWTCDbEOtHThzpL8cthUL71ji62I3Op5jn/qOfKCm6/JtUlw5pYjWkRUNqqjSQE
 uQqMpqLxI/VjOdEiBPxEF6A+ZudZmoBQKY15ibWCcHUPFOPqk4RdYz6VivRI7zrg
 KrrKcdFT29MtKnRfAAoJcc0nJ4e1Iw==
 =il74
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:

 - add GETFSMAP support

 - some performance improvements for very large file systems and for
   random write workloads into a preallocated file

 - bug fixes and cleanups.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: cleanup write flags handling from jbd2_write_superblock()
  ext4: mark superblock writes synchronous for nobarrier mounts
  ext4: inherit encryption xattr before other xattrs
  ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
  ext4: avoid unnecessary transaction stalls during writeback
  ext4: preload block group descriptors
  ext4: make ext4_shutdown() static
  ext4: support GETFSMAP ioctls
  vfs: add common GETFSMAP ioctl definitions
  ext4: evict inline data when writing to memory map
  ext4: remove ext4_xattr_check_entry()
  ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
  ext4: merge ext4_xattr_list() into ext4_listxattr()
  ext4: constify static data that is never modified
  ext4: trim return value and 'dir' argument from ext4_insert_dentry()
  jbd2: fix dbench4 performance regression for 'nobarrier' mounts
  jbd2: Fix lockdep splat with generic/270 test
  mm: retry writepages() on ENOMEM when doing an data integrity writeback
2017-05-08 11:30:05 -07:00
Dan Williams
ef51042472 block, dax: move "select DAX" from BLOCK to FS_DAX
For configurations that do not enable DAX filesystems or drivers, do not
require the DAX core to be built.

Given that the 'direct_access' method has been removed from
'block_device_operations', we can also go ahead and remove the
block-related dax helper functions from fs/block_dev.c to
drivers/dax/super.c. This keeps dax details out of the block layer and
lets the DAX core be built as a module in the FS_DAX=n case.

Filesystems need to include dax.h to call bdev_dax_supported().

Cc: linux-xfs@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-08 10:55:27 -07:00
Trond Myklebust
28cf22d0ba NFSv4: Fix exclusive create attributes encoding
When using NFS4_CREATE_EXCLUSIVE4_1 mode, the client will overestimate the
amount of space that it needs for the attributes because it does so
before checking whether or not the server supports a given attribute.

Fix by checking the attribute mask earlier.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-08 09:40:59 -04:00
Trond Myklebust
2e84611b3f NFSv4: Fix an rcu lock leak
The intention in the original patch was to release the lock when
we put the inode, however something got screwed up.

Reported-by: Jason Yan <yanaijie@huawei.com>
Fixes: 7b410d9ce4 ("pNFS: Delay getting the layout header in..")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-08 09:27:59 -04:00
Linus Torvalds
fe7a719b30 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "Various fixes for stable for CIFS/SMB3 especially for better
  interoperability for SMB3 to Macs.

  It also includes Pavel's improvements to SMB3 async i/o support
  (which is much faster now)"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: add misssing SFM mapping for doublequote
  SMB3: Work around mount failure when using SMB3 dialect to Macs
  cifs: fix CIFS_IOC_GET_MNT_INFO oops
  CIFS: fix mapping of SFM_SPACE and SFM_PERIOD
  CIFS: fix oplock break deadlocks
  cifs: fix CIFS_ENUMERATE_SNAPSHOTS oops
  cifs: fix leak in FSCTL_ENUM_SNAPS response handling
  Set unicode flag on cifs echo request to avoid Mac error
  CIFS: Add asynchronous write support through kernel AIO
  CIFS: Add asynchronous read support through kernel AIO
  CIFS: Add asynchronous context to support kernel AIO
  cifs: fix IPv6 link local, with scope id, address parsing
  cifs: small underflow in cnvrtDosUnixTm()
2017-05-06 11:51:46 -07:00
Linus Torvalds
d484467c86 Changes for 4.12:
- various code cleanups
 - introduce GETFSMAP ioctl
 - various refactoring
 - avoid dio reads past eof
 - fix memory corruption and other errors with fragmented directory blocks
 - fix accidental userspace memory corruptions
 - publish fs uuid in superblock
 - make fstrim terminatable
 - fix race between quotaoff and in-core inode creation
 - Avoid use-after-free when finishing up w/ buffer heads
 - Reserve enough space to handle bmap tree resizing during cow remap
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZDfIzAAoJEPh/dxk0SrTrsEgP/3TjYbaqsad2e6KqtZwqN/Qx
 DUljUxReZl4rgnAaFD55XOPYWGZ2bBGNtAQlAR7/JYZuZs6obbBrqUukS19jPVi7
 SeQdknnU3yTq17LrwEeeQUOhem28GHxYtQYazdgNoTigZXABeXWzi53HzvPw5+Ci
 3a+zB1clu3cycKsD+UAhz/m0Z40ckjDMsDueJMOACiax+vPjlzSu36H9wzlF/h0R
 nq7VGSDZy6aS3H75PDjWVxoJGUSdO7jHYxwQflkk6wxrcmTCLZxuiDeSANOZ2KxM
 y8qTln6hqxalQSH9r6n84/XrQstYWfdLqwngIL5wMSvN6UbuFyNQKuouEkWs6EEZ
 4cuSqfihT7o5VcIpYiq1ZDgNzzpmDDMMeho4J9WBvm5Qt5hgPCo3gzweE/C6Sscs
 m+V1NvLd+kBiHoMhYPB8/lm4nXa/wT1Y3TtHc+8A/qkZKAwoOdxWKNIY58jfmdzb
 Rvv0LKi+6W5zanzXlNs3NXJBwZAeHuHXKY3UJT4BAWfjdtS6QvIf1Bcpj9ApyqE2
 oOnNMRhF+wSS9dSFoPXkRjzIyoR5CoOylB0KYV9OYELYPDLczwbvtX/9+tjDEol9
 odCZyyzJtKxYQbwf2TQ/ZqXQV4vw6lWOB7G4Itx7yv0Taa9vQ7cxSX2MnE7TA/pW
 IQKsE6C2I24Bfr2oPfms
 =oKCc
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.12-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "Here are the XFS changes for 4.12. The big new feature for this
  release is the new space mapping ioctl that we've been discussing
  since LSF2016, but other than that most of the patches are larger bug
  fixes, memory corruption prevention, and other cleanups.

  Summary:
   - various code cleanups
   - introduce GETFSMAP ioctl
   - various refactoring
   - avoid dio reads past eof
   - fix memory corruption and other errors with fragmented directory blocks
   - fix accidental userspace memory corruptions
   - publish fs uuid in superblock
   - make fstrim terminatable
   - fix race between quotaoff and in-core inode creation
   - avoid use-after-free when finishing up w/ buffer heads
   - reserve enough space to handle bmap tree resizing during cow remap"

* tag 'xfs-4.12-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (53 commits)
  xfs: fix use-after-free in xfs_finish_page_writeback
  xfs: reserve enough blocks to handle btree splits when remapping
  xfs: wait on new inodes during quotaoff dquot release
  xfs: update ag iterator to support wait on new inodes
  xfs: support ability to wait on new inodes
  xfs: publish UUID in struct super_block
  xfs: Allow user to kill fstrim process
  xfs: better log intent item refcount checking
  xfs: fix up quotacheck buffer list error handling
  xfs: remove xfs_trans_ail_delete_bulk
  xfs: don't use bool values in trace buffers
  xfs: fix getfsmap userspace memory corruption while setting OF_LAST
  xfs: fix __user annotations for xfs_ioc_getfsmap
  xfs: corruption needs to respect endianess too!
  xfs: use NULL instead of 0 to initialize a pointer in xfs_ioc_getfsmap
  xfs: use NULL instead of 0 to initialize a pointer in xfs_getfsmap
  xfs: simplify validation of the unwritten extent bit
  xfs: remove unused values from xfs_exntst_t
  xfs: remove the unused XFS_MAXLINK_1 define
  xfs: more do_div cleanups
  ...
2017-05-06 11:46:16 -07:00
Linus Torvalds
044f1daaaa Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes and updates from Jens Axboe:
 "Some fixes and followup features/changes that should go in, in this
  merge window. This contains:

   - Two fixes for lightnvm from Javier, fixing problems in the new code
     merge previously in this merge window.

   - A fix from Jan for the backing device changes, fixing an issue in
     NFS that causes a failure to mount on certain setups.

   - A change from Christoph, cleaning up the blk-mq init and exit
     request paths.

   - Remove elevator_change(), which is now unused. From Bart.

   - A fix for queue operation invocation on a dead queue, from Bart.

   - A series fixing up mtip32xx for blk-mq scheduling, removing a
     bandaid we previously had in place for this. From me.

   - A regression fix for this series, fixing a case where we wait on
     workqueue flushing from an invalid (non-blocking) context. From me.

   - A fix/optimization from Ming, ensuring that we don't both quiesce
     and freeze a queue at the same time.

   - A fix from Peter on lock ordering for CPU hotplug. Not a real
     problem right now, but will be once the CPU hotplug rework goes in.

   - A series from Omar, cleaning up out blk-mq debugfs support, and
     adding support for exporting info from schedulers in debugfs as
     well. This is really useful in debugging stalls or livelocks. From
     Omar"

* 'for-linus' of git://git.kernel.dk/linux-block: (28 commits)
  mq-deadline: add debugfs attributes
  kyber: add debugfs attributes
  blk-mq-debugfs: allow schedulers to register debugfs attributes
  blk-mq: untangle debugfs and sysfs
  blk-mq: move debugfs declarations to a separate header file
  blk-mq: Do not invoke queue operations on a dead queue
  blk-mq-debugfs: get rid of a bunch of boilerplate
  blk-mq-debugfs: rename hw queue directories from <n> to hctx<n>
  blk-mq-debugfs: don't open code strstrip()
  blk-mq-debugfs: error on long write to queue "state" file
  blk-mq-debugfs: clean up flag definitions
  blk-mq-debugfs: separate flags with |
  nfs: Fix bdi handling for cloned superblocks
  block/mq: Cure cpu hotplug lock inversion
  lightnvm: fix bad back free on error path
  lightnvm: create cmd before allocating request
  blk-mq: don't use sync workqueue flushing from drivers
  mtip32xx: convert internal commands to regular block infrastructure
  mtip32xx: cleanup internal tag assumptions
  block: don't call blk_mq_quiesce_queue() after queue is frozen
  ...
2017-05-06 11:25:08 -07:00
Linus Torvalds
53ef7d0e20 libnvdimm for 4.12
* Region media error reporting: A libnvdimm region device is the parent
 to one or more namespaces. To date, media errors have been reported via
 the "badblocks" attribute attached to pmem block devices for namespaces
 in "raw" or "memory" mode. Given that namespaces can be in "device-dax"
 or "btt-sector" mode this new interface reports media errors
 generically, i.e. independent of namespace modes or state. This
 subsequently allows userspace tooling to craft "ACPI 6.1 Section
 9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and
 submit them via the ioctl path for NVDIMM root bus devices.
 
 * Introduce 'struct dax_device' and 'struct dax_operations': Prompted by
 a request from Linus and feedback from Christoph this allows for dax
 capable drivers to publish their own custom dax operations. This fixes
 the broken assumption that all dax operations are related to a
 persistent memory device, and makes it easier for other architectures
 and platforms to add customized persistent memory support.
 
 * 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
 available for storage appliance applications to manually trigger memory
 controllers to drain write-pending buffers that would otherwise be
 flushed automatically by the platform ADR (asynchronous-DRAM-refresh)
 mechanism at a power loss event. Support for "locked" DIMMs is included
 to prevent namespaces from surfacing when the namespace label data area
 is locked. Finally, fixes for various reported deadlocks and crashes,
 also tagged for -stable.
 
 * ACPI / nfit driver updates: General updates of the nfit driver to add
 DSM command overrides, ACPI 6.1 health state flags support, DSM payload
 debug available by default, and various fixes.
 
 Acknowledgements that came after the branch was pushed:
 
 commmit 565851c972 "device-dax: fix sysfs attribute deadlock"
 Tested-by: Yi Zhang <yizhan@redhat.com>
 
 commit 23f4984483 "libnvdimm: rework region badblocks clearing"
 Tested-by: Toshi Kani <toshi.kani@hpe.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZDONJAAoJEB7SkWpmfYgC3SsP/2KrLvTUcz646ViuPOgZ2cC4
 W6wAx6cvDSt+H52kLnFEsYoFt7WAj20ggPirb/Bc5jkGlvwE0lT9Xtmso9GpVkYT
 J9ZJ9pP/4YaAD3II1gmTwaUjYi0FxoOdx3Eb92yuWkO/8ylz4b2Nu3cBpYwyziGQ
 nIfEVwDXRLE86u6x0bWuf6TlVuvsbdiAI55CDqDMVQC6xIOLbSez7b8QIHlpiKEb
 Mw+xqdQva0esoreZEOXEhWNO+qtfILx8/ceBEGTNMp4e/JjZ2FbrSNplM+9bH5k7
 ywqP8lW+mBEw0fmBBkYoVG/xyesiiBb55JLnbi8Ew+7IUxw8a3iV7wftRi62lHcK
 zAjsHe4L+MansgtZsCL8wluvIPaktAdtB4xr7l9VNLKRYRUG73jEWU0gcUNryHIL
 BkQJ52pUS1PkClyAsWbBBHl1I/CvzVPd21VW0YELmLR4OywKy1c+eKw2bcYgjrb4
 59HZSv6S6EoKaQC+2qvVNpePil7cdfg5V2ubH/ki9HoYVyoxDptEWHnvf0NNatIH
 Y7mNcOPvhOksJmnKSyHbDjtRur7WoHIlC9D7UjEFkSBWsKPjxJHoidN4SnCMRtjQ
 WKQU0seoaKj04b68Bs/Qm9NozVgnsPFIUDZeLMikLFX2Jt7YSPu+Jmi2s4re6WLh
 TmJQ3Ly9t3o3/weHSzmn
 =Ox0s
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "The bulk of this has been in multiple -next releases. There were a few
  late breaking fixes and small features that got added in the last
  couple days, but the whole set has received a build success
  notification from the kbuild robot.

  Change summary:

   - Region media error reporting: A libnvdimm region device is the
     parent to one or more namespaces. To date, media errors have been
     reported via the "badblocks" attribute attached to pmem block
     devices for namespaces in "raw" or "memory" mode. Given that
     namespaces can be in "device-dax" or "btt-sector" mode this new
     interface reports media errors generically, i.e. independent of
     namespace modes or state.

     This subsequently allows userspace tooling to craft "ACPI 6.1
     Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
     requests and submit them via the ioctl path for NVDIMM root bus
     devices.

   - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
     by a request from Linus and feedback from Christoph this allows for
     dax capable drivers to publish their own custom dax operations.
     This fixes the broken assumption that all dax operations are
     related to a persistent memory device, and makes it easier for
     other architectures and platforms to add customized persistent
     memory support.

   - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
     available for storage appliance applications to manually trigger
     memory controllers to drain write-pending buffers that would
     otherwise be flushed automatically by the platform ADR
     (asynchronous-DRAM-refresh) mechanism at a power loss event.
     Support for "locked" DIMMs is included to prevent namespaces from
     surfacing when the namespace label data area is locked. Finally,
     fixes for various reported deadlocks and crashes, also tagged for
     -stable.

   - ACPI / nfit driver updates: General updates of the nfit driver to
     add DSM command overrides, ACPI 6.1 health state flags support, DSM
     payload debug available by default, and various fixes.

  Acknowledgements that came after the branch was pushed:

   - commmit 565851c972 "device-dax: fix sysfs attribute deadlock":
     Tested-by: Yi Zhang <yizhan@redhat.com>

   - commit 23f4984483 "libnvdimm: rework region badblocks clearing"
     Tested-by: Toshi Kani <toshi.kani@hpe.com>"

* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
  libnvdimm, pfn: fix 'npfns' vs section alignment
  libnvdimm: handle locked label storage areas
  libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
  brd: fix uninitialized use of brd->dax_dev
  block, dax: use correct format string in bdev_dax_supported
  device-dax: fix sysfs attribute deadlock
  libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
  libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
  libnvdimm: rework region badblocks clearing
  acpi, nfit: kill ACPI_NFIT_DEBUG
  libnvdimm: fix clear length of nvdimm_forget_poison()
  libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
  libnvdimm, region: sysfs trigger for nvdimm_flush()
  libnvdimm: fix phys_addr for nvdimm_clear_poison
  x86, dax, pmem: remove indirection around memcpy_from_pmem()
  block: remove block_device_operations ->direct_access()
  block, dax: convert bdev_dax_supported() to dax_direct_access()
  filesystem-dax: convert to dax_direct_access()
  Revert "block: use DAX for partition table reads"
  ext2, ext4, xfs: retrieve dax_device for iomap operations
  ...
2017-05-05 18:49:20 -07:00
Linus Torvalds
1a5fb64fee We've got ten GFS2 patches for this merge window.
1. Andreas Gruenbacher wrote a patch to replace the deprecated
    call to rhashtable_walk_init with rhashtable_walk_enter.
 2. Andreas also wrote a patch to eliminate redundant code in
    two of our debugfs sequence files.
 3. Andreas also cleaned up the rhashtable key ugliness Linus
    pointed out during this cycle, following Linus's suggestions.
 4. Andreas also wrote a patch to take advantage of his new
    function rhashtable_lookup_get_insert_fast. This makes glock
    lookup faster and more bullet-proof.
 5. Andreas also wrote a patch to revert a patch in the evict
    path that caused occasional deadlocks, and is no longer
    needed.
 6. Andrew Price wrote a patch to re-enable fallocate for the
    rindex system file to enable gfs2_grow to grow properly on
    secondary file system grow operations.
 7. I wrote a patch to initialize an inode number field to make
    certain kernel trace points more understandable.
 8. I also wrote a patch that makes GFS2 file system "withdraw"
    work more like it should by ignoring operations after a
    withdraw that would formerly cause a BUG() and kernel panic.
 9. I also reworked the entire truncate/delete algorithm,
    scrapping the old recursive algorithm in favor of a new
    non-recursive algorithm. This was done for performance:
    This way, GFS2 no longer needs to lock multiple resource
    groups while doing truncates and deletes of files that cross
    multiple resource group boundaries, allowing for better
    parallelism. It also solves a problem whereby deleting large
    files would request a large chunk of kernel memory, which
    resulted in a get_page_from_freelist warning.
 10. Due to a regression found during testing, I added a new
     patch to correct "GFS2: Prevent BUG from occurring when
     normal Withdraws occur".
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZDNnaAAoJENeLYdPf93o7B7kIAJzwz7vVDVg2TpWVhMmXIWhf
 rZx3Gth5F0h+ZHddW7HzTLg+64XQ5//GyDD3UDtCpkhl5SJH+nt3juHyPJlRwioT
 0ua4SjyKLQSoJJVAEgAwu42QjORTXab7NjYn5LEhvRc0Gg/El9WGU+ZgmP2/aAvf
 KE2u/IEYNDkoJNS3Oqc7shajAyLYda6wCAASs/1ZGt9u48m/o/I23Zd7wr7EOkzw
 rd3gB0x80cJqDAB5IcymGOm111Tg4g34LwsRuyMnWE3H1jOgV+J515FVHEIvZuPq
 Wl9X7V8CzktI7nyLKVnZhpuv5JzyMq/vOPiD01tTFx8Oy1JCRezjmATXFjW/zIo=
 =MX3c
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.12.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull GFS2 updates from Bob Peterson:
 "We've got ten GFS2 patches for this merge window.

   - Andreas Gruenbacher wrote a patch to replace the deprecated call to
     rhashtable_walk_init with rhashtable_walk_enter.

   - Andreas also wrote a patch to eliminate redundant code in two of
     our debugfs sequence files.

   - Andreas also cleaned up the rhashtable key ugliness Linus pointed
     out during this cycle, following Linus's suggestions.

   - Andreas also wrote a patch to take advantage of his new function
     rhashtable_lookup_get_insert_fast. This makes glock lookup faster
     and more bullet-proof.

   - Andreas also wrote a patch to revert a patch in the evict path that
     caused occasional deadlocks, and is no longer needed.

   - Andrew Price wrote a patch to re-enable fallocate for the rindex
     system file to enable gfs2_grow to grow properly on secondary file
     system grow operations.

   - I wrote a patch to initialize an inode number field to make certain
     kernel trace points more understandable.

   - I also wrote a patch that makes GFS2 file system "withdraw" work
     more like it should by ignoring operations after a withdraw that
     would formerly cause a BUG() and kernel panic.

   - I also reworked the entire truncate/delete algorithm, scrapping the
     old recursive algorithm in favor of a new non-recursive algorithm.
     This was done for performance: This way, GFS2 no longer needs to
     lock multiple resource groups while doing truncates and deletes of
     files that cross multiple resource group boundaries, allowing for
     better parallelism. It also solves a problem whereby deleting large
     files would request a large chunk of kernel memory, which resulted
     in a get_page_from_freelist warning.

   - Due to a regression found during testing, I added a new patch to
     correct 'GFS2: Prevent BUG from occurring when normal Withdraws
     occur'."

* tag 'gfs2-4.12.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  GFS2: Allow glocks to be unlocked after withdraw
  GFS2: Non-recursive delete
  gfs2: Re-enable fallocate for the rindex
  Revert "GFS2: Wait for iopen glock dequeues"
  gfs2: Switch to rhashtable_lookup_get_insert_fast
  GFS2: Temporarily zero i_no_addr when creating a dinode
  gfs2: Don't pack struct lm_lockname
  gfs2: Deduplicate gfs2_{glocks,glstats}_open
  gfs2: Replace rhashtable_walk_init with rhashtable_walk_enter
  GFS2: Prevent BUG from occurring when normal Withdraws occur
2017-05-05 13:40:20 -07:00
Linus Torvalds
aeced66196 Some cleanups:
remove unused get_fsid_from_ino
   fix bounds check for listxattr
   clean up oversize xattr validation
   do not set getattr_time on orangefs_lookup
   return from orangefs_devreq_read quickly if possible
   do not wait for timeout if umounting
   handle zero size write in debugfs
 
 Bug fixes:
 
   do not check possibly stale size on truncate
   ensure the userspace component is unmounted if mount fails
   total reimplementation of dir.c
 
 New feature:
 
   implement statx
 
 The new implementation of dir.c is kind of a big deal, all new
 code. It has been posted to fs-devel during the previous rc period,
 we didn't get much review or feedback from there, but it has been reviewed
 very heavily here, so much so that we have two entire versions of the
 reimplementation. Not only does the new implementation fix some
 xfstests, but it passes all the new tests we made here that involve
 seeking and rewinding and giant directories and long file names.
 The new dir code has three patches itself:
 
   skip forward to the next directory entry if seek is short
   invalidate stored directory on seek
   count directory pieces correctly
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZDLasAAoJEM9EDqnrzg2+yR4P/jNsryNfQush5V/6EO+wpQ7p
 O0epuLG42QMN67wdsQDVOOzcRQq2IAoYrgupZfEvCVsoBiYxdCTTwhN/55UctMBA
 xnakv8BarrLd6pqSJOlQviP7ByXdEvy7dtYYuAEtdRnPtTZEmjDH0k9ME759+DVm
 pPQ6fanPzSZuG8fdjI4QrKiFfpE5slMeyMV9SmzIq81S1i+t2b9sDYKTiP3Jt14y
 KTweGdJXTRT+Piy27d80HN9ExlFXlcyru9GDWNhZi4EHlax7bq76Qwu1XKyaOg0h
 MN40+18k+Zqrpj1/tq4aj3YM0P3HjpRhtb5TqOC+QhZDIL1gJ8bv8rv61snWTak+
 6cXtwvIh7r4aEU+gkMLP29HXCVlGg3V4up+DdbHJVbIEXV8C5csJBP+sQUlU7A5D
 WoPmheV7CJ8nicwkxYm31dhdnW7mOwW/J4uUlM9w/yU/dVfoz1SK8AtKjy0xX87c
 Jpo7nuJEDprI+9neT0y5U+RHVqH08+cA5DCrdk0x8JaJIrjOZpvTROIPrtzlS7QL
 aTu+W/ISXtFwnM+ERmw8TKPD7TTUXypydYhzXe8V6itDpiNp1kQFGmLGzLhAMElH
 iGQkFatR6LSKh+DxUD3PREQGNyQCKpgPiqLoGYprzQ829tqLpThumfZic9lX1C/+
 we5VEpRbiz6BjN110DBJ
 =NGTt
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.12-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "Orangefs cleanups, fixes and statx support.

  Some cleanups:

   - remove unused get_fsid_from_ino
   - fix bounds check for listxattr
   - clean up oversize xattr validation
   - do not set getattr_time on orangefs_lookup
   - return from orangefs_devreq_read quickly if possible
   - do not wait for timeout if umounting
   - handle zero size write in debugfs

  Bug fixes:

   - do not check possibly stale size on truncate
   - ensure the userspace component is unmounted if mount fails
   - total reimplementation of dir.c

  New feature:

   - implement statx

  The new implementation of dir.c is kind of a big deal, all new code.
  It has been posted to fs-devel during the previous rc period, we
  didn't get much review or feedback from there, but it has been
  reviewed very heavily here, so much so that we have two entire
  versions of the reimplementation.

  Not only does the new implementation fix some xfstests, but it passes
  all the new tests we made here that involve seeking and rewinding and
  giant directories and long file names. The new dir code has three
  patches itself:

   - skip forward to the next directory entry if seek is short
   - invalidate stored directory on seek
   - count directory pieces correctly"

* tag 'for-linus-4.12-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: count directory pieces correctly
  orangefs: invalidate stored directory on seek
  orangefs: skip forward to the next directory entry if seek is short
  orangefs: handle zero size write in debugfs
  orangefs: do not wait for timeout if umounting
  orangefs: return from orangefs_devreq_read quickly if possible
  orangefs: ensure the userspace component is unmounted if mount fails
  orangefs: do not check possibly stale size on truncate
  orangefs: implement statx
  orangefs: remove ORANGEFS_READDIR macros
  orangefs: support very large directories
  orangefs: support llseek on directories
  orangefs: rewrite readdir to fix several bugs
  orangefs: do not set getattr_time on orangefs_lookup
  orangefs: clean up oversize xattr validation
  orangefs: fix bounds check for listxattr
  orangefs: remove unused get_fsid_from_ino
2017-05-05 13:36:10 -07:00
Linus Torvalds
414975eb76 befs fixes for 4.12-rc1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJZDFZyAAoJEGu/nxmHO1GNpz0IAINPEyXe9zAc/K74u5mIUPKT
 MqK/ifAYdOmGDu9kB68tXFQ5o3GNmAjWI4P8/T6oGlK9IudChrwTBY9Gss7iaawc
 +sNu71NmnyxbWHb7w71kIdhwNiHWolgZva1Ex9yaQYqRAy/JapCke9gs5TiruM4j
 zObaZnw48RwVyvU/Xixoz0hOLDGkPltOdy3tkWmy9v8sg/jSf+HF1FpAIfyO4pm+
 Kf2YR9IEkHhHwhoVEbHeSOjH/Tgb8gO8Suh4OnPRAP3gnVLWhb5Deh7Pjlgoj8Gn
 am2KFSkpShwvNG+yXufEwS4p7ERNd4u3uk/IWhJTuw6sE08L+dFU4Rj+DdxR2eY=
 =sENx
 -----END PGP SIGNATURE-----

Merge tag 'befs-v4.12-rc1' of git://github.com/luisbg/linux-befs

Pull befs fix from Luis de Bethencourt:
 "One fix from Fabian Frederick making the nfs client still work after a
  cache drop"

* tag 'befs-v4.12-rc1' of git://github.com/luisbg/linux-befs:
  befs: make export work with cold dcache
2017-05-05 13:33:38 -07:00
Fabian Frederick
6b4657667b fs/affs: add rename exchange
Process RENAME_EXCHANGE in affs_rename2() adding static
affs_xrename() based on affs_rename().

We remove headers from respective directories then
affect bh to other inode directory entries for swapping.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-05 15:24:52 -04:00
Fabian Frederick
c6184028a7 fs/affs: add rename2 to prepare multiple methods
Currently AFFS only supports RENAME_NOREPLACE.
This patch isolates that method to a static function to
prepare RENAME_EXCHANGE addition.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-05 15:24:52 -04:00
Bob Peterson
ed17545d01 GFS2: Allow glocks to be unlocked after withdraw
This bug fixes a regression introduced by patch 0d1c7ae9d8.

The intent of the patch was to stop promoting glocks after a
file system is withdrawn due to a variety of errors, because doing
so results in a BUG(). (You should be able to unmount after a
withdraw rather than having the kernel panic.)

Unfortunately, it also stopped demotions, so glocks could not be
unlocked after withdraw, which means the unmount would hang.

This patch allows function do_xmote to demote locks to an
unlocked state after a withdraw, but not promote them.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-05-05 14:19:28 -05:00
Eryu Guan
161f55efba xfs: fix use-after-free in xfs_finish_page_writeback
Commit 28b783e47a ("xfs: bufferhead chains are invalid after
end_page_writeback") fixed one use-after-free issue by
pre-calculating the loop conditionals before calling bh->b_end_io()
in the end_io processing loop, but it assigned 'next' pointer before
checking end offset boundary & breaking the loop, at which point the
bh might be freed already, and caused use-after-free.

This is caught by KASAN when running fstests generic/127 on sub-page
block size XFS.

[ 2517.244502] run fstests generic/127 at 2017-04-27 07:30:50
[ 2747.868840] ==================================================================
[ 2747.876949] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3d3/0x4e0 [xfs] at addr ffff8801395ae698
...
[ 2747.918245] Call Trace:
[ 2747.920975]  dump_stack+0x63/0x84
[ 2747.924673]  kasan_object_err+0x21/0x70
[ 2747.928950]  kasan_report+0x271/0x530
[ 2747.933064]  ? xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
[ 2747.938409]  ? end_page_writeback+0xce/0x110
[ 2747.943171]  __asan_report_load8_noabort+0x19/0x20
[ 2747.948545]  xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
[ 2747.953724]  xfs_end_io+0x1af/0x2b0 [xfs]
[ 2747.958197]  process_one_work+0x5ff/0x1000
[ 2747.962766]  worker_thread+0xe4/0x10e0
[ 2747.966946]  kthread+0x2d3/0x3d0
[ 2747.970546]  ? process_one_work+0x1000/0x1000
[ 2747.975405]  ? kthread_create_on_node+0xc0/0xc0
[ 2747.980457]  ? syscall_return_slowpath+0xe6/0x140
[ 2747.985706]  ? do_page_fault+0x30/0x80
[ 2747.989887]  ret_from_fork+0x2c/0x40
[ 2747.993874] Object at ffff8801395ae690, in cache buffer_head size: 104
[ 2748.001155] Allocated:
[ 2748.003782] PID = 8327
[ 2748.006411]  save_stack_trace+0x1b/0x20
[ 2748.010688]  save_stack+0x46/0xd0
[ 2748.014383]  kasan_kmalloc+0xad/0xe0
[ 2748.018370]  kasan_slab_alloc+0x12/0x20
[ 2748.022648]  kmem_cache_alloc+0xb8/0x1b0
[ 2748.027024]  alloc_buffer_head+0x22/0xc0
[ 2748.031399]  alloc_page_buffers+0xd1/0x250
[ 2748.035968]  create_empty_buffers+0x30/0x410
[ 2748.040730]  create_page_buffers+0x120/0x1b0
[ 2748.045493]  __block_write_begin_int+0x17a/0x1800
[ 2748.050740]  iomap_write_begin+0x100/0x2f0
[ 2748.055308]  iomap_zero_range_actor+0x253/0x5c0
[ 2748.060362]  iomap_apply+0x157/0x270
[ 2748.064347]  iomap_zero_range+0x5a/0x80
[ 2748.068624]  iomap_truncate_page+0x6b/0xa0
[ 2748.073227]  xfs_setattr_size+0x1f7/0xa10 [xfs]
[ 2748.078312]  xfs_vn_setattr_size+0x68/0x140 [xfs]
[ 2748.083589]  xfs_file_fallocate+0x4ac/0x820 [xfs]
[ 2748.088838]  vfs_fallocate+0x2cf/0x780
[ 2748.093021]  SyS_fallocate+0x48/0x80
[ 2748.097006]  do_syscall_64+0x18a/0x430
[ 2748.101186]  return_from_SYSCALL_64+0x0/0x6a
[ 2748.105948] Freed:
[ 2748.108189] PID = 8327
[ 2748.110816]  save_stack_trace+0x1b/0x20
[ 2748.115093]  save_stack+0x46/0xd0
[ 2748.118788]  kasan_slab_free+0x73/0xc0
[ 2748.122969]  kmem_cache_free+0x7a/0x200
[ 2748.127247]  free_buffer_head+0x41/0x80
[ 2748.131524]  try_to_free_buffers+0x178/0x250
[ 2748.136316]  xfs_vm_releasepage+0x2e9/0x3d0 [xfs]
[ 2748.141563]  try_to_release_page+0x100/0x180
[ 2748.146325]  invalidate_inode_pages2_range+0x7da/0xcf0
[ 2748.152087]  xfs_shift_file_space+0x37d/0x6e0 [xfs]
[ 2748.157557]  xfs_collapse_file_space+0x49/0x120 [xfs]
[ 2748.163223]  xfs_file_fallocate+0x2a7/0x820 [xfs]
[ 2748.168462]  vfs_fallocate+0x2cf/0x780
[ 2748.172642]  SyS_fallocate+0x48/0x80
[ 2748.176629]  do_syscall_64+0x18a/0x430
[ 2748.180810]  return_from_SYSCALL_64+0x0/0x6a

Fixed it by checking on offset against end & breaking out first,
dereference bh only if there're still bufferheads to process.

Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-05 12:16:48 -07:00
Linus Torvalds
e579dde654 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This is a set of small fixes that were mostly stumbled over during
  more significant development. This proc fix and the fix to
  posix-timers are the most significant of the lot.

  There is a lot of good development going on but unfortunately it
  didn't quite make the merge window"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Fix unbalanced hard link numbers
  signal: Make kill_proc_info static
  rlimit: Properly call security_task_setrlimit
  signal: Remove unused definition of sig_user_definied
  ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
  ipc: Remove unused declaration of recompute_msgmni
  posix-timers: Correct sanity check in posix_cpu_nsleep
  sysctl: Remove dead register_sysctl_root
2017-05-05 11:08:43 -07:00
Fabian Frederick
0795bf8357 nfs: use kmap/kunmap directly
This patch removes useless nfs_readdir_get_array() and
nfs_readdir_release_array() as suggested by Trond Myklebust

nfs_readdir() calls nfs_revalidate_mapping() before
readdir_search_pagecache() , nfs_do_filldir(), uncached_readdir()
so mapping should be correct.

While kmap() can't fail, all subsequent error checks were removed
as well as unused labels.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-05 13:01:33 -04:00
Hou Tao
59b86d85a7 NFS: always treat the invocation of nfs_getattr as cache hit when noac is on
When using 'ls -l' to display a large directory, if noac option is used,
in function nfs_getattr() nfs_need_revalidate_inode() will always be true
for NFSv3 and the nfs_entry cache of the directory will be flushed. The
flush will lead to a fully reread of the directory entries from server.

To prevent the unnecessary RPCs, we need to check whether or not the
noac option is used, and always report the invocation of nfs_getattr()
as cache hit instead cache miss when it's on.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-05 13:01:32 -04:00
Dave Wysochanski
5c737cb299 Fix nfs_client refcounting if kmalloc fails in nfs4_proc_exchange_id and nfs4_proc_async_renew
If memory allocation fails for the callback data, we need to put the nfs_client
or we end up with an elevated refcount.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-05 13:01:32 -04:00
Trond Myklebust
0048fdd066 NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION
If the server returns NFS4ERR_CONN_NOT_BOUND_TO_SESSION because we
are trunking, then RECLAIM_COMPLETE must handle that by calling
nfs4_schedule_session_recovery() and then retrying.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
2017-05-05 12:01:50 -04:00
Björn Jacke
85435d7a15 CIFS: add misssing SFM mapping for doublequote
SFM is mapping doublequote to 0xF020

Without this patch creating files with doublequote fails to Windows/Mac

Signed-off-by: Bjoern Jacke <bjacke@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: stable <stable@vger.kernel.org>
2017-05-05 08:33:44 -05:00
Fabian Frederick
dcfd9b215b befs: make export work with cold dcache
based on commit b3b42c0dea
("fs/affs: make export work with cold dcache")

This adds get_parent function so that nfs client can still work after
cache drop (Tested on NFS v4 with echo 3 > /proc/sys/vm/drop_caches)

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2017-05-05 11:35:35 +01:00
Amir Goldstein
5b6c9053fb ovl: persistent inode numbers for upper hardlinks
An upper type non directory dentry that is a copy up target
should have a reference to its lower copy up origin.

There are three ways for an upper type dentry to be instantiated:
1. A lower type dentry that is being copied up
2. An entry that is found in upper dir by ovl_lookup()
3. A negative dentry is hardlinked to an upper type dentry

In the first case, the lower reference is set before copy up.
In the second case, the lower reference is found by ovl_lookup().
In the last case of hardlinked upper dentry, it is not easy to
update the lower reference of the negative dentry.  Instead,
drop the newly hardlinked negative dentry from dcache and let
the next access call ovl_lookup() to find its lower reference.

This makes sure that the inode number reported by stat(2) after
the hardlink is created is the same inode number that will be
reported by stat(2) after mount cycle, which is the inode number
of the lower copy up origin of the hardlink source.

NOTE that this does not fix breaking of lower hardlinks on copy
up, but only fixes the case of lower nlink == 1, whose upper copy
up inode is hardlinked in upper dir.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Miklos Szeredi
5b712091a3 ovl: merge getattr for dir and nondir
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
72b608f085 ovl: constant st_ino/st_dev across copy up
When all layers are on the same underlying filesystem, let stat(2) return
st_dev/st_ino values of the copy up origin inode if it is known.

This results in constant st_ino/st_dev representation of files in an
overlay mount before and after copy up.

When the underlying filesystem support NFS exportfs, the result is also
persistent st_ino/st_dev representation before and after mount cycle.

Lower hardlinks are broken on copy up to different upper files, so we
cannot use the lower origin st_ino for those different files, even for the
same fs case.

When all overlay layers are on the same fs, use overlay st_dev for non-dirs
to get the correct result from du -x.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
b7a807dc20 ovl: persistent inode number for directories
stat(2) on overlay directories reports the overlay temp inode
number, which is constant across copy up, but is not persistent.

When all layers are on the same fs, report the copy up origin inode
number for directories.

This inode number is persistent, unique across the overlay mount and
constant across copy up.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
595485033d ovl: set the ORIGIN type flag
For directory entries, non zero oe->numlower implies OVL_TYPE_MERGE.
Define a new type flag OVL_TYPE_ORIGIN to indicate that an entry holds a
reference to its lower copy up origin.

For directory entries ORIGIN := MERGE && UPPER. For non-dir entries ORIGIN
means that a lower type dentry has been recently copied up or that we were
able to find the copy up origin from overlay.origin xattr.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
a9d019573e ovl: lookup non-dir copy-up-origin by file handle
If overlay.origin xattr is found on a non-dir upper inode try to get lower
dentry by calling exportfs_decode_fh().

On failure to lookup by file handle to lower layer, do not lookup the copy
up origin by name, because the lower found by name could be another file in
case the upper file was renamed.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
c22205d058 ovl: use an auxiliary var for overlay root entry
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
3a1e819b4e ovl: store file handle of lower inode on copy up
Sometimes it is interesting to know if an upper file is pure upper or a
copy up target, and if it is a copy up target, it may be interesting to
find the copy up origin.

This will be used to preserve lower inode numbers across copy up.

Store the lower inode file handle in upper inode extended attribute
overlay.origin on copy up to use it later for these cases.  Store the lower
filesystem uuid along side the file handle, so we can validate that we are
looking for the origin file in the original fs.

If lower fs does not support NFS export ops store a zero sized xattr so we
can always use the overlay.origin xattr to distinguish between a copy up
and a pure upper inode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
7bcd74b98d ovl: check if all layers are on the same fs
Some features can only work when all layers are on the same fs.  Test this
condition during mount time, so features can check them later.

Add helper ovl_same_sb() to return the common super block in case all
layers are on the same fs.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:57 +02:00
Linus Torvalds
af82455f7d char/misc patches for 4.12-rc1
Here is the big set of new char/misc driver drivers and features for
 4.12-rc1.
 
 There's lots of new drivers added this time around, new firmware drivers
 from Google, more auxdisplay drivers, extcon drivers, fpga drivers, and
 a bunch of other driver updates.  Nothing major, except if you happen to
 have the hardware for these drivers, and then you will be happy :)
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWQvAgg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yknsACgzkAeyz16Z97J3UTaeejbR7nKUCAAoKY4WEHY
 8O9f9pr9gj8GMBwxeZQa
 =OIfB
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the big set of new char/misc driver drivers and features for
  4.12-rc1.

  There's lots of new drivers added this time around, new firmware
  drivers from Google, more auxdisplay drivers, extcon drivers, fpga
  drivers, and a bunch of other driver updates. Nothing major, except if
  you happen to have the hardware for these drivers, and then you will
  be happy :)

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (136 commits)
  firmware: google memconsole: Fix return value check in platform_memconsole_init()
  firmware: Google VPD: Fix return value check in vpd_platform_init()
  goldfish_pipe: fix build warning about using too much stack.
  goldfish_pipe: An implementation of more parallel pipe
  fpga fr br: update supported version numbers
  fpga: region: release FPGA region reference in error path
  fpga altera-hps2fpga: disable/unprepare clock on error in alt_fpga_bridge_probe()
  mei: drop the TODO from samples
  firmware: Google VPD sysfs driver
  firmware: Google VPD: import lib_vpd source files
  misc: lkdtm: Add volatile to intentional NULL pointer reference
  eeprom: idt_89hpesx: Add OF device ID table
  misc: ds1682: Add OF device ID table
  misc: tsl2550: Add OF device ID table
  w1: Remove unneeded use of assert() and remove w1_log.h
  w1: Use kernel common min() implementation
  uio_mf624: Align memory regions to page size and set correct offsets
  uio_mf624: Refactor memory info initialization
  uio: Allow handling of non page-aligned memory regions
  hangcheck-timer: Fix typo in comment
  ...
2017-05-04 19:15:35 -07:00
Chris Mason
9bcaaea741 btrfs: fix the gfp_mask for the reada_zones radix tree
Commits cc8385b59e and 7ef70b4d99 added preallocation for the
reada radix trees and also switched them over to GFP_KERNEL for the
default gfp mask.

Since we're doing radix tree insertions under spinlocks, we need
to make sure the mask doesn't allow sleeping.  This fix keeps
the radix preallocation but switches back to the original gfp_mask.

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-05-04 16:56:11 -07:00
Martin Brandenburg
2f713b5c7d orangefs: count directory pieces correctly
A large directory full of differently sized file names triggered this.
Most directories, even very large directories with shorter names, would
be lucky enough to fit in one server response.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-05-04 14:38:24 -04:00
Martin Brandenburg
942835d68f orangefs: invalidate stored directory on seek
If an application seeks to a position before the point which has been
read, it must want updates which have been made to the directory.  So
delete the copy stored in the kernel so it will be fetched again.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-05-04 14:38:15 -04:00
Martin Brandenburg
bf15ba7c1f orangefs: skip forward to the next directory entry if seek is short
If userspace seeks to a position in the stream which is not correct, it
would have returned EIO because the data in the buffer at that offset
would be incorrect.  This and the userspace daemon returning a corrupt
directory are indistinguishable.

Now if the data does not look right, skip forward to the next chunk and
try again.  The motivation is that if the directory changes, an
application may seek to a position that was valid and no longer is valid.

It is not yet possible for a directory to change.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-05-04 14:38:10 -04:00
Eric Biggers
d9b9f8d5a8 ext4: clean up ext4_match() and callers
When ext4 encryption was originally merged, we were encrypting the
user-specified filename in ext4_match(), introducing a lot of additional
complexity into ext4_match() and its callers.  This has since been
changed to encrypt the filename earlier, so we can remove the gunk
that's no longer needed.  This more or less reverts ext4_search_dir()
and ext4_find_dest_de() to the way they were in the v4.0 kernel.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:40 -04:00
Eric Biggers
1f73d49177 f2fs: switch to using fscrypt_match_name()
Switch f2fs directory searches to use the fscrypt_match_name() helper
function.  There should be no functional change.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:39 -04:00
Eric Biggers
067d1023b6 ext4: switch to using fscrypt_match_name()
Switch ext4 directory searches to use the fscrypt_match_name() helper
function.  There should be no functional change.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:38 -04:00
Eric Biggers
17159420a6 fscrypt: introduce helper function for filename matching
Introduce a helper function fscrypt_match_name() which tests whether a
fscrypt_name matches a directory entry.  Also clean up the magic numbers
and document things properly.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:37 -04:00
Eric Biggers
6b06cdee81 fscrypt: avoid collisions when presenting long encrypted filenames
When accessing an encrypted directory without the key, userspace must
operate on filenames derived from the ciphertext names, which contain
arbitrary bytes.  Since we must support filenames as long as NAME_MAX,
we can't always just base64-encode the ciphertext, since that may make
it too long.  Currently, this is solved by presenting long names in an
abbreviated form containing any needed filesystem-specific hashes (e.g.
to identify a directory block), then the last 16 bytes of ciphertext.
This needs to be sufficient to identify the actual name on lookup.

However, there is a bug.  It seems to have been assumed that due to the
use of a CBC (ciphertext block chaining)-based encryption mode, the last
16 bytes (i.e. the AES block size) of ciphertext would depend on the
full plaintext, preventing collisions.  However, we actually use CBC
with ciphertext stealing (CTS), which handles the last two blocks
specially, causing them to appear "flipped".  Thus, it's actually the
second-to-last block which depends on the full plaintext.

This caused long filenames that differ only near the end of their
plaintexts to, when observed without the key, point to the wrong inode
and be undeletable.  For example, with ext4:

    # echo pass | e4crypt add_key -p 16 edir/
    # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
    # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
    100000
    # sync
    # echo 3 > /proc/sys/vm/drop_caches
    # keyctl new_session
    # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
    2004
    # rm -rf edir/
    rm: cannot remove 'edir/_A7nNFi3rhkEQlJ6P,hdzluhODKOeWx5V': Structure needs cleaning
    ...

To fix this, when presenting long encrypted filenames, encode the
second-to-last block of ciphertext rather than the last 16 bytes.

Although it would be nice to solve this without depending on a specific
encryption mode, that would mean doing a cryptographic hash like SHA-256
which would be much less efficient.  This way is sufficient for now, and
it's still compatible with encryption modes like HEH which are strong
pseudorandom permutations.  Also, changing the presented names is still
allowed at any time because they are only provided to allow applications
to do things like delete encrypted directories.  They're not designed to
be used to persistently identify files --- which would be hard to do
anyway, given that they're encrypted after all.

For ease of backports, this patch only makes the minimal fix to both
ext4 and f2fs.  It leaves ubifs as-is, since ubifs doesn't compare the
ciphertext block yet.  Follow-on patches will clean things up properly
and make the filesystems use a shared helper function.

Fixes: 5de0b4d0cd ("ext4 crypto: simplify and speed up filename encryption")
Reported-by: Gwendal Grignou <gwendal@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:36 -04:00
Jaegeuk Kim
6332cd32c8 f2fs: check entire encrypted bigname when finding a dentry
If user has no key under an encrypted dir, fscrypt gives digested dentries.
Previously, when looking up a dentry, f2fs only checks its hash value with
first 4 bytes of the digested dentry, which didn't handle hash collisions fully.
This patch enhances to check entire dentry bytes likewise ext4.

Eric reported how to reproduce this issue by:

 # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
 # find edir -type f | xargs stat -c %i | sort | uniq | wc -l
100000
 # sync
 # echo 3 > /proc/sys/vm/drop_caches
 # keyctl new_session
 # find edir -type f | xargs stat -c %i | sort | uniq | wc -l
99999

Cc: <stable@vger.kernel.org>
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(fixed f2fs_dentry_hash() to work even when the hash is 0)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:35 -04:00
Eric Biggers
413d5a9edb ubifs: check for consistent encryption contexts in ubifs_lookup()
As ext4 and f2fs do, ubifs should check for consistent encryption
contexts during ->lookup() in an encrypted directory.  This protects
certain users of filesystem encryption against certain types of offline
attacks.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:35 -04:00
Eric Biggers
faac7fd97e f2fs: sync f2fs_lookup() with ext4_lookup()
As for ext4, now that fscrypt_has_permitted_context() correctly handles
the case where we have the key for the parent directory but not the
child, f2fs_lookup() no longer has to work around it.  Also add the same
warning message that ext4 uses.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:34 -04:00
Eric Biggers
8c68084bff ext4: remove "nokey" check from ext4_lookup()
Now that fscrypt_has_permitted_context() correctly handles the case
where we have the key for the parent directory but not the child, we
don't need to try to work around this in ext4_lookup().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:33 -04:00
Eric Biggers
272f98f684 fscrypt: fix context consistency check when key(s) unavailable
To mitigate some types of offline attacks, filesystem encryption is
designed to enforce that all files in an encrypted directory tree use
the same encryption policy (i.e. the same encryption context excluding
the nonce).  However, the fscrypt_has_permitted_context() function which
enforces this relies on comparing struct fscrypt_info's, which are only
available when we have the encryption keys.  This can cause two
incorrect behaviors:

1. If we have the parent directory's key but not the child's key, or
   vice versa, then fscrypt_has_permitted_context() returned false,
   causing applications to see EPERM or ENOKEY.  This is incorrect if
   the encryption contexts are in fact consistent.  Although we'd
   normally have either both keys or neither key in that case since the
   master_key_descriptors would be the same, this is not guaranteed
   because keys can be added or removed from keyrings at any time.

2. If we have neither the parent's key nor the child's key, then
   fscrypt_has_permitted_context() returned true, causing applications
   to see no error (or else an error for some other reason).  This is
   incorrect if the encryption contexts are in fact inconsistent, since
   in that case we should deny access.

To fix this, retrieve and compare the fscrypt_contexts if we are unable
to set up both fscrypt_infos.

While this slightly hurts performance when accessing an encrypted
directory tree without the key, this isn't a case we really need to be
optimizing for; access *with* the key is much more important.
Furthermore, the performance hit is barely noticeable given that we are
already retrieving the fscrypt_context and doing two keyring searches in
fscrypt_get_encryption_info().  If we ever actually wanted to optimize
this case we might start by caching the fscrypt_contexts.

Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:43:17 -04:00
Jan Kara
17f423b516 jbd2: cleanup write flags handling from jbd2_write_superblock()
Currently jbd2_write_superblock() silently adds REQ_SYNC to flags with
which journal superblock is written. Make this explicit by making flags
passed down to jbd2_write_superblock() contain REQ_SYNC.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:01:31 -04:00
Jan Kara
00473374b7 ext4: mark superblock writes synchronous for nobarrier mounts
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_FUA implementation.
generic_make_request_checks() however strips REQ_FUA flag from a bio
when the storage doesn't report volatile write cache and thus write
effectively becomes asynchronous which can lead to performance
regressions. This affects superblock writes for ext4. Fix the problem
by marking superblock writes always as synchronous.

Fixes: b685d3d65a
CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 10:58:03 -04:00
Jan Kara
9052c7cf49 nfs: Fix bdi handling for cloned superblocks
In commit 0d3b12584972 "nfs: Convert to separately allocated bdi" I have
wrongly cloned bdi reference in nfs_clone_super(). Further inspection
has shown that originally the code was actually allocating a new bdi (in
->clone_server callback) which was later registered in
nfs_fs_mount_common() and used for sb->s_bdi in nfs_initialise_sb().
This could later result in bdi for the original superblock not getting
unregistered when that superblock got shutdown (as the cloned sb still
held bdi reference) and later when a new superblock was created under
the same anonymous device number, a clash in sysfs has happened on bdi
registration:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 10284 at /linux-next/fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x74
sysfs: cannot create duplicate filename '/devices/virtual/bdi/0:32'
Modules linked in: axp20x_usb_power gpio_axp209 nvmem_sunxi_sid sun4i_dma sun4i_ss virt_dma
CPU: 1 PID: 10284 Comm: mount.nfs Not tainted 4.11.0-rc4+ #14
Hardware name: Allwinner sun7i (A20) Family
[<c010f19c>] (unwind_backtrace) from [<c010bc74>] (show_stack+0x10/0x14)
[<c010bc74>] (show_stack) from [<c03c6e24>] (dump_stack+0x78/0x8c)
[<c03c6e24>] (dump_stack) from [<c0122200>] (__warn+0xe8/0x100)
[<c0122200>] (__warn) from [<c0122250>] (warn_slowpath_fmt+0x38/0x48)
[<c0122250>] (warn_slowpath_fmt) from [<c02ac178>] (sysfs_warn_dup+0x64/0x74)
[<c02ac178>] (sysfs_warn_dup) from [<c02ac254>] (sysfs_create_dir_ns+0x84/0x94)
[<c02ac254>] (sysfs_create_dir_ns) from [<c03c8b8c>] (kobject_add_internal+0x9c/0x2ec)
[<c03c8b8c>] (kobject_add_internal) from [<c03c8e24>] (kobject_add+0x48/0x98)
[<c03c8e24>] (kobject_add) from [<c048d75c>] (device_add+0xe4/0x5a0)
[<c048d75c>] (device_add) from [<c048ddb4>] (device_create_groups_vargs+0xac/0xbc)
[<c048ddb4>] (device_create_groups_vargs) from [<c048dde4>] (device_create_vargs+0x20/0x28)
[<c048dde4>] (device_create_vargs) from [<c02075c8>] (bdi_register_va+0x44/0xfc)
[<c02075c8>] (bdi_register_va) from [<c023d378>] (super_setup_bdi_name+0x48/0xa4)
[<c023d378>] (super_setup_bdi_name) from [<c0312ef4>] (nfs_fill_super+0x1a4/0x204)
[<c0312ef4>] (nfs_fill_super) from [<c03133f0>] (nfs_fs_mount_common+0x140/0x1e8)
[<c03133f0>] (nfs_fs_mount_common) from [<c03335cc>] (nfs4_remote_mount+0x50/0x58)
[<c03335cc>] (nfs4_remote_mount) from [<c023ef98>] (mount_fs+0x14/0xa4)
[<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128)
[<c025cba0>] (vfs_kern_mount) from [<c033352c>] (nfs_do_root_mount+0x80/0xa0)
[<c033352c>] (nfs_do_root_mount) from [<c0333818>] (nfs4_try_mount+0x28/0x3c)
[<c0333818>] (nfs4_try_mount) from [<c0313874>] (nfs_fs_mount+0x2cc/0x8c4)
[<c0313874>] (nfs_fs_mount) from [<c023ef98>] (mount_fs+0x14/0xa4)
[<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128)
[<c025cba0>] (vfs_kern_mount) from [<c02600f0>] (do_mount+0x158/0xc7c)
[<c02600f0>] (do_mount) from [<c0260f98>] (SyS_mount+0x8c/0xb4)
[<c0260f98>] (SyS_mount) from [<c0107840>] (ret_fast_syscall+0x0/0x3c)

Fix the problem by always creating new bdi for a superblock as we used
to do.

Reported-and-tested-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Fixes: 0d3b12584972ce5781179ad3f15cca3cdb5cae05
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 07:57:46 -06:00
Luis Henriques
eeca958dce ceph: fix memory leak in __ceph_setxattr()
The ceph_inode_xattr needs to be released when removing an xattr.  Easily
reproducible running the 'generic/020' test from xfstests or simply by
doing:

  attr -s attr0 -V 0 /mnt/test && attr -r attr0 /mnt/test

While there, also fix the error path.

Here's the kmemleak splat:

unreferenced object 0xffff88001f86fbc0 (size 64):
  comm "attr", pid 244, jiffies 4294904246 (age 98.464s)
  hex dump (first 32 bytes):
    40 fa 86 1f 00 88 ff ff 80 32 38 1f 00 88 ff ff  @........28.....
    00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
  backtrace:
    [<ffffffff81560199>] kmemleak_alloc+0x49/0xa0
    [<ffffffff810f3e5b>] kmem_cache_alloc+0x9b/0xf0
    [<ffffffff812b157e>] __ceph_setxattr+0x17e/0x820
    [<ffffffff812b1c57>] ceph_set_xattr_handler+0x37/0x40
    [<ffffffff8111fb4b>] __vfs_removexattr+0x4b/0x60
    [<ffffffff8111fd37>] vfs_removexattr+0x77/0xd0
    [<ffffffff8111fdd1>] removexattr+0x41/0x60
    [<ffffffff8111fe65>] path_removexattr+0x75/0xa0
    [<ffffffff81120aeb>] SyS_lremovexattr+0xb/0x10
    [<ffffffff81564b20>] entry_SYSCALL_64_fastpath+0x13/0x94
    [<ffffffffffffffff>] 0xffffffffffffffff

Cc: stable@vger.kernel.org
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:24 +02:00
Alexander Graf
f775ff7d89 ceph: fix file open flags on ppc64
The file open flags (O_foo) are platform specific and should never go
out to an interface that is not local to the system.

Unfortunately these flags have leaked out onto the wire in the cephfs
implementation. That lead to bogus flags getting transmitted on ppc64.

This patch converts the kernel view of flags to the ceph view of file
open flags.

Fixes: 124e68e74 ("ceph: file operations")
Signed-off-by: Alexander Graf <agraf@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:24 +02:00
Yan, Zheng
b50c2de51e ceph: choose readdir frag based on previous readdir reply
The dirfragtree is lazily updated, it's not always accurate. Infinite
loops happens in following circumstance.

- client send request to read frag A
- frag A has been fragmented into frag B and C. So mds fills the reply
  with contents of frag B
- client wants to read next frag C. ceph_choose_frag(frag value of C)
  return frag A.

The fix is using previous readdir reply to calculate next readdir frag
when possible.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:24 +02:00
Jeff Layton
26544c623e ceph: when seeing write errors on an inode, switch to sync writes
Currently, we don't have a real feedback mechanism in place for when we
start seeing buffered writeback errors. If writeback is failing, there
is nothing that prevents an application from continuing to dirty pages
that aren't being cleaned.

In the event that we're seeing write errors of any sort occur on an
inode, have the callback set a flag to force further writes to be
synchronous. When the next write succeeds, clear the flag to allow
buffered writeback to continue.

Since this is just a hint to the write submission mechanism, we only
take the i_ceph_lock when a lockless check shows that the flag needs to
be changed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:22 +02:00
Jeff Layton
6fc1fe5e4c Revert "ceph: SetPageError() for writeback pages if writepages fails"
This reverts commit b109eec6f4.

If I'm filling up a filesystem with this sort of command:

    $ dd if=/dev/urandom of=/mnt/cephfs/fillfile bs=2M oflag=sync

...then I'll eventually get back EIO on a write. Further calls
will give us ENOSPC.

I'm not sure what prompted this change, but I don't think it's what we
want to do. If writepages failed, we will have already set the mapping
error appropriately, and that's what gets reported by fsync() or
close().

__filemap_fdatawait_range however, does this:

	wait_on_page_writeback(page);
	if (TestClearPageError(page))
		ret = -EIO;

...and that -EIO ends up trumping the mapping's error if one exists.

When writepages fails, we only want to set the error in the mapping,
and not flag the individual pages.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:22 +02:00
Jeff Layton
92475f05bd ceph: handle epoch barriers in cap messages
Have the client store and update the osdc epoch_barrier when a cap
message comes in with one.

When sending cap messages, send the epoch barrier as well. This allows
clients to inform servers that their released caps may not be used until
a particular OSD map epoch.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Jeff Layton
a1f4020aab libceph: allow requests to return immediately on full conditions if caller wishes
Usually, when the osd map is flagged as full or the pool is at quota,
write requests just hang. This is not what we want for cephfs, where
it would be better to simply report -ENOSPC back to userland instead
of stalling.

If the caller knows that it will want an immediate error return instead
of blocking on a full or at-quota error condition then allow it to set a
flag to request that behavior.

Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
and on any other write request from ceph.ko.

A later patch will deal with requests that were submitted before the new
map showing the full condition came in.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Yan, Zheng
79162547b7 ceph: make seeky readdir more efficient
Current cephfs client uses string to indicate start position of
readdir. The string is last entry of previous readdir reply.
This approach does not work for seeky readdir because we can
not easily convert the new postion to a string. For seeky readdir,
mds needs to return dentries from the beginning. Client keeps
retrying if the reply does not contain the dentry it wants.

In current version of ceph, mds sorts CDentry in its cache in
hash order. Client also uses dentry hash to compose dir postion.
For seeky readdir, if client passes the hash part of dir postion
to mds. mds can avoid replying useless dentries.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Yan, Zheng
2827528da0 ceph: close stopped mds' session
If a mds has stopped, close its session and clean up its session
requests/caps. The process is similar to handling SESSION_CLOSE
initiated by mds.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Yan, Zheng
0a07fc8cd0 ceph: fix potential use-after-free
__unregister_session() free the session if it drops the last
reference. We should grab an extra reference if we want to use
session after __unregister_session().

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Yan, Zheng
76201b6354 ceph: allow connecting to mds whose rank >= mdsmap::m_max_mds
mdsmap::m_max_mds is the expected count of active mds. It's not the
max rank of active mds. User can decrease mdsmap::m_max_mds, but does
not stop mds whose rank >= mdsmap::m_max_mds.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Yan, Zheng
8242c9f35a ceph: fix wrong check in ceph_renew_caps()
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:19 +02:00
Elena Reshetova
0e1a5ee657 libceph: convert ceph_pagelist.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:19 +02:00
Elena Reshetova
805692d0e0 ceph: convert ceph_cap_snap.nref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Elena Reshetova
3997c01d26 ceph: convert ceph_mds_session.s_ref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Ilya Dryomov
74da4a0f57 libceph, ceph: always advertise all supported features
No reason to hide CephFS-specific features in the rbd case.  Recent
feature bits mix RADOS and CephFS-specific stuff together anyway.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Steve French
7db0a6efdc SMB3: Work around mount failure when using SMB3 dialect to Macs
Macs send the maximum buffer size in response on ioctl to validate
negotiate security information, which causes us to fail the mount
as the response buffer is larger than the expected response.

Changed ioctl response processing to allow for padding of validate
negotiate ioctl response and limit the maximum response size to
maximum buffer size.

Signed-off-by: Steve French <steve.french@primarydata.com>
CC: Stable <stable@vger.kernel.org>
2017-05-03 21:23:48 -05:00
Yunlei He
e9cdd30770 f2fs: fix a mount fail for wrong next_scan_nid
-write_checkpoint
   -do_checkpoint
      -next_free_nid    <--- something wrong with next free nid

-f2fs_fill_super
   -build_node_manager
      -build_free_nids
          -get_current_nat_page
             -__get_meta_page   <--- attempt to access beyond end of device

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 19:00:30 -07:00
Linus Torvalds
dd23f273d9 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few misc things

 - most of MM

 - KASAN updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
  kasan: separate report parts by empty lines
  kasan: improve double-free report format
  kasan: print page description after stacks
  kasan: improve slab object description
  kasan: change report header
  kasan: simplify address description logic
  kasan: change allocation and freeing stack traces headers
  kasan: unify report headers
  kasan: introduce helper functions for determining bug type
  mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
  mm: hwpoison: call shake_page() unconditionally
  mm/swapfile.c: fix swap space leak in error path of swap_free_entries()
  mm/gup.c: fix access_ok() argument type
  mm/truncate: avoid pointless cleancache_invalidate_inode() calls.
  mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty
  fs/block_dev: always invalidate cleancache in invalidate_bdev()
  fs: fix data invalidation in the cleancache during direct IO
  zram: reduce load operation in page_same_filled
  zram: use zram_free_page instead of open-coded
  zram: introduce zram data accessor
  ...
2017-05-03 17:55:59 -07:00
David Disseldorp
d8a6e505d6 cifs: fix CIFS_IOC_GET_MNT_INFO oops
An open directory may have a NULL private_data pointer prior to readdir.

Fixes: 0de1f4c6f6 ("Add way to query server fs info for smb3")
Cc: stable@vger.kernel.org
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-03 19:32:35 -05:00
Björn Jacke
b704e70b7c CIFS: fix mapping of SFM_SPACE and SFM_PERIOD
- trailing space maps to 0xF028
- trailing period maps to 0xF029

This fix corrects the mapping of file names which have a trailing character
that would otherwise be illegal (period or space) but is allowed by POSIX.

Signed-off-by: Bjoern Jacke <bjacke@samba.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-03 19:31:33 -05:00
Andrey Ryabinin
a5f6a6a9c7 fs/block_dev: always invalidate cleancache in invalidate_bdev()
invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
which doen't make any sense.

Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
regardless of mapping->nrpages value.

Fixes: c515e1fd36 ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:12 -07:00
Andrey Ryabinin
55635ba76e fs: fix data invalidation in the cleancache during direct IO
Patch series "Properly invalidate data in the cleancache", v2.

We've noticed that after direct IO write, buffered read sometimes gets
stale data which is coming from the cleancache.  The reason for this is
that some direct write hooks call call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero, so we may not invalidate
data in the cleancache.

Another odd thing is that we check only for ->nrpages and don't check
for ->nrexceptional, but invalidate_inode_pages2[_range] also
invalidates exceptional entries as well.  So we invalidate exceptional
entries only if ->nrpages != 0? This doesn't feel right.

 - Patch 1 fixes direct IO writes by removing ->nrpages check.
 - Patch 2 fixes similar case in invalidate_bdev().
     Note: I only fixed conditional cleancache_invalidate_inode() here.
       Do we also need to add ->nrexceptional check in into invalidate_bdev()?

 - Patches 3-4: some optimizations.

This patch (of 4):

Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero.  This can't be right,
because invalidate_inode_pages2[_range]() also invalidate data in the
cleancache via cleancache_invalidate_inode() call.  So if page cache is
empty but there is some data in the cleancache, buffered read after
direct IO write would get stale data from the cleancache.

Also it doesn't feel right to check only for ->nrpages because
invalidate_inode_pages2[_range] invalidates exceptional entries as well.

Fix this by calling invalidate_inode_pages2[_range]() regardless of
nrpages state.

Note: nfs,cifs,9p doesn't need similar fix because the never call
cleancache_get_page() (nor directly, nor via mpage_readpage[s]()), so
they are not affected by this bug.

Fixes: c515e1fd36 ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:12 -07:00
Michal Hocko
eb52da3f48 jbd2: make the whole kjournald2 kthread NOFS safe
kjournald2 is central to the transaction commit processing.  As such any
potential allocation from this kernel thread has to be GFP_NOFS.  Make
sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170306131408.9828-8-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:09 -07:00
Michal Hocko
81378da64d jbd2: mark the transaction context with the scope GFP_NOFS context
now that we have memalloc_nofs_{save,restore} api we can mark the whole
transaction context as implicitly GFP_NOFS.  All allocations will
automatically inherit GFP_NOFS this way.  This means that we do not have
to mark any of those requests with GFP_NOFS and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.

[akpm@linux-foundation.org: tweak comments]
Link: http://lkml.kernel.org/r/20170306131408.9828-7-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:09 -07:00
Michal Hocko
9ba1fb2c60 xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
API to prevent from reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from the NOFS contexts.  The memalloc_noio_save will enforce
GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
unnecessary.  Let's use memalloc_nofs_{save,restore} instead as it
should provide exactly what we need here - implicit GFP_NOFS context.

Link: http://lkml.kernel.org/r/20170306131408.9828-6-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:09 -07:00
Michal Hocko
7dea19f9ee mm: introduce memalloc_nofs_{save,restore} API
GFP_NOFS context is used for the following 5 reasons currently:

 - to prevent from deadlocks when the lock held by the allocation
   context would be needed during the memory reclaim

 - to prevent from stack overflows during the reclaim because the
   allocation is performed from a deep context already

 - to prevent lockups when the allocation context depends on other
   reclaimers to make a forward progress indirectly

 - just in case because this would be safe from the fs POV

 - silence lockdep false positives

Unfortunately overuse of this allocation context brings some problems to
the MM.  Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily.  We would like to get rid of
those as much as possible.  One way to do that is to use the flag in
scopes rather than isolated cases.  Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope of
GFP_NOFS allocation context.  This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO.  The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no more PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO.  memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.  Xfs code paths preserve
their semantic.  kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp.  ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

[akpm@linux-foundation.org: fix comment typo, reflow comment]
Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:09 -07:00
Michal Hocko
9070733b4e xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago.  We would like to make this concept more generic and use
it for other filesystems as well.  Let's start by giving the flag a more
generic name PF_MEMALLOC_NOFS which is in line with an exiting
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts.  Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:09 -07:00
Shaohua Li
cf8496ea80 proc: show MADV_FREE pages info in smaps
Show MADV_FREE pages info of each vma in smaps.  The interface is for
diganose or monitoring purpose, userspace could use it to understand
what happens in the application.  Since userspace could dirty MADV_FREE
pages without notice from kernel, this interface is the only place we
can get accurate accounting info about MADV_FREE pages.

[mhocko@kernel.org: update Documentation/filesystems/proc.txt]
Link: http://lkml.kernel.org/r/89efde633559de1ec07444f2ef0f4963a97a2ce8.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:08 -07:00
Geliang Tang
d47736fafe fs/ocfs2/cluster: use offset_in_page() macro
Use offset_in_page() macro instead of open-coding.

Link: http://lkml.kernel.org/r/4dbc77ccaaed98b183cf4dba58a4fa325fd65048.1492758503.git.geliangtang@gmail.com
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:07 -07:00
Junxiao Bi
33496c3c3d ocfs2: o2hb: revert hb threshold to keep compatible
Configfs is the interface for ocfs2-tools to set configure to kernel and
$configfs_dir/cluster/$clustername/heartbeat/dead_threshold is the one
used to configure heartbeat dead threshold.  Kernel has a default value
of it but user can set O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb
to override it.

Commit 45b997737a ("ocfs2/cluster: use per-attribute show and store
methods") changed heartbeat dead threshold name while ocfs2-tools did
not, so ocfs2-tools won't set this configurable and the default value is
always used.  So revert it.

Fixes: 45b997737a ("ocfs2/cluster: use per-attribute show and store methods")
Link: http://lkml.kernel.org/r/1490665245-15374-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:07 -07:00
Geliang Tang
667b8a37f3 fs/ocfs2/cluster: use setup_timer
Use setup_timer() instead of init_timer() to simplify the code.

Link: http://lkml.kernel.org/r/5e75bf07beb91e092d5aa36c36769949a480456a.1489060564.git.geliangtang@gmail.com
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:07 -07:00
Chao Yu
a72d4b97bb f2fs: relocate inode_{,un}lock in F2FS_IOC_SETFLAGS
This patch expands cover region of inode->i_rwsem to keep setting flag
atomically.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 14:30:19 -07:00
Jan Kara
3adc5fcb7e f2fs: Make flush bios explicitely sync
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions.

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

Fixes: b685d3d65a
Cc: stable@vger.kernel.org # 4.9+
CC: Jaegeuk Kim <jaegeuk@kernel.org>
CC: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 14:30:18 -07:00
Darrick J. Wong
fe0be23e68 xfs: reserve enough blocks to handle btree splits when remapping
In xfs_reflink_end_cow, we erroneously reserve only enough blocks to
handle adding 1 extent.  This is problematic if we fragment free space,
have to do CoW, and then have to perform multiple bmap btree expansions.
Furthermore, the BUI recovery routine doesn't reserve /any/ blocks to
handle btree splits, so log recovery fails after our first error causes
the filesystem to go down.

Therefore, refactor the transaction block reservation macros until we
have a macro that works for our deferred (re)mapping activities, and fix
both problems by using that macro.

With 1k blocks we can hit this fairly often in g/187 if the scratch fs
is big enough.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-05-03 13:21:40 -07:00
Linus Torvalds
a3719f34fd Merge branch 'generic' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, reiserfs, udf and ext2 updates from Jan Kara:
 "The branch contains changes to quota code so that it does not modify
  persistent flags in inode->i_flags (it was the only place in kernel
  doing that) and handle it inside filesystem's quotaon/off handlers
  instead.

  The branch also contains two UDF cleanups, a couple of reiserfs fixes
  and one fix for ext2 quota locking"

* 'generic' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext4: Improve comments in ext4_quota_{on|off}()
  udf: use kmap_atomic for memcpy copying
  udf: use octal for permissions
  quota: Remove dquot_quotactl_ops
  reiserfs: Remove i_attrs_to_sd_attrs()
  reiserfs: Remove useless setting of i_flags
  jfs: Remove jfs_get_inode_flags()
  ext2: Remove ext2_get_inode_flags()
  ext4: Remove ext4_get_inode_flags()
  quota: Stop setting IMMUTABLE and NOATIME flags on quota files
  jfs: Set flags on quota files directly
  ext2: Set flags on quota files directly
  reiserfs: Set flags on quota files directly
  ext4: Set flags on quota files directly
  reiserfs: Protect dquot_writeback_dquots() by s_umount semaphore
  reiserfs: Make cancel_old_flush() reliable
  ext2: Call dquot_writeback_dquots() with s_umount held
  reiserfs: avoid a -Wmaybe-uninitialized warning
2017-05-03 11:35:47 -07:00
Linus Torvalds
5133cd7518 Merge branch 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
 "The branch contains mainly a rework of fsnotify infrastructure fixing
  a shortcoming that we have waited for response to fanotify permission
  events with SRCU read lock held and when the process consuming events
  was slow to respond the kernel has stalled.

  It also contains several cleanups of unnecessary indirections in
  fsnotify framework and a bugfix from Amir fixing leakage of kernel
  internal errno to userspace"

* 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (37 commits)
  fanotify: don't expose EOPENSTALE to userspace
  fsnotify: remove a stray unlock
  fsnotify: Move ->free_mark callback to fsnotify_ops
  fsnotify: Add group pointer in fsnotify_init_mark()
  fsnotify: Drop inode_mark.c
  fsnotify: Remove fsnotify_find_{inode|vfsmount}_mark()
  fsnotify: Remove fsnotify_detach_group_marks()
  fsnotify: Rename fsnotify_clear_marks_by_group_flags()
  fsnotify: Inline fsnotify_clear_{inode|vfsmount}_mark_group()
  fsnotify: Remove fsnotify_recalc_{inode|vfsmount}_mask()
  fsnotify: Remove fsnotify_set_mark_{,ignored_}mask_locked()
  fanotify: Release SRCU lock when waiting for userspace response
  fsnotify: Pass fsnotify_iter_info into handle_event handler
  fsnotify: Provide framework for dropping SRCU lock in ->handle_event
  fsnotify: Remove special handling of mark destruction on group shutdown
  fsnotify: Detach mark from object list when last reference is dropped
  fsnotify: Move queueing of mark for destruction into fsnotify_put_mark()
  inotify: Do not drop mark reference under idr_lock
  fsnotify: Free fsnotify_mark_connector when there is no mark attached
  fsnotify: Lock object list with connector lock
  ...
2017-05-03 11:05:15 -07:00
Jaegeuk Kim
5b0ef73c9d f2fs: show available_nids in f2fs/status
This patch adds an entry in f2fs/status to show # of available nids.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:57 -07:00
Jaegeuk Kim
1c0f4bf5c3 f2fs: flush dirty nats periodically
This patch flushes dirty nats in order to acquire available nids by writing
checkpoint. Otherwise, we can have no chance to get freed nids.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:56 -07:00
Chao Yu
1f43e2ad7b f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard
Introduce CP_TRIMMED_FLAG to indicate all invalid block were trimmed
before umount, so once we do mount with image which contain the flag,
we don't record invalid blocks as undiscard one, when fstrim is being
triggered, we can avoid issuing redundant discard commands.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:56 -07:00
Chao Yu
c473f1a965 f2fs: allow cpc->reason to indicate more than one reason
Change to use different bits of cpc->reason to indicate different status,
so cpc->reason can indicate more than one reason.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:55 -07:00
Hou Pengyang
279d6df20c f2fs: release cp and dnode lock before IPU
We don't need to rewrite the page under cp_rwsem and dnode locks.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:54 -07:00
Fred Isaman
c296cfe26b pNFS: Fix NULL dereference in pnfs_generic_alloc_ds_commits
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-03 12:29:41 -04:00
Linus Torvalds
0302e28dee Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "Highlights:

  IMA:
   - provide ">" and "<" operators for fowner/uid/euid rules

  KEYS:
   - add a system blacklist keyring

   - add KEYCTL_RESTRICT_KEYRING, exposes keyring link restriction
     functionality to userland via keyctl()

  LSM:
   - harden LSM API with __ro_after_init

   - add prlmit security hook, implement for SELinux

   - revive security_task_alloc hook

  TPM:
   - implement contextual TPM command 'spaces'"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (98 commits)
  tpm: Fix reference count to main device
  tpm_tis: convert to using locality callbacks
  tpm: fix handling of the TPM 2.0 event logs
  tpm_crb: remove a cruft constant
  keys: select CONFIG_CRYPTO when selecting DH / KDF
  apparmor: Make path_max parameter readonly
  apparmor: fix parameters so that the permission test is bypassed at boot
  apparmor: fix invalid reference to index variable of iterator line 836
  apparmor: use SHASH_DESC_ON_STACK
  security/apparmor/lsm.c: set debug messages
  apparmor: fix boolreturn.cocci warnings
  Smack: Use GFP_KERNEL for smk_netlbl_mls().
  smack: fix double free in smack_parse_opts_str()
  KEYS: add SP800-56A KDF support for DH
  KEYS: Keyring asymmetric key restrict method with chaining
  KEYS: Restrict asymmetric key linkage using a specific keychain
  KEYS: Add a lookup_restriction function for the asymmetric key type
  KEYS: Add KEYCTL_RESTRICT_KEYRING
  KEYS: Consistent ordering for __key_link_begin and restrict check
  KEYS: Add an optional lookup_restriction hook to key_type
  ...
2017-05-03 08:50:52 -07:00
Josef Bacik
563f40019d fs: don't set *REFERENCED on single use objects
By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
inode we create.  This is problematic as this means that it takes two
trips through the LRU for any of these objects to be reclaimed,
regardless of their actual lifetime.  With enough pressure from these
caches we can easily evict our working set from page cache with single
use objects.  So instead only set *REFERENCED if we've already been
added to the LRU list.  This means that we've been touched since the
first time we were accessed, and so more likely to need to hang out in
cache.

To illustrate this issue I wrote the following scripts

https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure

on my test box.  It is a single socket 4 core CPU with 16gib of RAM and
I tested on an Intel 2tib NVME drive.  The cache-pressure.sh script
creates a new file system and creates 2 6.5gib files in order to take up
13gib of the 16gib of ram with pagecache.  Then it runs a test program
that reads these 2 files in a loop, and keeps track of how often it has
to read bytes for each loop.  On an ideal system with no pressure we
should have to read 0 bytes indefinitely.  The second thing this script
does is start a fs_mark job that creates a ton of 0 length files,
putting pressure on the system with slab only allocations.  On exit the
script prints out how many bytes were read by the read-file program.
The results are as follows

Without patch:
/mnt/btrfs-test/reads/file1: total read during loops 27262988288
/mnt/btrfs-test/reads/file2: total read during loops 27262976000

With patch:
/mnt/btrfs-test/reads/file2: total read during loops 18640457728
/mnt/btrfs-test/reads/file1: total read during loops 9565376512

This patch results in a 50% reduction of the amount of pages evicted
from our working set.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-03 11:47:05 -04:00
Rabin Vincent
3998e6b87d CIFS: fix oplock break deadlocks
When the final cifsFileInfo_put() is called from cifsiod and an oplock
break work is queued, lockdep complains loudly:

 =============================================
 [ INFO: possible recursive locking detected ]
 4.11.0+ #21 Not tainted
 ---------------------------------------------
 kworker/0:2/78 is trying to acquire lock:
  ("cifsiod"){++++.+}, at: flush_work+0x215/0x350

 but task is already holding lock:
  ("cifsiod"){++++.+}, at: process_one_work+0x255/0x8e0

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock("cifsiod");
   lock("cifsiod");

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 2 locks held by kworker/0:2/78:
  #0:  ("cifsiod"){++++.+}, at: process_one_work+0x255/0x8e0
  #1:  ((&wdata->work)){+.+...}, at: process_one_work+0x255/0x8e0

 stack backtrace:
 CPU: 0 PID: 78 Comm: kworker/0:2 Not tainted 4.11.0+ #21
 Workqueue: cifsiod cifs_writev_complete
 Call Trace:
  dump_stack+0x85/0xc2
  __lock_acquire+0x17dd/0x2260
  ? match_held_lock+0x20/0x2b0
  ? trace_hardirqs_off_caller+0x86/0x130
  ? mark_lock+0xa6/0x920
  lock_acquire+0xcc/0x260
  ? lock_acquire+0xcc/0x260
  ? flush_work+0x215/0x350
  flush_work+0x236/0x350
  ? flush_work+0x215/0x350
  ? destroy_worker+0x170/0x170
  __cancel_work_timer+0x17d/0x210
  ? ___preempt_schedule+0x16/0x18
  cancel_work_sync+0x10/0x20
  cifsFileInfo_put+0x338/0x7f0
  cifs_writedata_release+0x2a/0x40
  ? cifs_writedata_release+0x2a/0x40
  cifs_writev_complete+0x29d/0x850
  ? preempt_count_sub+0x18/0xd0
  process_one_work+0x304/0x8e0
  worker_thread+0x9b/0x6a0
  kthread+0x1b2/0x200
  ? process_one_work+0x8e0/0x8e0
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x31/0x40

This is a real warning.  Since the oplock is queued on the same
workqueue this can deadlock if there is only one worker thread active
for the workqueue (which will be the case during memory pressure when
the rescuer thread is handling it).

Furthermore, there is at least one other kind of hang possible due to
the oplock break handling if there is only worker.  (This can be
reproduced without introducing memory pressure by having passing 1 for
the max_active parameter of cifsiod.) cifs_oplock_break() can wait
indefintely in the filemap_fdatawait() while the cifs_writev_complete()
work is blocked:

 sysrq: SysRq : Show Blocked State
   task                        PC stack   pid father
 kworker/0:1     D    0    16      2 0x00000000
 Workqueue: cifsiod cifs_oplock_break
 Call Trace:
  __schedule+0x562/0xf40
  ? mark_held_locks+0x4a/0xb0
  schedule+0x57/0xe0
  io_schedule+0x21/0x50
  wait_on_page_bit+0x143/0x190
  ? add_to_page_cache_lru+0x150/0x150
  __filemap_fdatawait_range+0x134/0x190
  ? do_writepages+0x51/0x70
  filemap_fdatawait_range+0x14/0x30
  filemap_fdatawait+0x3b/0x40
  cifs_oplock_break+0x651/0x710
  ? preempt_count_sub+0x18/0xd0
  process_one_work+0x304/0x8e0
  worker_thread+0x9b/0x6a0
  kthread+0x1b2/0x200
  ? process_one_work+0x8e0/0x8e0
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x31/0x40
 dd              D    0   683    171 0x00000000
 Call Trace:
  __schedule+0x562/0xf40
  ? mark_held_locks+0x29/0xb0
  schedule+0x57/0xe0
  io_schedule+0x21/0x50
  wait_on_page_bit+0x143/0x190
  ? add_to_page_cache_lru+0x150/0x150
  __filemap_fdatawait_range+0x134/0x190
  ? do_writepages+0x51/0x70
  filemap_fdatawait_range+0x14/0x30
  filemap_fdatawait+0x3b/0x40
  filemap_write_and_wait+0x4e/0x70
  cifs_flush+0x6a/0xb0
  filp_close+0x52/0xa0
  __close_fd+0xdc/0x150
  SyS_close+0x33/0x60
  entry_SYSCALL_64_fastpath+0x1f/0xbe

 Showing all locks held in the system:
 2 locks held by kworker/0:1/16:
  #0:  ("cifsiod"){.+.+.+}, at: process_one_work+0x255/0x8e0
  #1:  ((&cfile->oplock_break)){+.+.+.}, at: process_one_work+0x255/0x8e0

 Showing busy workqueues and worker pools:
 workqueue cifsiod: flags=0xc
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
     in-flight: 16:cifs_oplock_break
     delayed: cifs_writev_complete, cifs_echo_request
 pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 750 3

Fix these problems by creating a a new workqueue (with a rescuer) for
the oplock break work.

Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2017-05-03 10:10:10 -05:00
David Disseldorp
6026685de3 cifs: fix CIFS_ENUMERATE_SNAPSHOTS oops
As with 618763958b, an open directory may have a NULL private_data
pointer prior to readdir. CIFS_ENUMERATE_SNAPSHOTS must check for this
before dereference.

Fixes: 834170c859 ("Enable previous version support")
Signed-off-by: David Disseldorp <ddiss@suse.de>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-03 09:59:20 -05:00
David Disseldorp
0e5c795592 cifs: fix leak in FSCTL_ENUM_SNAPS response handling
The server may respond with success, and an output buffer less than
sizeof(struct smb_snapshot_array) in length. Do not leak the output
buffer in this case.

Fixes: 834170c859 ("Enable previous version support")
Signed-off-by: David Disseldorp <ddiss@suse.de>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-03 09:54:12 -05:00
Chao Yu
9a744b92da f2fs: shrink size of struct discard_cmd
In order to shrink size of struct discard_cmd, change variable type of
@state in struct discard_cmd from int to unsigned char.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:51 -07:00
Chao Yu
ec9895add2 f2fs: don't hold cmd_lock during waiting discard command
Previously, with protection of cmd_lock, we will wait for end io of
discard command which potentially may lead long latency, making worse
concurrency.

So, in this patch, we try to add reference into discard entry to prevent
the entry being released by other thread, then we can avoid holding
global cmd_lock during waiting discard to finish.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:50 -07:00
Jaegeuk Kim
4d97807813 f2fs: nullify fio->encrypted_page for each writes
This makes sure each write request has nullified encrypted_page pointer.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:49 -07:00
Jin Qian
b9dd46188e f2fs: sanity check segment count
F2FS uses 4 bytes to represent block address. As a result, supported
size of disk is 16 TB and it equals to 16 * 1024 * 1024 / 2 segments.

Signed-off-by: Jin Qian <jinqian@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:48 -07:00
Jaegeuk Kim
a817737e87 f2fs: introduce valid_ipu_blkaddr to clean up
This patch introduces valid_ipu_blkaddr to clean up checking block address for
inplace-update.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:48 -07:00
Hou Pengyang
e959c8f543 f2fs: lookup extent cache first under IPU scenario
If a page is cold, NOT atomit written and need_ipu now, there is
a high probability that IPU should be adapted. For IPU, we try to
check extent tree to get the block index first, instead of reading
the dnode page, where may lead to an useless dnode IO, since no need to
update the dnode index for IPU.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:47 -07:00
Hou Pengyang
7eab0c0df8 f2fs: reconstruct code to write a data page
This patch introduces encrypt_one_page which encrypts one data page before
submit_bio, and change the use of need_inplace_update.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:46 -07:00
Chao Yu
63a94fa1d7 f2fs: introduce __wait_discard_cmd
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:45 -07:00
Chao Yu
bd5b07383a f2fs: introduce __issue_discard_cmd
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:44 -07:00
Linus Torvalds
89c9fea3c8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  tty: fix comment for __tty_alloc_driver()
  init/main: properly align the multi-line comment
  init/main: Fix double "the" in comment
  Fix dead URLs to ftp.kernel.org
  drivers: Clean up duplicated email address
  treewide: Fix typo in xml/driver-api/basics.xml
  tools/testing/selftests/powerpc: remove redundant CFLAGS in Makefile: "-Wall -O2 -Wall" -> "-O2 -Wall"
  selftests/timers: Spelling s/privledges/privileges/
  HID: picoLCD: Spelling s/REPORT_WRTIE_MEMORY/REPORT_WRITE_MEMORY/
  net: phy: dp83848: Fix Typo
  UBI: Fix typos
  Documentation: ftrace.txt: Correct nice value of 120 priority
  net: fec: Fix typo in error msg and comment
  treewide: Fix typos in printk
2017-05-02 19:09:35 -07:00
Linus Torvalds
76f1948a79 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatch updates from Jiri Kosina:

 - a per-task consistency model is being added for architectures that
   support reliable stack dumping (extending this, currently rather
   trivial set, is currently in the works).

   This extends the nature of the types of patches that can be applied
   by live patching infrastructure. The code stems from the design
   proposal made [1] back in November 2014. It's a hybrid of SUSE's
   kGraft and RH's kpatch, combining advantages of both: it uses
   kGraft's per-task consistency and syscall barrier switching combined
   with kpatch's stack trace switching. There are also a number of
   fallback options which make it quite flexible.

   Most of the heavy lifting done by Josh Poimboeuf with help from
   Miroslav Benes and Petr Mladek

   [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

 - module load time patch optimization from Zhou Chengming

 - a few assorted small fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: add missing printk newlines
  livepatch: Cancel transition a safe way for immediate patches
  livepatch: Reduce the time of finding module symbols
  livepatch: make klp_mutex proper part of API
  livepatch: allow removal of a disabled patch
  livepatch: add /proc/<pid>/patch_state
  livepatch: change to a per-task consistency model
  livepatch: store function sizes
  livepatch: use kstrtobool() in enabled_store()
  livepatch: move patching functions into patch.c
  livepatch: remove unnecessary object loaded check
  livepatch: separate enabled and patched states
  livepatch/s390: add TIF_PATCH_PENDING thread flag
  livepatch/s390: reorganize TIF thread flag bits
  livepatch/powerpc: add TIF_PATCH_PENDING thread flag
  livepatch/x86: add TIF_PATCH_PENDING thread flag
  livepatch: create temporary klp_update_patch_state() stub
  x86/entry: define _TIF_ALLWORK_MASK flags explicitly
  stacktrace/x86: add function for detecting reliable stack traces
2017-05-02 18:24:16 -07:00
Linus Torvalds
8d65b08deb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Millar:
 "Here are some highlights from the 2065 networking commits that
  happened this development cycle:

   1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)

   2) Add a generic XDP driver, so that anyone can test XDP even if they
      lack a networking device whose driver has explicit XDP support
      (me).

   3) Sparc64 now has an eBPF JIT too (me)

   4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
      Starovoitov)

   5) Make netfitler network namespace teardown less expensive (Florian
      Westphal)

   6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)

   7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)

   8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)

   9) Multiqueue support in stmmac driver (Joao Pinto)

  10) Remove TCP timewait recycling, it never really could possibly work
      well in the real world and timestamp randomization really zaps any
      hint of usability this feature had (Soheil Hassas Yeganeh)

  11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
      Aleksandrov)

  12) Add socket busy poll support to epoll (Sridhar Samudrala)

  13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
      and several others)

  14) IPSEC hw offload infrastructure (Steffen Klassert)"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
  tipc: refactor function tipc_sk_recv_stream()
  tipc: refactor function tipc_sk_recvmsg()
  net: thunderx: Optimize page recycling for XDP
  net: thunderx: Support for XDP header adjustment
  net: thunderx: Add support for XDP_TX
  net: thunderx: Add support for XDP_DROP
  net: thunderx: Add basic XDP support
  net: thunderx: Cleanup receive buffer allocation
  net: thunderx: Optimize CQE_TX handling
  net: thunderx: Optimize RBDR descriptor handling
  net: thunderx: Support for page recycling
  ipx: call ipxitf_put() in ioctl error path
  net: sched: add helpers to handle extended actions
  qed*: Fix issues in the ptp filter config implementation.
  qede: Fix concurrency issue in PTP Tx path processing.
  stmmac: Add support for SIMATIC IOT2000 platform
  net: hns: fix ethtool_get_strings overflow in hns driver
  tcp: fix wraparound issue in tcp_lp
  bpf, arm64: fix jit branch offset related to ldimm64
  bpf, arm64: implement jiting of BPF_XADD
  ...
2017-05-02 16:40:27 -07:00
Steve French
26c9cb668c Set unicode flag on cifs echo request to avoid Mac error
Mac requires the unicode flag to be set for cifs, even for the smb
echo request (which doesn't have strings).

Without this Mac rejects the periodic echo requests (when mounting
with cifs) that we use to check if server is down

Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2017-05-02 14:57:34 -05:00
Pavel Shilovsky
c610c4b619 CIFS: Add asynchronous write support through kernel AIO
This patch adds support to process write calls passed by io_submit()
asynchronously. It based on the previously introduced async context
that allows to process i/o responses in a separate thread and
return the caller immediately for asynchronous calls.

This improves writing performance of single threaded applications
with increasing of i/o queue depth size.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Pavel Shilovsky
6685c5e2d1 CIFS: Add asynchronous read support through kernel AIO
This patch adds support to process read calls passed by io_submit()
asynchronously. It based on the previously introduced async context
that allows to process i/o responses in a separate thread and
return the caller immediately for asynchronous calls.

This improves reading performance of single threaded applications
with increasing of i/o queue depth size.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Pavel Shilovsky
ccf7f4088a CIFS: Add asynchronous context to support kernel AIO
Currently the code doesn't recognize asynchronous calls passed
by io_submit() and processes all calls synchronously. This is not
what kernel AIO expects. This patch introduces a new async context
that keeps track of all issued i/o requests and moves a response
collecting procedure to a separate thread. This allows to return
to a caller immediately for async calls and call iocb->ki_complete()
once all requests are completed. For sync calls the current thread
simply waits until all requests are completed.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Daniel N Pettersson
29bb3158cf cifs: fix IPv6 link local, with scope id, address parsing
When the IP address is gotten from the UNC, use only the address part
of the UNC. Else all after the percent sign in an IPv6 link local
address is interpreted as a scope id. This includes the slash and
share name. A scope id is expected to be an integer and any trailing
characters makes the conversion to integer fail.
Example of mount command that fails:
mount -i -t cifs //fe80::6a05:caff:fe3e:8ffc%2/test /mnt/t -o sec=none

Signed-off-by: Daniel N Pettersson <danielnp@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Dan Carpenter
564277ecee cifs: small underflow in cnvrtDosUnixTm()
January is month 1.  There is no zero-th month.  If someone passes a
zero month then it means we read from one space before the start of the
total_days_of_prev_months[] array.

We may as well also be strict about days as well.

Fixes: 1bd5bbcb65 ("[CIFS] Legacy time handling for Win9x and OS/2 part 1")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Linus Torvalds
204f144c9f Merge branch 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fs/compat.c cleanups from Al Viro:
 "More moving of compat syscalls from fs/compat.c to fs/*.c where the
  native counterparts live.

  And death to compat_sys_getdents64() - the only architecture that used
  to need it was ia64, and _that_ has lost biarch support quite a few
  years ago"

* 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/compat.c: trim unused includes
  move compat_rw_copy_check_uvector() over to fs/read_write.c
  fhandle: move compat syscalls from compat.c
  open: move compat syscalls from compat.c
  stat: move compat syscalls from compat.c
  fcntl: move compat syscalls from compat.c
  readdir: move compat syscalls from compat.c
  statfs: move compat syscalls from compat.c
  utimes: move compat syscalls from compat.c
  move compat select-related syscalls to fs/select.c
  Remove compat_sys_getdents64()
2017-05-02 11:54:26 -07:00
Linus Torvalds
da7b66ffb2 Merge branch 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull splice updates from Al Viro:
 "These actually missed the last cycle; the branch itself is from last
  December"

* 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make nr_pages calculation in default_file_splice_read() a bit less ugly
  splice/tee/vmsplice: validate flags
  splice_pipe_desc: kill ->flags
  remove spd_release_page()
2017-05-02 11:38:06 -07:00
Linus Torvalds
5b13475a5e Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
 "Cleanups that sat in -next + -stable fodder that has just missed 4.11.

  There's more iov_iter work in my local tree, but I'd prefer to push
  the stuff that had been in -next first"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  iov_iter: don't revert iov buffer if csum error
  generic_file_read_iter(): make use of iov_iter_revert()
  generic_file_direct_write(): make use of iov_iter_revert()
  orangefs: use iov_iter_revert()
  sctp: switch to copy_from_iter_full()
  net/9p: switch to copy_from_iter_full()
  switch memcpy_from_msg() to copy_from_iter_full()
  rds: make use of iov_iter_revert()
2017-05-02 11:18:50 -07:00
Linus Torvalds
6fd4e7f774 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "Three cifs/smb3 fixes - including two for stable"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: don't check for failure from mempool_alloc()
  Do not return number of bytes written for ioctl CIFS_IOC_COPYCHUNK_FILE
  Fix match_prepath()
2017-05-02 11:16:29 -07:00
Linus Torvalds
2575be8ad3 - constify compression structures; Bhumika Goyal
- restore powerpc dumping; Ankit Kumar
 - fix more bugs in the rarely exercises module unloading logic
 - reorganize filesystem locking to fix problems noticed by lockdep
 - refactor internal pstore APIs to make development and review easier:
   - improve error reporting
   - add kernel-doc structure and function comments
   - avoid insane argument passing by using a common record structure
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJZB3wzAAoJEIly9N/cbcAmVQAP/3+GzoxUcL43ypsDa1CmCsFN
 l2roQjzWLNGfHgq5qkS/mNtrUdEvBMUBd2oyhHcaiqM0DsuuO3rKTp6dZ8oczYjN
 6GoTmU8ZwIPze3VEadNPjCIdpsfTMNKtZvVJCTrWnsgXxTawDS89qqr7SCs3qhBS
 Dm1E8oX77YyhOKoGA6O3CJxpdm/Ge4+KpPR6Uwj90eVro04vYoiwnBjyLUzE7w1l
 JcXEGEh1t5NjUxHeMwW7HZwJYZfA3DQ6I3MOOzGhf9tsKp6J0LQTTV8PMSEo1mif
 mLZhDBy8BBlunL+b2Tp3+c4QItGSHkBCWASI2RLa2TM7xvL67oC+qm/WaUyoRovy
 hllEG96rsCs3Zx7fFFsfQCwURcTWfJQMrD+0d/fM+P2ylWvgp+KU6PeLTS9IHu6M
 3n6i5i6A6OY/QvmZr1tN/06kUBjtQmo8EgQ0jxoxAlWyNcJqi93hmJyaRW28KxjS
 tjFTNLZMrslj0UDmjiD6fIuaT6gsGDB+3wAMPVAf+iV/k/2GUlj3ZILe4RaABAe9
 8xaUu11tZ5sTniayZ+10bA+6+K5n7uTlgU8RfFgaUZoRAzHgtyijOmdo6N+HILfK
 klv59B1Fmf6JpDlq7L9vurOqE82FAWFn4DruFM2bAaky2meFUNbYFiNfwK4l6lPI
 pmAgpdgRRvNMBCEmbVfv
 =S14G
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore updates from Kees Cook:
 "This has a large internal refactoring along with several smaller
  fixes.

   - constify compression structures; Bhumika Goyal

   - restore powerpc dumping; Ankit Kumar

   - fix more bugs in the rarely exercises module unloading logic

   - reorganize filesystem locking to fix problems noticed by lockdep

   - refactor internal pstore APIs to make development and review
     easier:
      - improve error reporting
      - add kernel-doc structure and function comments
      - avoid insane argument passing by using a common record
        structure"

* tag 'pstore-v4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits)
  pstore: Solve lockdep warning by moving inode locks
  pstore: Fix flags to enable dumps on powerpc
  pstore: Remove unused vmalloc.h in pmsg
  pstore: simplify write_user_compat()
  pstore: Remove write_buf() callback
  pstore: Replace arguments for write_buf_user() API
  pstore: Replace arguments for write_buf() API
  pstore: Replace arguments for erase() API
  pstore: Do not duplicate record metadata
  pstore: Allocate records on heap instead of stack
  pstore: Pass record contents instead of copying
  pstore: Always allocate buffer for decompression
  pstore: Replace arguments for write() API
  pstore: Replace arguments for read() API
  pstore: Switch pstore_mkfile to pass record
  pstore: Move record decompression to function
  pstore: Extract common arguments into structure
  pstore: Add kernel-doc for struct pstore_info
  pstore: Improve register_pstore() error reporting
  pstore: Avoid race in module unloading
  ...
2017-05-02 10:35:45 -07:00
Trond Myklebust
5f0114832a pNFS: Fix a typo in pnfs_generic_alloc_ds_commits
If the layout segment is invalid, we want to just resend the remaining
writes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-02 12:35:34 -04:00
Trond Myklebust
61f454e30c pNFS: Fix a deadlock when coalescing writes and returning the layout
Consider the following deadlock:

Process P1	Process P2		Process P3
==========	==========		==========
					lock_page(page)

		lseg = pnfs_update_layout(inode)

lo = NFS_I(inode)->layout
pnfs_error_mark_layout_for_return(lo)

		lock_page(page)

					lseg = pnfs_update_layout(inode)

In this scenario,
- P1 has declared the layout to be in error, but P2 holds a reference to
  a layout segment on that inode, so the layoutreturn is deferred.
- P2 is waiting for a page lock held by P3.
- P3 is asking for a new layout segment, but is blocked waiting
  for the layoutreturn.

The fix is to ensure that pnfs_error_mark_layout_for_return() does
not set the NFS_LAYOUT_RETURN flag, which blocks P3. Instead, we allow
the latter to call LAYOUTGET so that it can make progress and unblock
P2.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-02 12:35:33 -04:00
Trond Myklebust
5466d21411 pNFS: Don't clear the layout return info if there are segments to return
In pnfs_clear_layoutreturn_info, ensure that we don't clear the layout
return info if there are new segments queued for return due to, for
instance, a race between a LAYOUTRETURN and a failed I/O attempt.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-02 12:35:33 -04:00
Eric Biggers
aa1dca3bd9 ext4: inherit encryption xattr before other xattrs
When using both encryption and SELinux (or another feature that requires
an xattr per file) on a filesystem with 256-byte inodes, each file's
xattrs usually spill into an external xattr block.  Currently, the
xattrs are inherited in the order ACL, security, then encryption.
Therefore, if spillage occurs, the encryption xattr will always end up
in the external block.  This is not ideal because the encryption xattrs
contain a nonce, so they will always be unique and will prevent the
external xattr blocks from being deduplicated.

To improve the situation, change the inheritance order to encryption,
ACL, then security.  This gives the encryption xattr a better chance to
be stored in-inode, allowing the other xattr(s) to be deduplicated.

Note that it may be better for userspace to format the filesystem with
512-byte inodes in this case.  However, it's not the default.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-02 00:49:54 -04:00
Linus Torvalds
6dc2cce932 Merge branch 'x86-process-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pul x86/process updates from Ingo Molnar:
 "The main change in this cycle was to add the ARCH_[GET|SET]_CPUID
  prctl() ABI extension to control the availability of the CPUID
  instruction, analogously to the existing PR_GET|SET_TSC ABI that
  controls RDTSC.

  Motivation: the 'rr' user-space record-and-replay execution debugger
  would like to trap and emulate the CPUID instruction - which
  instruction is normally unprivileged.

  Trapping CPUID is possible on IvyBridge and later Intel CPUs - expose
  this hardware capability"

* 'x86-process-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/syscalls/32: Ignore arch_prctl for other architectures
  um/arch_prctl: Fix fallout from x86 arch_prctl() rework
  x86/arch_prctl: Add ARCH_[GET|SET]_CPUID
  x86/cpufeature: Detect CPUID faulting support
  x86/syscalls/32: Wire up arch_prctl on x86-32
  x86/arch_prctl: Add do_arch_prctl_common()
  x86/arch_prctl/64: Rename do_arch_prctl() to do_arch_prctl_64()
  x86/arch_prctl/64: Use SYSCALL_DEFINE2 to define sys_arch_prctl()
  x86/arch_prctl: Rename 'code' argument to 'option'
  x86/msr: Rename MISC_FEATURE_ENABLES to MISC_FEATURES_ENABLES
  x86/process: Optimize TIF_NOTSC switch
  x86/process: Correct and optimize TIF_BLOCKSTEP switch
  x86/process: Optimize TIF checks in __switch_to_xtra()
2017-05-01 19:57:58 -07:00
Linus Torvalds
3527d3e951 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - another round of rq-clock handling debugging, robustization and
     fixes

   - PELT accounting improvements

   - CPU hotplug related ->cpus_allowed affinity handling fixes all
     around the tree

   - ... plus misc fixes, cleanups and updates"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  sched/x86: Update reschedule warning text
  crypto: N2 - Replace racy task affinity logic
  cpufreq/sparc-us2e: Replace racy task affinity logic
  cpufreq/sparc-us3: Replace racy task affinity logic
  cpufreq/sh: Replace racy task affinity logic
  cpufreq/ia64: Replace racy task affinity logic
  ACPI/processor: Replace racy task affinity logic
  ACPI/processor: Fix error handling in __acpi_processor_start()
  sparc/sysfs: Replace racy task affinity logic
  powerpc/smp: Replace open coded task affinity logic
  ia64/sn/hwperf: Replace racy task affinity logic
  ia64/salinfo: Replace racy task affinity logic
  workqueue: Provide work_on_cpu_safe()
  ia64/topology: Remove cpus_allowed manipulation
  sched/fair: Move the PELT constants into a generated header
  sched/fair: Increase PELT accuracy for small tasks
  sched/fair: Fix comments
  sched/Documentation: Add 'sched-pelt' tool
  sched/fair: Fix corner case in __accumulate_sum()
  sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags()
  ...
2017-05-01 19:12:53 -07:00
Linus Torvalds
5db6db0d40 Merge branch 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess unification updates from Al Viro:
 "This is the uaccess unification pile. It's _not_ the end of uaccess
  work, but the next batch of that will go into the next cycle. This one
  mostly takes copy_from_user() and friends out of arch/* and gets the
  zero-padding behaviour in sync for all architectures.

  Dealing with the nocache/writethrough mess is for the next cycle;
  fortunately, that's x86-only. Same for cleanups in iov_iter.c (I am
  sold on access_ok() in there, BTW; just not in this pile), same for
  reducing __copy_... callsites, strn*... stuff, etc. - there will be a
  pile about as large as this one in the next merge window.

  This one sat in -next for weeks. -3KLoC"

* 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (96 commits)
  HAVE_ARCH_HARDENED_USERCOPY is unconditional now
  CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now
  m32r: switch to RAW_COPY_USER
  hexagon: switch to RAW_COPY_USER
  microblaze: switch to RAW_COPY_USER
  get rid of padding, switch to RAW_COPY_USER
  ia64: get rid of copy_in_user()
  ia64: sanitize __access_ok()
  ia64: get rid of 'segment' argument of __do_{get,put}_user()
  ia64: get rid of 'segment' argument of __{get,put}_user_check()
  ia64: add extable.h
  powerpc: get rid of zeroing, switch to RAW_COPY_USER
  esas2r: don't open-code memdup_user()
  alpha: fix stack smashing in old_adjtimex(2)
  don't open-code kernel_setsockopt()
  mips: switch to RAW_COPY_USER
  mips: get rid of tail-zeroing in primitives
  mips: make copy_from_user() zero tail explicitly
  mips: clean and reorder the forest of macros...
  mips: consolidate __invoke_... wrappers
  ...
2017-05-01 14:41:04 -07:00
Arnd Bergmann
67fd389735 block, dax: use correct format string in bdev_dax_supported
The new message has an incorrect format string, causing a warning in some
configurations:

fs/block_dev.c: In function 'bdev_dax_supported':
fs/block_dev.c:779:5: error: format '%d' expects argument of type 'int', but argument 2 has type 'long int' [-Werror=format=]
     "error: dax access failed (%d)", len);

This changes it to use the correct %ld instead of %d.

Fixes: 2093f2e9df ("block, dax: convert bdev_dax_supported() to dax_direct_access()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-01 13:16:29 -07:00
Linus Torvalds
694752922b Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:

 - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
   was initially a fork of CFQ, but subsequently changed to implement
   fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
   to be used on desktop type single drives, providing good fairness.
   From Paolo.

 - Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
   using a scalable token based algorithm that throttles IO based on
   live completion IO stats, similary to blk-wbt. From Omar.

 - A series from Jan, moving users to separately allocated backing
   devices. This continues the work of separating backing device life
   times, solving various problems with hot removal.

 - A series of updates for lightnvm, mostly from Javier. Includes a
   'pblk' target that exposes an open channel SSD as a physical block
   device.

 - A series of fixes and improvements for nbd from Josef.

 - A series from Omar, removing queue sharing between devices on mostly
   legacy drivers. This helps us clean up other bits, if we know that a
   queue only has a single device backing. This has been overdue for
   more than a decade.

 - Fixes for the blk-stats, and improvements to unify the stats and user
   windows. This both improves blk-wbt, and enables other users to
   register a need to receive IO stats for a device. From Omar.

 - blk-throttle improvements from Shaohua. This provides a scalable
   framework for implementing scalable priotization - particularly for
   blk-mq, but applicable to any type of block device. The interface is
   marked experimental for now.

 - Bucketized IO stats for IO polling from Stephen Bates. This improves
   efficiency of polled workloads in the presence of mixed block size
   IO.

 - A few fixes for opal, from Scott.

 - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
   From a variety of folks, mostly Sagi and James Smart.

 - A series from Bart, improving our exposed info and capabilities from
   the blk-mq debugfs support.

 - A series from Christoph, cleaning up how handle WRITE_ZEROES.

 - A series from Christoph, cleaning up the block layer handling of how
   we track errors in a request. On top of being a nice cleanup, it also
   shrinks the size of struct request a bit.

 - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
   never used by platforms, and the latter has outlived it's usefulness.

 - Various little bug fixes and cleanups from a wide variety of folks.

* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
  block: hide badblocks attribute by default
  blk-mq: unify hctx delay_work and run_work
  block: add kblock_mod_delayed_work_on()
  blk-mq: unify hctx delayed_run_work and run_work
  nbd: fix use after free on module unload
  MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
  blk-mq-sched: alloate reserved tags out of normal pool
  mtip32xx: use runtime tag to initialize command header
  scsi: Implement blk_mq_ops.show_rq()
  blk-mq: Add blk_mq_ops.show_rq()
  blk-mq: Show operation, cmd_flags and rq_flags names
  blk-mq: Make blk_flags_show() callers append a newline character
  blk-mq: Move the "state" debugfs attribute one level down
  blk-mq: Unregister debugfs attributes earlier
  blk-mq: Only unregister hctxs for which registration succeeded
  blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
  blk-mq: Let blk_mq_debugfs_register() look up the queue name
  blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
  ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
  ide-pm: always pass 0 error to __blk_end_request_all
  ..
2017-05-01 10:39:57 -07:00
Theodore Ts'o
72d622b422 ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
Add fallback code and a WARN_ONCE() call instead of a BUG_ON() in
the ext4_end_bio() function.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 20:08:05 -04:00
Jan Kara
dddbd6ac8f ext4: avoid unnecessary transaction stalls during writeback
Currently ext4_writepages() submits all pages with transaction started.
When no page needs block allocation or extent conversion we can submit
all dirty pages in the inode while holding a single transaction handle
and when device is congested this can take significant amount of time.
Thus ext4_writepages() can block transaction commits for extended
periods of time.

Take for example a simple benchmark simulating PostgreSQL database
(pgioperf in mmtest). The benchmark runs 16 processes doing random reads
from a huge file, one process doing random writes to the huge file, and
one process doing sequential writes to a small files and frequently
running fsync. With unpatched kernel transaction commits take on average
~18s with standard deviation of ~41s, top 5 commit times are:

274.466639s, 126.467347s, 86.992429s, 34.351563s, 31.517653s.

After this patch transaction commits take on average 0.1s with standard
deviation of 0.15s, top 5 commit times are:

0.563792s, 0.519980s, 0.509841s, 0.471700s, 0.469899s

[ Modified so we use an explicit do_map flag instead of relying on
  io_end not being allocated, the since io_end->inode is needed for I/O
  error handling. -- tytso ]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 18:29:10 -04:00
Joe Richey
9c8268def6 fscrypt: Move key structure and constants to uapi
This commit exposes the necessary constants and structures for a
userspace program to pass filesystem encryption keys into the keyring.
The fscrypt_key structure was already part of the kernel ABI, this
change just makes it so programs no longer have to redeclare these
structures (like e4crypt in e2fsprogs currently does).

Note that we do not expose the other FS_*_KEY_SIZE constants as they are
not necessary. Only XTS is supported for contents_encryption_mode, so
currently FS_MAX_KEY_SIZE bytes of key material must always be passed to
the kernel.

This commit also removes __packed from fscrypt_key as it does not
contain any implicit padding and does not refer to an on-disk structure.

Signed-off-by: Joe Richey <joerichey@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 01:26:34 -04:00
Eric Biggers
cd39e4bac1 fscrypt: remove unnecessary checks for NULL operations
The functions in fs/crypto/*.c are only called by filesystems configured
with encryption support.  Since the ->get_context(), ->set_context(),
and ->empty_dir() operations are always provided in that case (and must
be, otherwise there would be no way to get/set encryption policies, or
in the case of ->get_context() even access encrypted files at all),
there is no need to check for these operations being NULL and we can
remove these unneeded checks.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 01:26:34 -04:00
Andrew Perepechko
85c8f176a6 ext4: preload block group descriptors
With enabled meta_bg option block group descriptors
reading IO is not sequential and requires optimization.

Signed-off-by: Andrew Perepechko <andrew.perepechko@seagate.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:46:35 -04:00
Eric Biggers
1a20a63084 ext4: make ext4_shutdown() static
Make the ext4_shutdown() function static, as suggested by running sparse
('make C=2 fs/ext4/').  This was the only such warning in fs/ext4/.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:40:44 -04:00
Darrick J. Wong
0c9ec4beec ext4: support GETFSMAP ioctls
Support the GETFSMAP ioctls so that we can use the xfs free space
management tools to probe ext4 as well.  Note that this is a partial
implementation -- we only report fixed-location metadata and free space;
everything else is reported as "unknown".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:36:53 -04:00
Eric Biggers
7b4cc9787f ext4: evict inline data when writing to memory map
Currently the case of writing via mmap to a file with inline data is not
handled.  This is maybe a rare case since it requires a writable memory
map of a very small file, but it is trivial to trigger with on
inline_data filesystem, and it causes the
'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
ext4_writepages() to be hit:

    mkfs.ext4 -O inline_data /dev/vdb
    mount /dev/vdb /mnt
    xfs_io -f /mnt/file \
	-c 'pwrite 0 1' \
	-c 'mmap -w 0 1m' \
	-c 'mwrite 0 1' \
	-c 'fsync'

	kernel BUG at fs/ext4/inode.c:2723!
	invalid opcode: 0000 [#1] SMP
	CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
	task: ffff88003d3a8040 task.stack: ffffc90000300000
	RIP: 0010:ext4_writepages+0xc89/0xf8a
	RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
	RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
	RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
	RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
	R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
	R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
	FS:  00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
	Call Trace:
	 ? _raw_spin_unlock+0x27/0x2a
	 ? kvm_clock_read+0x1e/0x20
	 do_writepages+0x23/0x2c
	 ? do_writepages+0x23/0x2c
	 __filemap_fdatawrite_range+0x80/0x87
	 filemap_write_and_wait_range+0x67/0x8c
	 ext4_sync_file+0x20e/0x472
	 vfs_fsync_range+0x8e/0x9f
	 ? syscall_trace_enter+0x25b/0x2d0
	 vfs_fsync+0x1c/0x1e
	 do_fsync+0x31/0x4a
	 SyS_fsync+0x10/0x14
	 do_syscall_64+0x69/0x131
	 entry_SYSCALL64_slow_path+0x25/0x25

We could try to be smart and keep the inline data in this case, or at
least support delayed allocation when allocating the block, but these
solutions would be more complicated and don't seem worthwhile given how
rare this case seems to be.  So just fix the bug by calling
ext4_convert_inline_data() when we're asked to make a page writable, so
that any inline data gets evicted, with the block allocated immediately.

Reported-by: Nick Alcock <nick.alcock@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:10:50 -04:00
Eric Biggers
6ba644b9fd ext4: remove ext4_xattr_check_entry()
ext4_xattr_check_entry() was redundant with validation of the full xattr
entries list in ext4_xattr_check_entries(), which all callers also did.
ext4_xattr_check_entry() also didn't actually do correct validation;
specifically, it never checked that the value doesn't overlap the xattr
names, nor did it account for padding when checking whether the xattr
value overflows the available space.  So remove it to eliminate any
potential confusion.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:01:02 -04:00
Eric Biggers
2c4f992337 ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
ext4_xattr_check_names() actually validates both the xattr names and
values, not just the names.  So rename it to ext4_xattr_check_entries()
to avoid confusion.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:56:52 -04:00
Eric Biggers
ba7ea1d8f4 ext4: merge ext4_xattr_list() into ext4_listxattr()
There's no difference between ext4_xattr_list() and ext4_listxattr(), so
merge them together and just have ext4_listxattr().  Some years ago they
took different arguments, but that's no longer the case.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:53:17 -04:00
Eric Biggers
d600618673 ext4: constify static data that is never modified
Constify static data in ext4 that is never (intentionally) modified so
that it is placed in .rodata and benefits from memory protection.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:47:50 -04:00
Eric Biggers
1bc0af600b ext4: trim return value and 'dir' argument from ext4_insert_dentry()
In the initial implementation of ext4 encryption, the filename was
encrypted in ext4_insert_dentry(), which could fail and also required
access to the 'dir' inode.  Since then ext4 filename encryption has been
changed to encrypt the filename earlier, so we can revert the additions
to ext4_insert_dentry().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:27:26 -04:00
Jan Kara
5052b069ac jbd2: fix dbench4 performance regression for 'nobarrier' mounts
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_FUA implementation. Since
JBD2 strips REQ_FUA and REQ_FLUSH flags from submitted IO when the
filesystem is mounted with nobarrier mount option, journal superblock
writes ended up being async writes after this patch and that caused
heavy performance regression for dbench4 benchmark with high number of
processes. In my test setup with HP RAID array with non-volatile write
cache and 32 GB ram, dbench4 runs with 8 processes regressed by ~25%.

Fix the problem by making sure journal superblock writes are always
treated as synchronous since they generally block progress of the
journalling machinery and thus the whole filesystem.

Fixes: b685d3d65a
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 21:07:30 -04:00
Jan Kara
c52c47e4b4 jbd2: Fix lockdep splat with generic/270 test
I've hit a lockdep splat with generic/270 test complaining that:

3216.fsstress.b/3533 is trying to acquire lock:
 (jbd2_handle){++++..}, at: [<ffffffff813152e0>] jbd2_log_wait_commit+0x0/0x150

but task is already holding lock:
 (jbd2_handle){++++..}, at: [<ffffffff8130bd3b>] start_this_handle+0x35b/0x850

The underlying problem is that jbd2_journal_force_commit_nested()
(called from ext4_should_retry_alloc()) may get called while a
transaction handle is started. In such case it takes care to not wait
for commit of the running transaction (which would deadlock) but only
for a commit of a transaction that is already committing (which is safe
as that doesn't wait for any filesystem locks).

In fact there are also other callers of jbd2_log_wait_commit() that take
care to pass tid of a transaction that is already committing and for
those cases, the lockdep instrumentation is too restrictive and leading
to false positive reports. Fix the problem by calling
jbd2_might_wait_for_commit() from jbd2_log_wait_commit() only if the
transaction isn't already committing.

Fixes: 1eaa566d36
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 20:12:16 -04:00
Mark Charlebois
9280cdd6fe fs: compat: Remove warning from COMPATIBLE_IOCTL
cmd in COMPATIBLE_IOCTL is always a u32, so cast it so there isn't a
warning about an overflow in XFORM.

From: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Behan Webster <behanw@converseincode.com>
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-29 17:47:19 -04:00
Al Viro
0b33540f9d remove pointless extern of atime_need_update_rcu()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-29 17:42:25 -04:00
Trond Myklebust
1f18b82c34 pNFS: Ensure we commit the layout if it has been invalidated
If the layout is being invalidated on the server, then we must
invoke nfs_commit_inode() to ensure any commits to the DS get
cleared out.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-29 11:29:30 -04:00
Trond Myklebust
722f0b8911 pNFS: Don't send COMMITs to the DSes if the server invalidated our layout
If the layout was invalidated, then assume we should requeue all the
pending writes for the DS in question.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-29 11:29:24 -04:00
Trond Myklebust
37f8aa16da pNFS/flexfiles: Fix up the ff_layout_write_pagelist failure path
If the attempt to write through pNFS fails, we need to use the same
failure semantics as for the read path: If the FF_FLAGS_NO_IO_THRU_MDS
flag is set or we have sufficient valid DSes, then we must retry through
pNFS

Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-29 00:02:37 -04:00
Takashi Iwai
d66bb1607e proc: Fix unbalanced hard link numbers
proc_create_mount_point() forgot to increase the parent's nlink, and
it resulted in unbalanced hard link numbers, e.g. /proc/fs shows one
less than expected.

Fixes: eb6d38d542 ("proc: Allow creating permanently empty directories...")
Cc: stable@vger.kernel.org
Reported-by: Tristan Ye <tristan.ye@suse.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-04-28 21:05:26 -05:00
Linus Torvalds
28b2013587 Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "We have one more fix for btrfs.

  This gets rid of a new WARN_ON from rc1 that ended up making more
  noise than we really want. The larger fix for the underflow got
  delayed a bit and it's better for now to put it under
  CONFIG_BTRFS_DEBUG"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: qgroup: move noisy underflow warning to debugging build
2017-04-28 10:13:17 -07:00
Trond Myklebust
bdebfccd0e pNFS: Ensure we check layout validity before marking it for return
pnfs_error_mark_layout_for_return needs to check that the layout is
valid before calling pnfs_set_plh_return_info().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-28 13:07:01 -04:00
Olga Kornievskaia
88bd4f8629 NFS4.1 handle interrupted slot reuse from ERR_DELAY
If the RPC slot was interrupted and server replied to the next
operation on the "reused" slot with ERR_DELAY, don't clear out
the "interrupted" flag until we properly recover.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-28 13:07:00 -04:00
Pan Bian
4edabfd7d0 NFSv4: check return value of xdr_inline_decode
Function xdr_inline_decode() will return a NULL pointer if the input
buffer does not have long enough buffer to decode nbytes of data.
However, in function decode_op_map(), the return value of
xdr_inline_decode() is not validated before it is used. This patch adds
a check to the return value of xdr_inline_decode().

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-28 13:06:59 -04:00
Artem Savkov
209aa23083 nfs/filelayout: fix NULL pointer dereference in fl_pnfs_update_layout()
Calling pnfs_put_lset on an IS_ERR pointer results in a NULL pointer
dereference like the one below. At the same time the check of retvalue
of filelayout_check_deviceid() sets lseg to error, but does not free it
before that.

[ 3000.636161] BUG: unable to handle kernel NULL pointer dereference at 000000000000003c
[ 3000.636970] IP: pnfs_put_lseg+0x29/0x100 [nfsv4]
[ 3000.637420] PGD 4f23b067
[ 3000.637421] PUD 4a0f4067
[ 3000.637679] PMD 0
[ 3000.637937]
[ 3000.638287] Oops: 0000 [#1] SMP
[ 3000.638591] Modules linked in: nfs_layout_nfsv41_files nfsv3 nfnetlink_queue nfnetlink_log nfnetlink bluetooth rfkill rpcsec_gss_krb5 nfsv4 nfs fscache binfmt_misc arc4 md4 nls_utf8 cifs ccm dns_resolver rpcrdma ib_isert iscsi_target_mod ib_iser rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_ucm ib_uverbs ib_umad ib_cm ib_core nls_koi8_u nls_cp932 ts_kmp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr virtio_balloon ppdev virtio_rng parport_pc i2c_piix4 parport acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c ata_generic pata_acpi virtio_blk virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ata_piix ttm libata drm serio_raw
[ 3000.645245]  i2c_core virtio_pci virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: xt_u32]
[ 3000.646360] CPU: 1 PID: 26402 Comm: date Not tainted 4.11.0-rc7.1.el7.test.x86_64 #1
[ 3000.647092] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 3000.647638] task: ffff8800415ada00 task.stack: ffffc90000ff0000
[ 3000.648207] RIP: 0010:pnfs_put_lseg+0x29/0x100 [nfsv4]
[ 3000.648696] RSP: 0018:ffffc90000ff39b8 EFLAGS: 00010246
[ 3000.649193] RAX: 0000000000000000 RBX: fffffffffffffff4 RCX: 00000000000d43be
[ 3000.649859] RDX: 00000000000d43bd RSI: 0000000000000000 RDI: fffffffffffffff4
[ 3000.650530] RBP: ffffc90000ff39d8 R08: 000000000001e320 R09: ffffffffa05c35ce
[ 3000.651203] R10: ffff88007fd1e320 R11: ffffea0001283d80 R12: 0000000001400040
[ 3000.651875] R13: ffff88004f77d9f0 R14: ffffc90000ff3cd8 R15: ffff8800417ade00
[ 3000.652546] FS:  00007fac4d5cd740(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[ 3000.653304] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3000.653849] CR2: 000000000000003c CR3: 000000004f080000 CR4: 00000000000406e0
[ 3000.654527] Call Trace:
[ 3000.654771]  fl_pnfs_update_layout.constprop.20+0x10c/0x150 [nfs_layout_nfsv41_files]
[ 3000.655505]  filelayout_pg_init_write+0x21d/0x270 [nfs_layout_nfsv41_files]
[ 3000.656195]  __nfs_pageio_add_request+0x11c/0x490 [nfs]
[ 3000.656698]  nfs_pageio_add_request+0xac/0x260 [nfs]
[ 3000.657180]  nfs_do_writepage+0x109/0x2e0 [nfs]
[ 3000.657616]  nfs_writepages_callback+0x16/0x30 [nfs]
[ 3000.658096]  write_cache_pages+0x26f/0x510
[ 3000.658495]  ? nfs_do_writepage+0x2e0/0x2e0 [nfs]
[ 3000.658946]  ? _raw_spin_unlock_bh+0x1e/0x20
[ 3000.659357]  ? wb_wakeup_delayed+0x5f/0x70
[ 3000.659748]  ? __mark_inode_dirty+0x2eb/0x360
[ 3000.660170]  nfs_writepages+0x84/0xd0 [nfs]
[ 3000.660575]  ? nfs_updatepage+0x571/0xb70 [nfs]
[ 3000.661012]  do_writepages+0x1e/0x30
[ 3000.661358]  __filemap_fdatawrite_range+0xc6/0x100
[ 3000.661819]  filemap_write_and_wait_range+0x41/0x90
[ 3000.662292]  nfs_file_fsync+0x34/0x1f0 [nfs]
[ 3000.662704]  vfs_fsync_range+0x3d/0xb0
[ 3000.663065]  vfs_fsync+0x1c/0x20
[ 3000.663385]  nfs4_file_flush+0x57/0x80 [nfsv4]
[ 3000.663813]  filp_close+0x2f/0x70
[ 3000.664132]  __close_fd+0x9a/0xc0
[ 3000.664453]  SyS_close+0x23/0x50
[ 3000.664785]  do_syscall_64+0x67/0x180
[ 3000.665162]  entry_SYSCALL64_slow_path+0x25/0x25
[ 3000.665600] RIP: 0033:0x7fac4d0e1e90
[ 3000.665946] RSP: 002b:00007ffd54e90c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 3000.666679] RAX: ffffffffffffffda RBX: 00007fac4d3b5400 RCX: 00007fac4d0e1e90
[ 3000.667349] RDX: 0000000000000000 RSI: 00007fac4d5d9000 RDI: 0000000000000001
[ 3000.668031] RBP: 0000000000000000 R08: 00007fac4d3b6a00 R09: 00007fac4d5cd740
[ 3000.668709] R10: 00007ffd54e909e0 R11: 0000000000000246 R12: 0000000000000000
[ 3000.669385] R13: 00007fac4d3b5e80 R14: 0000000000000000 R15: 0000000000000000
[ 3000.670061] Code: 00 00 66 66 66 66 90 55 48 85 ff 48 89 e5 41 56 41 55 41 54 53 48 89 fb 0f 84 97 00 00 00 f6 05 16 8f bc ff 10 0f 85 a6 00 00 00 <4c> 8b 63 48 48 8d 7b 38 49 8b 84 24 90 00 00 00 4c 8d a8 88 00
[ 3000.671831] RIP: pnfs_put_lseg+0x29/0x100 [nfsv4] RSP: ffffc90000ff39b8
[ 3000.672462] CR2: 000000000000003c

Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-28 13:06:59 -04:00
Brian Foster
e20c8a517f xfs: wait on new inodes during quotaoff dquot release
The quotaoff operation has a race with inode allocation that results
in a livelock. An inode allocation that occurs before the quota
status flags are updated acquires the appropriate dquots for the
inode via xfs_qm_vop_dqalloc(). It then inserts the XFS_INEW inode
into the perag radix tree, sometime later attaches the dquots to the
inode and finally clears the XFS_INEW flag. Quotaoff expects to
release the dquots from all inodes in the filesystem via
xfs_qm_dqrele_all_inodes(). This invokes the AG inode iterator,
which skips inodes in the XFS_INEW state because they are not fully
constructed. If the scan occurs after dquots have been attached to
an inode, but before XFS_INEW is cleared, the newly allocated inode
will continue to hold a reference to the applicable dquots. When
quotaoff invokes xfs_qm_dqpurge_all(), the reference count of those
dquot(s) remain elevated and the dqpurge scan spins indefinitely.

To address this problem, update the xfs_qm_dqrele_all_inodes() scan
to wait on inodes marked on the XFS_INEW state. We wait on the
inodes explicitly rather than skip and retry to avoid continuous
retry loops due to a parallel inode allocation workload. Since
quotaoff updates the quota state flags and uses a synchronous
transaction before the dqrele scan, and dquots are attached to
inodes after radix tree insertion iff quota is enabled, one INEW
waiting pass through the AG guarantees that the scan has processed
all inodes that could possibly hold dquot references.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-28 08:11:08 -07:00
Brian Foster
ae2c4ac2dd xfs: update ag iterator to support wait on new inodes
The AG inode iterator currently skips new inodes as such inodes are
inserted into the inode radix tree before they are fully
constructed. Certain contexts require the ability to wait on the
construction of new inodes, however. The fs-wide dquot release from
the quotaoff sequence is an example of this.

Update the AG inode iterator to support the ability to wait on
inodes flagged with XFS_INEW upon request. Create a new
xfs_inode_ag_iterator_flags() interface and support a set of
iteration flags to modify the iteration behavior. When the
XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the
radix tree inode lookup and wait on them before the callback is
executed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-28 08:11:08 -07:00
Brian Foster
756baca27f xfs: support ability to wait on new inodes
Inodes that are inserted into the perag tree but still under
construction are flagged with the XFS_INEW bit. Most contexts either
skip such inodes when they are encountered or have the ability to
handle them.

The runtime quotaoff sequence introduces a context that must wait
for construction of such inodes to correctly ensure that all dquots
in the fs are released. In anticipation of this, support the ability
to wait on new inodes. Wake the appropriate bit when XFS_INEW is
cleared.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-28 08:11:08 -07:00
Amir Goldstein
8f720d9f89 xfs: publish UUID in struct super_block
Copy the uuid of the filesystem to struct super_block s_uuid field,
as several other filesystems already do.  Copy regardless of the nouuid
mount option, because other filesystems also do not guaranty uniqueness
of the s_uuid field in super_block struct.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-28 08:10:53 -07:00
NeilBrown
a6f74e80f2 cifs: don't check for failure from mempool_alloc()
mempool_alloc() cannot fail if the gfp flags allow it to
sleep, and both GFP_FS allows for sleeping.

So these tests of the return value from mempool_alloc()
cannot be needed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-28 07:56:33 -05:00
Sachin Prabhu
7d0c234fd2 Do not return number of bytes written for ioctl CIFS_IOC_COPYCHUNK_FILE
commit 620d8745b3 ("Introduce cifs_copy_file_range()") changes the
behaviour of the cifs ioctl call CIFS_IOC_COPYCHUNK_FILE. In case of
successful writes, it now returns the number of bytes written. This
return value is treated as an error by the xfstest cifs/001. Depending
on the errno set at that time, this may or may not result in the test
failing.

The patch fixes this by setting the return value to 0 in case of
successful writes.

Fixes: commit 620d8745b3 ("Introduce cifs_copy_file_range()")
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-28 07:56:33 -05:00
Sachin Prabhu
cd8c42968e Fix match_prepath()
Incorrect return value for shares not using the prefix path means that
we will never match superblocks for these shares.

Fixes: commit c1d8b24d18 ("Compare prepaths when comparing superblocks")
Cc: stable@vger.kernel.org
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-28 07:54:54 -05:00
Kees Cook
3a7d2fd16c pstore: Solve lockdep warning by moving inode locks
Lockdep complains about a possible deadlock between mount and unlink
(which is technically impossible), but fixing this improves possible
future multiple-backend support, and keeps locking in the right order.

The lockdep warning could be triggered by unlinking a file in the
pstore filesystem:

  -> #1 (&sb->s_type->i_mutex_key#14){++++++}:
         lock_acquire+0xc9/0x220
         down_write+0x3f/0x70
         pstore_mkfile+0x1f4/0x460
         pstore_get_records+0x17a/0x320
         pstore_fill_super+0xa4/0xc0
         mount_single+0x89/0xb0
         pstore_mount+0x13/0x20
         mount_fs+0xf/0x90
         vfs_kern_mount+0x66/0x170
         do_mount+0x190/0xd50
         SyS_mount+0x90/0xd0
         entry_SYSCALL_64_fastpath+0x1c/0xb1

  -> #0 (&psinfo->read_mutex){+.+.+.}:
         __lock_acquire+0x1ac0/0x1bb0
         lock_acquire+0xc9/0x220
         __mutex_lock+0x6e/0x990
         mutex_lock_nested+0x16/0x20
         pstore_unlink+0x3f/0xa0
         vfs_unlink+0xb5/0x190
         do_unlinkat+0x24c/0x2a0
         SyS_unlinkat+0x16/0x30
         entry_SYSCALL_64_fastpath+0x1c/0xb1

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&sb->s_type->i_mutex_key#14);
                                lock(&psinfo->read_mutex);
                                lock(&sb->s_type->i_mutex_key#14);
   lock(&psinfo->read_mutex);

Reported-by: Marta Lofstedt <marta.lofstedt@intel.com>
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
2017-04-27 20:35:34 -07:00
Trond Myklebust
ed6473ddc7 NFSv4: Fix callback server shutdown
We want to use kthread_stop() in order to ensure the threads are
shut down before we tear down the nfs_callback_info in nfs_callback_down.

Tested-and-reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Fixes: bb6aeba736 ("NFSv4.x: Switch to using svc_set_num_threads()...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-27 18:00:16 -04:00
Kinglong Mee
df807fffaa NFSv4.x/callback: Create the callback service through svc_create_pooled
As the comments for svc_set_num_threads() said,
" Destroying threads relies on the service threads filling in
rqstp->rq_task, which only the nfs ones do.  Assumes the serv
has been created using svc_create_pooled()."

If creating service through svc_create(), the svc_pool_map_put()
will be called in svc_destroy(), but the pool map isn't used.
So that, the reference of pool map will be drop, the next using
of pool map will get a zero npools.

[  137.992130] divide error: 0000 [#1] SMP
[  137.992148] Modules linked in: nfsd(E) nfsv4 nfs fscache fuse tun bridge stp llc ip_set nfnetlink vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event vmw_balloon coretemp crct10dif_pclmul crc32_pclmul ppdev ghash_clmulni_intel intel_rapl_perf joydev snd_ens1371 gameport snd_ac97_codec ac97_bus snd_seq snd_pcm snd_rawmidi snd_timer snd_seq_device snd soundcore parport_pc parport nfit acpi_cpufreq tpm_tis tpm_tis_core tpm vmw_vmci i2c_piix4 shpchp auth_rpcgss nfs_acl lockd(E) grace sunrpc(E) xfs libcrc32c vmwgfx drm_kms_helper ttm crc32c_intel drm e1000 mptspi scsi_transport_spi serio_raw mptscsih mptbase ata_generic pata_acpi [last unloaded: nfsd]
[  137.992336] CPU: 0 PID: 4514 Comm: rpc.nfsd Tainted: G            E   4.11.0-rc8+ #536
[  137.992777] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  137.993757] task: ffff955984101d00 task.stack: ffff9873c2604000
[  137.994231] RIP: 0010:svc_pool_for_cpu+0x2b/0x80 [sunrpc]
[  137.994768] RSP: 0018:ffff9873c2607c18 EFLAGS: 00010246
[  137.995227] RAX: 0000000000000000 RBX: ffff95598376f000 RCX: 0000000000000002
[  137.995673] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9559944aec00
[  137.996156] RBP: ffff9873c2607c18 R08: ffff9559944aec28 R09: 0000000000000000
[  137.996609] R10: 0000000001080002 R11: 0000000000000000 R12: ffff95598376f010
[  137.997063] R13: ffff95598376f018 R14: ffff9559944aec28 R15: ffff9559944aec00
[  137.997584] FS:  00007f755529eb40(0000) GS:ffff9559bb600000(0000) knlGS:0000000000000000
[  137.998048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  137.998548] CR2: 000055f3aecd9660 CR3: 0000000084290000 CR4: 00000000001406f0
[  137.999052] Call Trace:
[  137.999517]  svc_xprt_do_enqueue+0xef/0x260 [sunrpc]
[  138.000028]  svc_xprt_received+0x47/0x90 [sunrpc]
[  138.000487]  svc_add_new_perm_xprt+0x76/0x90 [sunrpc]
[  138.000981]  svc_addsock+0x14b/0x200 [sunrpc]
[  138.001424]  ? recalc_sigpending+0x1b/0x50
[  138.001860]  ? __getnstimeofday64+0x41/0xd0
[  138.002346]  ? do_gettimeofday+0x29/0x90
[  138.002779]  write_ports+0x255/0x2c0 [nfsd]
[  138.003202]  ? _copy_from_user+0x4e/0x80
[  138.003676]  ? write_recoverydir+0x100/0x100 [nfsd]
[  138.004098]  nfsctl_transaction_write+0x48/0x80 [nfsd]
[  138.004544]  __vfs_write+0x37/0x160
[  138.004982]  ? selinux_file_permission+0xd7/0x110
[  138.005401]  ? security_file_permission+0x3b/0xc0
[  138.005865]  vfs_write+0xb5/0x1a0
[  138.006267]  SyS_write+0x55/0xc0
[  138.006654]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[  138.007071] RIP: 0033:0x7f7554b9dc30
[  138.007437] RSP: 002b:00007ffc9f92c788 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  138.007807] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f7554b9dc30
[  138.008168] RDX: 0000000000000002 RSI: 00005640cd536640 RDI: 0000000000000003
[  138.008573] RBP: 00007ffc9f92c780 R08: 0000000000000001 R09: 0000000000000002
[  138.008918] R10: 0000000000000064 R11: 0000000000000246 R12: 0000000000000004
[  138.009254] R13: 00005640cdbf77a0 R14: 00005640cdbf7720 R15: 00007ffc9f92c238
[  138.009610] Code: 0f 1f 44 00 00 48 8b 87 98 00 00 00 55 48 89 e5 48 83 78 08 00 74 10 8b 05 07 42 02 00 83 f8 01 74 40 83 f8 02 74 19 31 c0 31 d2 <f7> b7 88 00 00 00 5d 89 d0 48 c1 e0 07 48 03 87 90 00 00 00 c3
[  138.010664] RIP: svc_pool_for_cpu+0x2b/0x80 [sunrpc] RSP: ffff9873c2607c18
[  138.011061] ---[ end trace b3468224cafa7d11 ]---

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-27 17:59:00 -04:00
Geliang Tang
3509d048c8 pstore: Remove unused vmalloc.h in pmsg
Since the vmalloc code has been removed from write_pmsg() in the commit
"5bf6d1b pstore/pmsg: drop bounce buffer", remove the unused header
vmalloc.h.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-04-27 14:48:59 -07:00
Chris Mason
bce19f9d23 Merge branch 'for-chris-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.12 2017-04-27 14:13:09 -07:00
Linus Torvalds
8b5d11e4b0 Thanks to Ari Kauppi and Tuomas Haanpää at Synopsis for spotting bugs in
our NFSv2/v3 xdr code that could crash the server or leak memory.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZAlLrAAoJECebzXlCjuG+lb8P/idTu9rLGaU1VYPInrdoXru0
 iPY+p5inmGSYW2MfCGlS7disaCACgzPBKVqKjeNB1hfHn2JZCrfeBd0XvBYlc7TH
 JYmlHKSBjN/ZfnYMl0WrlVUCZstt4JVmxWjO3sZTCL3nImEbGA7d13yBDagxISWh
 gM9wOOiJwR5lzT/W3MezNCYj4n27/vVhODMP+Qhy0rpK08dBORIH0Bi7hwtVgdD9
 cMVIXMbujmJrCz0Uhbo/DhoItcePRBrCLTXdY6WEeMmHxnXlyaA2XuA0EjjZHuMs
 +BsbAOsNy1BxY1c8Z2ignYgRymUvUBHeiJeIGZLbyKKM2OJMtE4BoqlSWjx08P3Y
 hTwTuaw8u7uxfAsvxepukZoonWj/uVY5tP2Hyq+K6CIKKJpB7vj+c3QGYD3dNnUu
 zDl/LG3ayZgpPXqfjKRnSzZE+St1/IwDnvaM2WN2B1mkuerVr8qDGq6xd4kB0QKX
 BcEKiwcb2ewfPaLlVnSXz6Wbuh2pB42BObJjC3qbOgMvQ7SBUM0UcZmFpJJ20uCR
 BX20aFzB/GHcd6fTRHpDrAxB4XGdVG/8Da5Ki2WhRnmmaeSXushhiKWImujY9B6i
 s4mZHu4gGGJdLzOT5u2HZl93STevriL70SA9nZPhPLQycnJMUAO0buU9UcNI7wXR
 GVu2F3IHd+anxDtDwOv4
 =Glhl
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.11-3' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Thanks to Ari Kauppi and Tuomas Haanpää at Synopsis for spotting bugs
  in our NFSv2/v3 xdr code that could crash the server or leak memory"

* tag 'nfsd-4.11-3' of git://linux-nfs.org/~bfields/linux:
  nfsd: stricter decoding of write-like NFSv2/v3 ops
  nfsd4: minor NFSv2/v3 write decoding cleanup
  nfsd: check for oversized NFSv2/v3 arguments
2017-04-27 13:39:19 -07:00
Linus Torvalds
19ac447420 A fix for a kernel stack overflow bug in ceph setattr code, marked for
stable.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZAhJLAAoJEEp/3jgCEfOLw4gH/ia+bMzmsnkYtjMfxQfCh0ia
 MHi7JS/YcAej/o71c/tvWlTU7mRbmvUCVSAcishRNytEBNGL8YzkP12vMOp/5Vdx
 kKk6yDWn9z0mR5/YdBKaE8ziM5Umdy+zLqeL4yuxyhtbxKFGUPG4txJKS5WD80yU
 Ld/toF2fL3y/JEs+s1pd5G+DPhEhEm2hFf56/VI6N7y08CHJgTqHB3GJ3ZnuUbnU
 UhSvNR9skdVirObI8jt3oWIix8uAGq5+6MjVeTqXo75Qng5sdBGZ8S2agxXbM3j7
 Hu8h/1bhKyPCUzAXnOyGcZeR+5DQolKmlKLhogbT4I9X4YC2ie4Djg0bmFHscWI=
 =8aUa
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.11-rc9' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A fix for a kernel stack overflow bug in ceph setattr code, marked for
  stable"

* tag 'ceph-for-4.11-rc9' of git://github.com/ceph/ceph-client:
  ceph: fix recursion between ceph_set_acl() and __ceph_setattr()
2017-04-27 11:38:05 -07:00
Linus Torvalds
f56fc7bdaa Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:

 - fix orangefs handling of faults on write() - I'd missed that one back
   when orangefs was going through review.

 - readdir counterpart of "9p: cope with bogus responses from server in
   p9_client_{read,write}" - server might be lying or broken, and we'd
   better not overrun the kmalloc'ed buffer we are copying the results
   into.

 - NFS O_DIRECT read/write can leave iov_iter advanced by too much;
   that's what had been causing iov_iter_pipe() warnings davej had been
   seeing.

 - statx_timestamp.tv_nsec type fix (s32 -> u32). That one really should
   go in before 4.11.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  uapi: change the type of struct statx_timestamp.tv_nsec to unsigned
  fix nfs O_DIRECT advancing iov_iter too much
  p9_client_readdir() fix
  orangefs_bufmap_copy_from_iovec(): fix EFAULT handling
2017-04-27 11:09:37 -07:00
Lukas Czerner
3c3781951c xfs: Allow user to kill fstrim process
fstrim can take really long time on big, slow device or on file system
with a lots of allocation groups. Currently there is no way for the user
to cancell the operation. This patch makes it possible for the user to
kill fstrim pocess by adding the check for fatal_signal_pending() in
xfs_trim_extents().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-27 10:45:34 -07:00
Michael Kerrisk (man-pages)
59372bbf3a statx: correct error handling of NULL pathname
The change in commit 1e2f82d1e9 ("statx: Kill fd-with-NULL-path
support in favour of AT_EMPTY_PATH") to error on a NULL pathname to
statx() is inconsistent.

It results in the error EINVAL for a NULL pathname.  Other system calls
with similar APIs (fchownat(), fstatat(), linkat()), return EFAULT.

The solution is simply to remove the EINVAL check.  As I already pointed
out in [1], user_path_at*() and filename_lookup() will handle the NULL
pathname as per the other APIs, to correctly produce the error EFAULT.

[1] https://lkml.org/lkml/2017/4/26/561

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-27 10:45:09 -07:00
Christoph Hellwig
629e014bb8 fs: completely ignore unknown open flags
Currently we just stash anything we got into file->f_flags, and the
report it in fcntl(F_GETFD).  This patch just clears out all unknown
flags so that we don't pass them to the fs or report them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-27 05:13:04 -04:00
Christoph Hellwig
80f18379a7 fs: add a VALID_OPEN_FLAGS
Add a central define for all valid open flags, and use it in the uniqueness
check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-27 05:13:04 -04:00
Eric Biggers
020c2833db fs: remove _submit_bh()
_submit_bh() allowed submitting a buffer_head for I/O using custom
bio_flags.  It used to be used by jbd to set BIO_SNAP_STABLE, introduced
by commit 7136851117 ("mm: make snapshotting pages for stable writes a
per-bio operation").  However, the code and flag has since been removed
and no _submit_bh() users remain.

These days, bio_flags are mostly used internally by the block layer to
track the state of bio's.  As such, it doesn't really make sense for
filesystems to use them instead of op_flags when wanting special
behavior for block requests.

Therefore, remove _submit_bh() and trim the bio_flags argument from
submit_bh_wbc().

Cc: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:06 -04:00
Eric Biggers
cda37124f4 fs: constify tree_descr arrays passed to simple_fill_super()
simple_fill_super() is passed an array of tree_descr structures which
describe the files to create in the filesystem's root directory.  Since
these arrays are never modified intentionally, they should be 'const' so
that they are placed in .rodata and benefit from memory protection.
This patch updates the function signature and all users, and also
constifies tree_descr.name.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:06 -04:00
Fabian Frederick
a80f2d2224 fs/affs: bugfix: Write files greater than page size on OFS
Previous AFFS patch fixed OFS write operations but unveiled
another bug: files greater than 4KB are being created with a wrong
size resulting in errors like the following:

dd if=/dev/zero of=file bs=4097 count=1
cp file /mnt/affs/
cp: error writing '/mnt/affs/file': Bad address

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:06 -04:00
Fabian Frederick
077e073e8f fs/affs: bugfix: enable writes on OFS disks
We called unconditionally affs_bread_ino() with create 0 resulting in
"error (device ...): get_block(): strange block request 0"
when trying to write on AFFS OFS format.

This patch adds create parameter to that function.
0 for affs_readpage_ofs()
1 for affs_write_begin_ofs()

Bug was found here:
https://bugzilla.kernel.org/show_bug.cgi?id=114961

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:06 -04:00
Fabian Frederick
a9d6cfb70f fs/affs: remove node generation check
node generation has to be stored on disk.
AFAICS we won't be able to manage it on AFFS.
This patch removes relevant check in affs_nfs_get_inode()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:05 -04:00
Fabian Frederick
d2d58e0e0d fs/affs: import amigaffs.h
Have that file in global include/linux is not needed.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:05 -04:00
Fabian Frederick
f1bf90724d fs/affs: bugfix: make symbolic links work again
AFFS symbolic links were broken since kernel 2.6.29

Problem was bisected to the following

commit ebd09abbd9 ("vfs: ensure page symlinks are NUL-terminated")
commit 035146851c ("vfs: introduce helper function to safely
NUL-terminate symlinks")

AFFS wasn't setting inode size when reading symbolic link from disk or
creating a new one. Result was zero allocation in pagecache.

ln -s file symlink

ls -lrt

file
symlink ->

This patch adds inode isize information on inode get and symbolic link
addition.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:05 -04:00
David S. Miller
b1513c3531 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 22:39:08 -04:00
David Howells
1e2f82d1e9 statx: Kill fd-with-NULL-path support in favour of AT_EMPTY_PATH
With the new statx() syscall, the following both allow the attributes of
the file attached to a file descriptor to be retrieved:

	statx(dfd, NULL, 0, ...);

and:

	statx(dfd, "", AT_EMPTY_PATH, ...);

Change the code to reject the first option, though this means copying
the path and engaging pathwalk for the fstat() equivalent.  dfd can be a
non-directory provided path is "".

[ The timing of this isn't wonderful, but applying this now before we
  have statx() in any released kernel, before anybody starts using the
  NULL special case.    - Linus ]

Fixes: a528d35e8b ("statx: Add a system call to make enhanced file info available")
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Sandeen <sandeen@sandeen.net>
cc: fstests@vger.kernel.org
cc: linux-api@vger.kernel.org
cc: linux-man@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-26 15:05:47 -07:00
Dan Carpenter
907bfcd8d8 orangefs: handle zero size write in debugfs
If we write zero bytes to this debugfs file, then it will cause an
underflow when we do copy_from_user(buf, ubuf, count - 1).  Debugfs can
normally only be written to by root so the impact of this is low.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:01 -04:00
Martin Brandenburg
b5a9d61eeb orangefs: do not wait for timeout if umounting
When the computer is turned off, all the processes are killed and then
all the filesystems are umounted.  OrangeFS should not wait for the
userspace daemon to come back in that case.

This only works for plain umount(2).  To actually take advantage of this
interactively, `umount -f' is needed; otherwise umount will issue a
statfs first, which will wait for the userspace daemon to come back.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:01 -04:00
Martin Brandenburg
b7a57ccab8 orangefs: return from orangefs_devreq_read quickly if possible
It is not necessary to take the lock and search through the request list
if the list is empty.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
9d286b0d82 orangefs: ensure the userspace component is unmounted if mount fails
If the mount is aborted after userspace has been asked to mount,
userspace must be told to unmount.

Ordinarily orangefs_kill_sb does the unmount.  However it cannot be
called if the superblock has not been set up.  This is a very narrow
window.

The NULL fs_id is not unmounted.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
53950ef541 orangefs: do not check possibly stale size on truncate
Let the server figure this out because our size might be out of date or
not present.

The bug was that

	xfs_io -f -t -c "pread -v 0 100" /mnt/foo
	echo "Test" > /mnt/foo
	xfs_io -f -t -c "pread -v 0 100" /mnt/foo

fails because the second truncate did not happen if nothing had
requested the size after the write in echo.  Thus i_size was zero (not
present) and the orangefs_setattr though i_size was zero and there was
nothing to do.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
68a24a6cc4 orangefs: implement statx
Fortunately OrangeFS has had a getattr request mask for a long time.

The server basically has two difficulty levels for attributes.  Fetching
any attribute except size requires communicating with the metadata
server for that handle.  Since all the attributes are right there, it
makes sense to return them all.  Fetching the size requires
communicating with every I/O server (that the file is distributed
across).  Therefore if asked for anything except size, get everything
except size, and if asked for size, get everything.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
7b796ae370 orangefs: remove ORANGEFS_READDIR macros
They are clones of the ORANGEFS_ITERATE macros in use elsewhere.  Delete
ORANGEFS_ITERATE_NEXT which is a hack previously used by readdir.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
480e3e532e orangefs: support very large directories
This works by maintaining a linked list of pages which the directory
has been read into rather than one giant fixed-size buffer.

This replaces code which limits the total directory size to the total
amount that could be returned in one server request.  Since filenames
are usually considerably shorter than the maximum, the old code could
usually handle several server requests before running out of space.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
72f66b8329 orangefs: support llseek on directories
This and the previous commit fix xfstests generic/257.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
382f4581e6 orangefs: rewrite readdir to fix several bugs
In the past, readdir assumed that the user buffer will be large enough
that all entries from the server will fit.  If this was not true,
entries would be skipped.

Since it works now, request 512 entries rather than 96 per server
operation.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
17930b252c orangefs: do not set getattr_time on orangefs_lookup
Since orangefs_lookup calls orangefs_iget which calls
orangefs_inode_getattr, getattr_time will get set.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
e675c5ec51 orangefs: clean up oversize xattr validation
Also don't check flags as this has been validated by the VFS already.

Fix an off-by-one error in the max size checking.

Stop logging just because userspace wants to write attributes which do
not fit.

This and the previous commit fix xfstests generic/020.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
a956af337b orangefs: fix bounds check for listxattr
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
418ce3eb66 orangefs: remove unused get_fsid_from_ino
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Trond Myklebust
c373fff7bd NFSv4: Don't special case "launder"
If the client receives a fatal server error from nfs_pageio_add_request(),
then we should always truncate the page on which the error occurred.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-26 13:03:04 -04:00
Trond Myklebust
54551d85ad NFS: Add a few more fatal I/O errors to nfs_error_is_fatal()
EACCES, EDQUOT, EFBIG and ESTALE are all fatal errors as far as NFS
I/O is concerned. They need to be reported back to the application.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-26 13:03:04 -04:00
Al Viro
eea86b637a Merge branches 'uaccess.alpha', 'uaccess.arc', 'uaccess.arm', 'uaccess.arm64', 'uaccess.avr32', 'uaccess.bfin', 'uaccess.c6x', 'uaccess.cris', 'uaccess.frv', 'uaccess.h8300', 'uaccess.hexagon', 'uaccess.ia64', 'uaccess.m32r', 'uaccess.m68k', 'uaccess.metag', 'uaccess.microblaze', 'uaccess.mips', 'uaccess.mn10300', 'uaccess.nios2', 'uaccess.openrisc', 'uaccess.parisc', 'uaccess.powerpc', 'uaccess.s390', 'uaccess.score', 'uaccess.sh', 'uaccess.sparc', 'uaccess.tile', 'uaccess.um', 'uaccess.unicore32', 'uaccess.x86' and 'uaccess.xtensa' into work.uaccess 2017-04-26 12:06:59 -04:00
Filipe Manana
a7e3b975a0 Btrfs: fix reported number of inode blocks
Currently when there are buffered writes that were not yet flushed and
they fall within allocated ranges of the file (that is, not in holes or
beyond eof assuming there are no prealloc extents beyond eof), btrfs
simply reports an incorrect number of used blocks through the stat(2)
system call (or any of its variants), regardless of mount options or
inode flags (compress, compress-force, nodatacow). This is because the
number of blocks used that is reported is based on the current number
of bytes in the vfs inode plus the number of dealloc bytes in the btrfs
inode. The later covers bytes that both fall within allocated regions
of the file and holes.

Example scenarios where the number of reported blocks is wrong while the
buffered writes are not flushed:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt/sdc

  $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)

  $ sync

  $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)

  # The following should have reported 64K...
  $ du -h /mnt/sdc/foo1
  128K	/mnt/sdc/foo1

  $ sync

  # After flushing the buffered write, it now reports the correct value.
  $ du -h /mnt/sdc/foo1
  64K	/mnt/sdc/foo1

  $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)

  $ sync

  $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
  wrote 65536/65536 bytes at offset 65536
  64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)

  # The following should have reported 128K...
  $ du -h /mnt/sdc/foo2
  192K	/mnt/sdc/foo2

  $ sync

  # After flushing the buffered write, it now reports the correct value.
  $ du -h /mnt/sdc/foo2
  128K	/mnt/sdc/foo2

So the number of used file blocks is simply incorrect, unlike in other
filesystems such as ext4 and xfs for example, but only while the buffered
writes are not flushed.

Fix this by tracking the number of delalloc bytes that fall within holes
and beyond eof of a file, and use instead this new counter when reporting
the number of used blocks for an inode.

Another different problem that exists is that the delalloc bytes counter
is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from
the respective range in the inode's iotree) and the vfs inode's bytes
counter is only incremented when writeback finishes (through
insert_reserved_file_extent()). Therefore while writeback is ongoing we
simply report a wrong number of blocks used by an inode if the write
operation covers a range previously unallocated. While this change does
not fix this problem, it does minimizes it a lot by shortening that time
window, as the new dealloc bytes counter (new_delalloc_bytes) is only
decremented when writeback finishes right before updating the vfs inode's
bytes counter. Fully fixing this second problem is not trivial and will
be addressed later by a different patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26 16:27:26 +01:00
Filipe Manana
e1cbfd7bf6 Btrfs: send, fix file hole not being preserved due to inline extent
Normally we don't have inline extents followed by regular extents, but
there's currently at least one harmless case where this happens. For
example, when the page size is 4Kb and compression is enabled:

  $ mkfs.btrfs -f /dev/sdb
  $ mount -o compress /dev/sdb /mnt
  $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
  $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar

In this case we get a compressed inline extent, representing 4Kb of
data, followed by a hole extent and then a regular data extent. The
inline extent was not expanded/converted to a regular extent exactly
because it represents 4Kb of data. This does not cause any apparent
problem (such as the issue solved by commit e1699d2d7b
("btrfs: add missing memset while reading compressed inline extents"))
except trigger an unexpected case in the incremental send code path
that makes us issue an operation to write a hole when it's not needed,
resulting in more writes at the receiver and wasting space at the
receiver.

So teach the incremental send code to deal with this particular case.

The issue can be currently triggered by running fstests btrfs/137 with
compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26 16:27:25 +01:00
Filipe Manana
be2d253cc9 Btrfs: fix extent map leak during fallocate error path
If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
extent map structure. The failure can happen either due to an -ENOMEM
condition or, when quotas are enabled, due to -EDQUOT for example.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
2017-04-26 16:27:24 +01:00
Filipe Manana
1c81ba237b Btrfs: fix incorrect space accounting after failure to insert inline extent
When using compression, if we fail to insert an inline extent we
incorrectly end up attempting to free the reserved data space twice,
once through extent_clear_unlock_delalloc(), because we pass it the
flag EXTENT_DO_ACCOUNTING, and once through a direct call to
btrfs_free_reserved_data_space_noquota(). This results in a trace
like the following:

[  834.576240] ------------[ cut here ]------------
[  834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
[  834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2
[  834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
[  834.596103] Call Trace:
[  834.596103]  dump_stack+0x67/0x90
[  834.596103]  __warn+0xc2/0xdd
[  834.596103]  warn_slowpath_null+0x1d/0x1f
[  834.596103]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  834.596103]  compress_file_range.constprop.42+0x2fa/0x3fc [btrfs]
[  834.596103]  ? submit_compressed_extents+0x3a7/0x3a7 [btrfs]
[  834.596103]  async_cow_start+0x32/0x4d [btrfs]
[  834.596103]  btrfs_scrubparity_helper+0x187/0x3e7 [btrfs]
[  834.596103]  btrfs_delalloc_helper+0xe/0x10 [btrfs]
[  834.596103]  process_one_work+0x273/0x4e4
[  834.596103]  worker_thread+0x1eb/0x2ca
[  834.596103]  ? rescuer_thread+0x2b6/0x2b6
[  834.596103]  kthread+0x100/0x108
[  834.596103]  ? __list_del_entry+0x22/0x22
[  834.596103]  ret_from_fork+0x2e/0x40
[  834.611656] ---[ end trace 719902fe6bdef08f ]---

So fix this by not calling directly btrfs_free_reserved_data_space_noquota()
if an error happened.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26 16:27:23 +01:00
Filipe Manana
a315e68f6e Btrfs: fix invalid attempt to free reserved space on failure to cow range
When attempting to COW a file range (we are starting writeback and doing
COW), if we manage to reserve an extent for the range we will write into
but fail after reserving it and before creating the respective ordered
extent, we end up in an error path where we attempt to decrement the
data space's bytes_may_use counter after we already did it while
reserving the extent, leading to a warning/trace like the following:

[  847.621524] ------------[ cut here ]------------
[  847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg
[  847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2
[  847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  847.648601] Call Trace:
[  847.648601]  dump_stack+0x67/0x90
[  847.648601]  __warn+0xc2/0xdd
[  847.648601]  warn_slowpath_null+0x1d/0x1f
[  847.648601]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  847.648601]  btrfs_clear_bit_hook+0x140/0x258 [btrfs]
[  847.648601]  clear_state_bit+0x87/0x128 [btrfs]
[  847.648601]  __clear_extent_bit+0x222/0x2b7 [btrfs]
[  847.648601]  clear_extent_bit+0x17/0x19 [btrfs]
[  847.648601]  extent_clear_unlock_delalloc+0x3b/0x6b [btrfs]
[  847.648601]  cow_file_range.isra.39+0x387/0x39a [btrfs]
[  847.648601]  run_delalloc_nocow+0x4d7/0x70e [btrfs]
[  847.648601]  ? arch_local_irq_save+0x9/0xc
[  847.648601]  run_delalloc_range+0xa7/0x2b5 [btrfs]
[  847.648601]  writepage_delalloc.isra.31+0xb9/0x15c [btrfs]
[  847.648601]  __extent_writepage+0x249/0x2e8 [btrfs]
[  847.648601]  extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs]
[  847.648601]  ? arch_local_irq_save+0x9/0xc
[  847.648601]  ? mark_lock+0x24/0x201
[  847.648601]  extent_writepages+0x4b/0x5c [btrfs]
[  847.648601]  ? btrfs_writepage_start_hook+0xed/0xed [btrfs]
[  847.648601]  btrfs_writepages+0x28/0x2a [btrfs]
[  847.648601]  do_writepages+0x23/0x2c
[  847.648601]  __filemap_fdatawrite_range+0x5a/0x61
[  847.648601]  filemap_fdatawrite_range+0x13/0x15
[  847.648601]  btrfs_fdatawrite_range+0x20/0x46 [btrfs]
[  847.648601]  start_ordered_ops+0x19/0x23 [btrfs]
[  847.648601]  btrfs_sync_file+0x136/0x42c [btrfs]
[  847.648601]  vfs_fsync_range+0x8c/0x9e
[  847.648601]  vfs_fsync+0x1c/0x1e
[  847.648601]  do_fsync+0x31/0x4a
[  847.648601]  SyS_fsync+0x10/0x14
[  847.648601]  entry_SYSCALL_64_fastpath+0x18/0xad
[  847.648601] RIP: 0033:0x7f5b05200800
[  847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[  847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800
[  847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003
[  847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f
[  847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046
[  847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740
[  847.648601]  ? trace_hardirqs_off_caller+0x3f/0xaa
[  847.685787] ---[ end trace 2a4a3e15382508e8 ]---

So fix this by not attempting to decrement the data space info's
bytes_may_use counter if we already reserved the extent and an error
happened before creating the ordered extent. We are already correctly
freeing the reserved extent if an error happens, so there's no additional
measure needed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26 16:27:22 +01:00
Qu Wenruo
524272607e btrfs: Handle delalloc error correctly to avoid ordered extent hang
[BUG]
If run_delalloc_range() returns error and there is already some ordered
extents created, btrfs will be hanged with the following backtrace:

Call Trace:
 __schedule+0x2d4/0xae0
 schedule+0x3d/0x90
 btrfs_start_ordered_extent+0x160/0x200 [btrfs]
 ? wake_atomic_t_function+0x60/0x60
 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
 btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
 btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
 process_one_work+0x2af/0x720
 ? process_one_work+0x22b/0x720
 worker_thread+0x4b/0x4f0
 kthread+0x10f/0x150
 ? process_one_work+0x720/0x720
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x2e/0x40

[CAUSE]

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|                       |<---------- cleanup range --------->|
 ||
 \_=> First page handled by end_extent_writepage() in __extent_writepage()

The problem is caused by error handler of run_delalloc_range(), which
doesn't handle any created ordered extents, leaving them waiting on
btrfs_finish_ordered_io() to finish.

However after run_delalloc_range() returns error, __extent_writepage()
won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
except the first page, and btrfs_finish_ordered_io() won't be triggered
for created ordered extents either.

So OE 2~n will hang forever, and if OE 1 is larger than one page, it
will also hang.

[FIX]
Introduce btrfs_cleanup_ordered_extents() function to cleanup created
ordered extents and finish them manually.

The function is based on existing
btrfs_endio_direct_write_update_ordered() function, and modify it to
act just like btrfs_writepage_endio_hook() but handles specified range
other than one page.

After fix, delalloc error will be handled like:

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|<--------  ----------->|<------ old error handler --------->|
 ||          ||
 ||          \_=> Cleaned up by cleanup_ordered_extents()
 \_=> First page handled by end_extent_writepage() in __extent_writepage()

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26 16:27:21 +01:00
Qu Wenruo
4dbd80fb91 btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
[BUG]
When btrfs_reloc_clone_csum() reports error, it can underflow metadata
and leads to kernel assertion on outstanding extents in
run_delalloc_nocow() and cow_file_range().

 BTRFS info (device vdb5): relocating block group 12582912 flags data
 BTRFS info (device vdb5): found 1 extents
 assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858

Currently, due to another bug blocking ordered extents, the bug is only
reproducible under certain block group layout and using error injection.

a) Create one data block group with one 4K extent in it.
   To avoid the bug that hangs btrfs due to ordered extent which never
   finishes
b) Make btrfs_reloc_clone_csum() always fail
c) Relocate that block group

[CAUSE]
run_delalloc_nocow() and cow_file_range() handles error from
btrfs_reloc_clone_csum() wrongly:

(The ascii chart shows a more generic case of this bug other than the
bug mentioned above)

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
                    |<----------- cleanup range --------------->|
|<-----------  ----------->|
             \/
 btrfs_finish_ordered_io() range

So error handler, which calls extent_clear_unlock_delalloc() with
EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
will both cover OE n, and free its metadata, causing metadata under flow.

[Fix]
The fix is to ensure after calling btrfs_add_ordered_extent(), we only
call error handler after increasing the iteration offset, so that
cleanup range won't cover any created ordered extent.

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<-----------  ----------->|<---------- cleanup range --------->|
             \/
 btrfs_finish_ordered_io() range

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26 16:27:21 +01:00
Amir Goldstein
4a99f3c83d ovl: do not set overlay.opaque on non-dir create
The optimization for opaque dir create was wrongly being applied
also to non-dir create.

Fixes: 97c684cc91 ("ovl: create directories inside merged parent opaque")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.10
2017-04-26 14:33:44 +02:00
Colin Ian King
e56efe9322 lockd: remove redundant check on block
A null check followed by a return is being performed already, so block
is always non-null at the second check on block, hence we can remove
this redundant null-check (Detected by PVS-Studio).  Also re-work
comment to clean up a check-patch warning.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:56 -04:00
NeilBrown
99bbf6ecc6 NFS: don't try to cross a mountpount when there isn't one there.
consider the sequence of commands:
 mkdir -p /import/nfs /import/bind /import/etc
 mount --bind / /import/bind
 mount --make-private /import/bind
 mount --bind /import/etc /import/bind/etc

 exportfs -o rw,no_root_squash,crossmnt,async,no_subtree_check localhost:/
 mount -o vers=4 localhost:/ /import/nfs
 ls -l /import/nfs/etc

You would not expect this to report a stale file handle.
Yet it does.

The manipulations under /import/bind cause the dentry for
/etc to get the DCACHE_MOUNTED flag set, even though nothing
is mounted on /etc.  This causes nfsd to call
nfsd_cross_mnt() even though there is no mountpoint.  So an
upcall to mountd for "/etc" is performed.

The 'crossmnt' flag on the export of / causes mountd to
report that /etc is exported as it is a descendant of /.  It
assumes the kernel wouldn't ask about something that wasn't
a mountpoint.  The filehandle returned identifies the
filesystem and the inode number of /etc.

When this filehandle is presented to rpc.mountd, via
"nfsd.fh", the inode cannot be found associated with any
name in /etc/exports, or with any mountpoint listed by
getmntent().  So rpc.mountd says the filehandle doesn't
exist. Hence ESTALE.

This is fixed by teaching nfsd not to trust DCACHE_MOUNTED
too much.  It is just a hint, not a guarantee.
Change nfsd_mountpoint() to return '1' for a certain mountpoint,
'2' for a possible mountpoint, and 0 otherwise.

Then change nfsd_crossmnt() to check if follow_down()
actually found a mountpount and, if not, to avoid performing
a lookup if the location is not known to certainly require
an export-point.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
NeilBrown
2f10fdcb6a nfsd4: remove pointless strdup_if_nonnull
kstrdup() already checks for NULL.

(Brought to our attention by Jason Yann noticing (from sparse output)
that it should have been declared static.)

Signed-off-by: NeilBrown <neilb@suse.com>
Reported-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
J. Bruce Fields
51f5677777 nfsd: check for oversized NFSv2/v3 arguments
A client can append random data to the end of an NFSv2 or NFSv3 RPC call
without our complaining; we'll just stop parsing at the end of the
expected data and ignore the rest.

Encoded arguments and replies are stored together in an array of pages,
and if a call is too large it could leave inadequate space for the
reply.  This is normally OK because NFS RPC's typically have either
short arguments and long replies (like READ) or long arguments and short
replies (like WRITE).  But a client that sends an incorrectly long reply
can violate those assumptions.  This was observed to cause crashes.

So, insist that the argument not be any longer than we expect.

Also, several operations increment rq_next_page in the decode routine
before checking the argument size, which can leave rq_next_page pointing
well past the end of the page array, causing trouble later in
svc_free_pages.

As followup we may also want to rewrite the encoding routines to check
more carefully that they aren't running off the end of the page array.

Reported-by: Tuomas Haanpää <thaan@synopsys.com>
Reported-by: Ari Kauppi <ari@synopsys.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:53 -04:00
Chao Yu
d618ebaf0a f2fs: enable small discard by default
This patch start to enable 4K granularity small discard by default
when realtime discard is on, so, in seriously fragmented space,
small size discard can be issued in time to avoid useless storage
space occupying of invalid filesystem's data, then performance of
flash storage can be recovered.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25 14:18:45 -07:00
Chao Yu
34e159da41 f2fs: delay awaking discard thread
It's better to delay awaking discard thread while queuing discard commands
in checkpoint, it will help to give more chances for merging big and small
discard.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25 14:18:44 -07:00
Yunlei He
66a82d1fc7 f2fs: seperate read nat page from nat_tree_lock
This patch seperate nat page read io from nat_tree_lock.

-lock_page
	-get_node_info()
		-current_nat_addr

			......           	->       write_checkpoint

			-get_meta_page

Because we lock node page, we can make sure no other threads
modify this nid concurrently. So we just obtain current_nat_addr
under nat_tree_lock, node info is always same in both nat pack.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25 14:16:39 -07:00
Sheng Yong
d3bb910c15 f2fs: fix multiple f2fs_add_link() having same name for inline dentry
Commit 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having
same name) does not cover the scenario where inline dentry is enabled.
In that case, F2FS_I(dir)->task will be NULL, and __f2fs_add_link will
lookup dentries one more time.

This patch fixes it by moving the assigment of current task to a upper
level to cover both normal and inline dentry.

Cc: <stable@vger.kernel.org>
Fixes: 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having same name)
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25 14:16:31 -07:00
J. Bruce Fields
13bf9fbff0 nfsd: stricter decoding of write-like NFSv2/v3 ops
The NFSv2/v3 code does not systematically check whether we decode past
the end of the buffer.  This generally appears to be harmless, but there
are a few places where we do arithmetic on the pointers involved and
don't account for the possibility that a length could be negative.  Add
checks to catch these.

Reported-by: Tuomas Haanpää <thaan@synopsys.com>
Reported-by: Ari Kauppi <ari@synopsys.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 16:36:23 -04:00
J. Bruce Fields
db44bac41b nfsd4: minor NFSv2/v3 write decoding cleanup
Use a couple shortcuts that will simplify a following bugfix.

Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 16:36:16 -04:00
J. Bruce Fields
e6838a29ec nfsd: check for oversized NFSv2/v3 arguments
A client can append random data to the end of an NFSv2 or NFSv3 RPC call
without our complaining; we'll just stop parsing at the end of the
expected data and ignore the rest.

Encoded arguments and replies are stored together in an array of pages,
and if a call is too large it could leave inadequate space for the
reply.  This is normally OK because NFS RPC's typically have either
short arguments and long replies (like READ) or long arguments and short
replies (like WRITE).  But a client that sends an incorrectly long reply
can violate those assumptions.  This was observed to cause crashes.

Also, several operations increment rq_next_page in the decode routine
before checking the argument size, which can leave rq_next_page pointing
well past the end of the page array, causing trouble later in
svc_free_pages.

So, following a suggestion from Neil Brown, add a central check to
enforce our expectation that no NFSv2/v3 call has both a large call and
a large reply.

As followup we may also want to rewrite the encoding routines to check
more carefully that they aren't running off the end of the page array.

We may also consider rejecting calls that have any extra garbage
appended.  That would be safer, and within our rights by spec, but given
the age of our server and the NFS protocol, and the fact that we've
never enforced this before, we may need to balance that against the
possibility of breaking some oddball client.

Reported-by: Tuomas Haanpää <thaan@synopsys.com>
Reported-by: Ari Kauppi <ari@synopsys.com>
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 16:34:37 -04:00
Trond Myklebust
bb3393d5b3 NFSv3: nfs3_nlm_alloc_call should be declared static
Fix compiler warnings.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-25 16:25:06 -04:00
Dan Williams
d4b29fd78e block: remove block_device_operations ->direct_access()
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25 13:20:46 -07:00
Dan Williams
2093f2e9df block, dax: convert bdev_dax_supported() to dax_direct_access()
Kill of the final user of bdev_direct_access() and struct blk_dax_ctl.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25 13:20:46 -07:00
Dan Williams
cccbce6715 filesystem-dax: convert to dax_direct_access()
Now that a dax_device is plumbed through all dax-capable drivers we can
switch from block_device_operations to dax_operations for invoking
->direct_access.

This also lets us kill off some usages of struct blk_dax_ctl on the way
to its eventual removal.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25 13:20:46 -07:00
Dan Williams
a41fe02b6b Revert "block: use DAX for partition table reads"
commit d1a5f2b4d8 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.

Now that we are moving ->direct_access() from a block_device operation
to a dax_inode operation we would need block devices to map and carry
their own dax_inode reference.

Unless / until we decide to revive dax mapping of raw block devices
through the dax_inode scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
2237570168 ("block_dev: remove DAX leftovers").

Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25 13:20:46 -07:00
Dan Williams
fa5d932c32 ext2, ext4, xfs: retrieve dax_device for iomap operations
In preparation for converting fs/dax.c to use dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the
dax_device associated with a given block_device.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25 13:20:46 -07:00
Trond Myklebust
a6598813a4 NFS: Don't write back further requests if there is a pending write error
If the server has already returned a fatal write error that the user
has not yet received on this file, then don't write back the other pages.
Instead, act as if they have been sent, and have returned with the same
error.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-25 15:42:34 -04:00
Trond Myklebust
6aeafd05ec pNFS: Fix use after free issues in pnfs_do_read()
The assumption should be that if the caller returns PNFS_ATTEMPTED, then hdr
has been consumed, and so we should not be testing hdr->task.tk_status.
If the caller returns PNFS_TRY_AGAIN, then we need to recoalesce and
free hdr.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-25 15:42:34 -04:00
Yan, Zheng
8179a101eb ceph: fix recursion between ceph_set_acl() and __ceph_setattr()
ceph_set_acl() calls __ceph_setattr() if the setacl operation needs
to modify inode's i_mode. __ceph_setattr() updates inode's i_mode,
then calls posix_acl_chmod().

The problem is that __ceph_setattr() calls posix_acl_chmod() before
sending the setattr request. The get_acl() call in posix_acl_chmod()
can trigger a getxattr request. The reply of the getxattr request
can restore inode's i_mode to its old value. The set_acl() call in
posix_acl_chmod() sees old value of inode's i_mode, so it calls
__ceph_setattr() again.

Cc: stable@vger.kernel.org # needs backporting for < 4.9
Link: http://tracker.ceph.com/issues/19688
Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-04-25 21:08:26 +02:00
Darrick J. Wong
c4cf1acdb1 xfs: better log intent item refcount checking
Use ASSERTs on the log intent item refcounts so that we fail noisily if
anyone tries to double-free the item.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-04-25 09:40:42 -07:00
Brian Foster
20e8a06378 xfs: fix up quotacheck buffer list error handling
The quotacheck error handling of the delwri buffer list assumes the
resident buffers are locked and doesn't clear the _XBF_DELWRI_Q flag
on the buffers that are dequeued. This can lead to assert failures
on buffer release and possibly other locking problems.

Move this code to a delwri queue cancel helper function to
encapsulate the logic required to properly release buffers from a
delwri queue. Update the helper to clear the delwri queue flag and
call it from quotacheck.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:42 -07:00
Christoph Hellwig
27af1bbf52 xfs: remove xfs_trans_ail_delete_bulk
xfs_iflush_done uses an on-stack variable length array to pass the log
items to be deleted to xfs_trans_ail_delete_bulk.  On-stack VLAs are a
nasty gcc extension that can lead to unbounded stack allocations, but
fortunately we can easily avoid them by simply open coding
xfs_trans_ail_delete_bulk in xfs_iflush_done, which is the only caller
of it except for the single-item xfs_trans_ail_delete.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:42 -07:00
Christoph Hellwig
3f88a15ae0 xfs: don't use bool values in trace buffers
Using bool values produces sparse warnings of this form:

fs/xfs/./xfs_trace.h:2252:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
fs/xfs/./xfs_trace.h:2252:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
fs/xfs/./xfs_trace.h:2278:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
fs/xfs/./xfs_trace.h:2278:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
fs/xfs/./xfs_trace.h:2307:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)

Just use a char instead to fix those up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:42 -07:00
Darrick J. Wong
12e4a381c5 xfs: fix getfsmap userspace memory corruption while setting OF_LAST
At the end of a getfsmap call, we will set FMR_OF_LAST in the last
struct fsmap that was handed in by userspace if we've truly run out of
space mapping record (as opposed to simply running out of space in the
user array).  Unfortunately, fmh_entries is the wrong check for whether
or not we've filled out anything in the user array because the ioctl
provides that fmh_count==0 sets fmh_entries without filling out the user
array.  Therefore we end up writing things into user memory areas that we
weren't given, and kaboom.

Since Christoph amended the getfsmap structure to track the number of
fsmap entries we've actually filled out, use that as part of deciding if
we have to set the OF_LAST flag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-04-25 09:40:42 -07:00
Christoph Hellwig
9d17e14cc0 xfs: fix __user annotations for xfs_ioc_getfsmap
By passing the whole fsmap_head structure and an index we can get the
user point annotations right for the embedded variable sized array
in struct fsmap_head.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: change idx to unsigned int]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:42 -07:00
Christoph Hellwig
e2a641922a xfs: corruption needs to respect endianess too!
At least if we want to be able to recognize the pattern.  Add a missing
byte swap to the corruption injection case in xlog_sync.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:42 -07:00
Christoph Hellwig
ef2b67ecf8 xfs: use NULL instead of 0 to initialize a pointer in xfs_ioc_getfsmap
Found by sparse.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Christoph Hellwig
fad5656b22 xfs: use NULL instead of 0 to initialize a pointer in xfs_getfsmap
Found by sparse.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Christoph Hellwig
0c1d9e4a61 xfs: simplify validation of the unwritten extent bit
XFS only supports the unwritten extent bit in the data fork, and only if
the file system has a version 5 superblock or the unwritten extent
feature bit.

We currently have two routines that validate the invariant:
xfs_check_nostate_extents which return -EFSCORRUPTED when it's not met,
and xfs_validate_extent that triggers and assert in debug build.

Both of them iterate over all extents of an inode fork when called,
which isn't very efficient.

This patch instead adds a new helper that verifies the invariant one
extent at a time, and calls it from the places where we iterate over
all extents to converted them from or two the in-memory format.  The
callers then return -EFSCORRUPTED when reading invalid extents from
disk, or trigger an assert when writing them to disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Christoph Hellwig
37f7f9bbf3 xfs: remove unused values from xfs_exntst_t
We only ever use the normal and unwritten states.  And the actual
ondisk format (this enum isn't despite being in xfs_format.h) only
has space for the unwritten bit anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Christoph Hellwig
895e9bfc9e xfs: remove the unused XFS_MAXLINK_1 define
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Eric Sandeen
4f1adf3373 xfs: more do_div cleanups
On some architectures do_div does the pointer compare
trick to make sure that we've sent it an unsigned 64-bit
number.  (Why unsigned?  I don't know.)

Fix up the few places that squawk about this; in
xfs_bmap_wants_extents() we just used a bare int64_t so change
that to unsigned.

In xfs_adjust_extent_unmap_boundaries() all we wanted was the
mod, and we have an xfs-specific function to handle that w/o
side effects, which includes proper casting for do_div.

In xfs_daddr_to_ag[b]no, we were using the wrong type anyway;
XFS_BB_TO_FSBT returns a block in the filesystem, so use
xfs_rfsblock_t not xfs_daddr_t, and gain the unsignedness
from that type as a bonus.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Eric Sandeen
90115407c5 xfs: remove use of do_div with 32-bit dividend in quota
The kbuild test robot caught this; in debug code we have another
caller of do_div with a 32-bit dividend (j) which is caught now
that we are using the kernel-supplied do_div.

None of the values used here are 64-bit; just use simple division.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Hou Tao
42bf9dba40 xfs: remove the trailing newline used in the fmt parameter of TP_printk
The trailing newlines wil lead to extra newlines in the trace file
which looks like the following output, so remove them.
>kworker/4:1H-1508  [004] .... 47879.101608: xfs_discard_extent: dev 8:0
>
>kworker/u16:2-238  [004] .... 47879.101725: xfs_extent_busy_clear: dev 8:0

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix the getfsmap tracepoints too]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Brian Foster
cb52ee334a xfs: prevent multi-fsb dir readahead from reading random blocks
Directory block readahead uses a complex iteration mechanism to map
between high-level directory blocks and underlying physical extents.
This mechanism attempts to traverse the higher-level dir blocks in a
manner that handles multi-fsb directory blocks and simultaneously
maintains a reference to the corresponding physical blocks.

This logic doesn't handle certain (discontiguous) physical extent
layouts correctly with multi-fsb directory blocks. For example,
consider the case of a 4k FSB filesystem with a 2 FSB (8k) directory
block size and a directory with the following extent layout:

 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
   0: [0..7]:          88..95            0 (88..95)             8
   1: [8..15]:         80..87            0 (80..87)             8
   2: [16..39]:        168..191          0 (168..191)          24
   3: [40..63]:        5242952..5242975  1 (72..95)            24

Directory block 0 spans physical extents 0 and 1, dirblk 1 lies
entirely within extent 2 and dirblk 2 spans extents 2 and 3. Because
extent 2 is larger than the directory block size, the readahead code
erroneously assumes the block is contiguous and issues a readahead
based on the physical mapping of the first fsb of the dirblk. This
results in read verifier failure and a spurious corruption or crc
failure, depending on the filesystem format.

Further, the subsequent readahead code responsible for walking
through the physical table doesn't correctly advance the physical
block reference for dirblk 2. Instead of advancing two physical
filesystem blocks, the first iteration of the loop advances 1 block
(correctly), but the subsequent iteration advances 2 more physical
blocks because the next physical extent (extent 3, above) happens to
cover more than dirblk 2. At this point, the higher-level directory
block walking is completely off the rails of the actual physical
layout of the directory for the respective mapping table.

Update the contiguous dirblock logic to consider the current offset
in the physical extent to avoid issuing directory readahead to
unrelated blocks. Also, update the mapping table advancing code to
consider the current offset within the current dirblock to avoid
advancing the mapping reference too far beyond the dirblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Eric Sandeen
023cc840b4 xfs: handle array index overrun in xfs_dir2_leaf_readbuf()
Carlos had a case where "find" seemed to start spinning
forever and never return.

This was on a filesystem with non-default multi-fsb (8k)
directory blocks, and a fragmented directory with extents
like this:

0:[0,133646,2,0]
1:[2,195888,1,0]
2:[3,195890,1,0]
3:[4,195892,1,0]
4:[5,195894,1,0]
5:[6,195896,1,0]
6:[7,195898,1,0]
7:[8,195900,1,0]
8:[9,195902,1,0]
9:[10,195908,1,0]
10:[11,195910,1,0]
11:[12,195912,1,0]
12:[13,195914,1,0]
...

i.e. the first extent is a contiguous 2-fsb dir block, but
after that it is fragmented into 1 block extents.

At the top of the readdir path, we allocate a mapping array
which (for this filesystem geometry) can hold 10 extents; see
the assignment to map_info->map_size.  During readdir, we are
therefore able to map extents 0 through 9 above into the array
for readahead purposes.  If we count by 2, we see that the last
mapped index (9) is the first block of a 2-fsb directory block.

At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill
more readahead; the outer loop assumes one full dir block is
processed each loop iteration, and an inner loop that ensures
that this is so by advancing to the next extent until a full
directory block is mapped.

The problem is that this inner loop may step past the last
extent in the mapping array as it tries to reach the end of
the directory block.  This will read garbage for the extent
length, and as a result the loop control variable 'j' may
become corrupted and never fail the loop conditional.

The number of valid mappings we have in our array is stored
in map->map_valid, so stop this inner loop based on that limit.

There is an ASSERT at the top of the outer loop for this
same condition, but we never made it out of the inner loop,
so the ASSERT never fired.

Huge appreciation for Carlos for debugging and isolating
the problem.

Debugged-and-analyzed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Chandan Rajendra
a008c31c7e iomap_dio_rw: Prevent reading file data beyond iomap_dio->i_size
On a ppc64 machine executing overlayfs/019 with xfs as the lower and
upper filesystem causes the following call trace,

WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420
Modules linked in:
CPU: 2 PID: 8034 Comm: fsstress Tainted: G             L  4.11.0-rc5-next-20170405 #100
task: c000000631314880 task.stack: c0000003915d4000
NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660
REGS: c0000003915d7570 TRAP: 0700   Tainted: G             L   (4.11.0-rc5-next-20170405)
MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
  CR: 24004284  XER: 00000000
CFAR: c0000000006f7190 SOFTE: 1
GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600
GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960
GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0
GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff
GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8
GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff
GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000
GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff
NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420
LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420
Call Trace:
[c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable)
[c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0
[c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420
[c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160
[c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130
[c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0
[c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0
[c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100
[c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc
Instruction dump:
78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288
2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378

The above problem can also be recreated on a regular xfs filesystem
using the command,

$ fsstress -d /mnt -l 1000 -n 1000 -p 1000

The reason for the call trace is,
1. When 'reserving' blocks for delayed allocation , XFS reserves more
   blocks (i.e. past file's current EOF) than required. This is done
   because XFS assumes that userspace might write more data and hence
   'reserving' more blocks might lead to the file's new data being
   stored contiguously on disk.
2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would
   then cover the prealloc-ed EOF blocks in addition to the regular blocks.
3. When flushing the dirty blocks to disk, we only flush data till the
   file's EOF. But before writing out the dirty data, we allocate blocks
   on the disk for holding the file's new data. This allocation includes
   the blocks that are part of the 'prealloc EOF blocks'.
4. Later, when the last reference to the inode is being closed, XFS frees the
   unused 'prealloc EOF blocks' in xfs_inactive().

In step 3 above, When allocating space on disk for the delayed allocation
range, the space allocator might sometimes allocate less blocks than
required. If such an allocation ends right at the current EOF of the
file, We will not be able to clear the "delayed allocation" flag for the
'prealloc EOF blocks', since we won't have dirty buffer heads associated
with that range of the file.

In such a situation if a Direct I/O read operation is performed on file
range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the
range [X, Y] and invalidate page cache for that range (Refer to
iomap_dio_rw()). Later for performing the Direct I/O read, XFS obtains
the extent items (which are still cached in memory) for the file
range. When doing so we are not supposed to get an extent item with
IOMAP_DELALLOC flag set, since the previous "flush" operation should
have converted any delayed allocation data in the range [X, Y]. Hence we
end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor().

This commit fixes the bug by preventing the read operation from going
beyond iomap_dio->i_size.

Reported-by: Santhosh G <santhog4@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Christoph Hellwig
7590632a33 xfs: remove bmap block allocation retries
Now that reflink operations don't set the firstblock value we don't
need the workarounds for non-NULL firstblock values without a prior
allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Christoph Hellwig
bf8eadbacb xfs: remove xfs_bmap_remap_alloc
The main thing that xfs_bmap_remap_alloc does is fixing the AGFL, similar
to what we do in the space allocator.  But the reflink code doesn't touch
the allocation btree unlike the normal space allocator, so we couldn't
care less about the state of the AGFL.

So remove xfs_bmap_remap_alloc and just handle the di_nblocks update in
the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Christoph Hellwig
6ebd5a4413 xfs: introduce xfs_bmapi_remap
Add a new helper to be used for reflink extent list additions instead of
funneling them through xfs_bmapi_write and overloading the firstblock
member in struct xfs_bmalloca and struct xfs_alloc_args.

With some small changes to xfs_bmap_remap_alloc this also means we do
not need a xfs_bmalloca structure for this case at all.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:39 -07:00
Christoph Hellwig
6d04558f9f xfs: pass individual arguments to xfs_bmap_add_extent_hole_real
For the reflink case we'd much rather pass the required arguments than
faking up a struct xfs_bmalloca.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:39 -07:00
Christoph Hellwig
39e07daa46 xfs: remove attr fork handling in xfs_bmap_finish_one
We never do COW operations for the attr fork, so don't pretend we handle
them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:39 -07:00
Christoph Hellwig
52813fb13f xfs: fix integer truncation in xfs_bmap_remap_alloc
bno should be a xfs_fsblock_t, which is 64-bit wides instead of a
xfs_aglock_t, which truncates the value to 32 bits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:39 -07:00
Trond Myklebust
b3230e80a6 pNFS: Ensure we check layout segment validity in the pg_init() callback
If we have a layout segment cached in pgio->pg_lseg, we should check it
for validity before reusing it in a new RPC request. Otherwise, if we
recoalesce, we can end up looping forever.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-25 10:56:19 -04:00
Amir Goldstein
4ff33aafd3 fanotify: don't expose EOPENSTALE to userspace
When delivering an event to userspace for a file on an NFS share,
if the file is deleted on server side before user reads the event,
user will not get the event.

If the event queue contained several events, the stale event is
quietly dropped and read() returns to user with events read so far
in the buffer.

If the event queue contains a single stale event or if the stale
event is a permission event, read() returns to user with the kernel
internal error code 518 (EOPENSTALE), which is not a POSIX error code.

Check the internal return value -EOPENSTALE in fanotify_read(), just
the same as it is checked in path_openat() and drop the event in the
cases that it is not already dropped.

This is a reproducer from Marko Rauhamaa:

Just take the example program listed under "man fanotify" ("fantest")
and follow these steps:

    ==============================================================
    NFS Server    NFS Client(1)     NFS Client(2)
    ==============================================================
    # echo foo >/nfsshare/bar.txt
                  # cat /nfsshare/bar.txt
                  foo
                                    # ./fantest /nfsshare
                                    Press enter key to terminate.
                                    Listening for events.
    # rm -f /nfsshare/bar.txt
                  # cat /nfsshare/bar.txt
                                    read: Unknown error 518
                  cat: /nfsshare/bar.txt: Operation not permitted
    ==============================================================

where NFS Client (1) and (2) are two terminal sessions on a single NFS
Client machine.

Reported-by: Marko Rauhamaa <marko.rauhamaa@f-secure.com>
Tested-by: Marko Rauhamaa <marko.rauhamaa@f-secure.com>
Cc: <linux-api@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-25 15:48:06 +02:00
Hou Pengyang
4086d3f61b f2fs: skip encrypted inode in ASYNC IPU policy
Async request may be throttled in block layer, so page for async may keep WRITE_BACK
for a long time.

For encrytped inode, we need wait on page writeback no matter if the device supports
BDI_CAP_STABLE_WRITES. This may result in a higher waiting page writeback time for
async encrypted inode page.

This patch skips IPU for encrypted inode's updating write.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 13:13:24 -07:00
Jaegeuk Kim
a788189305 f2fs: fix out-of free segments
This patch also reverts d0db7703ac ("f2fs: do SSR in higher priority").

This patch fixes out of free segments caused by many small file creation by
1) mkfs -s 1 2G
2) mount
3) untar
 - preoduce 60000 small files burstly
4) sync
 - flush node pages
 - flush imeta

Here, when we do f2fs_balance_fs, we missed # of imeta blocks, resulting in
skipping to check has_not_enough_free_secs.

Another test is done by
1) mkfs -s 12 2G
2) mount
3) untar
 - preoduce 60000 small files burstly
4) sync
 - flush node pages
 - flush imeta

In this case, this patch also fixes wrong block allocation under large section
size.

Reported-by: William Brana <wbrana@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 13:13:23 -07:00
Arnd Bergmann
d66450e773 f2fs: improve definition of statistic macros
With a recent addition of f2fs_lookup_extent_tree(), we get a warning about
the use of empty macros:

fs/f2fs/extent_cache.c: In function 'f2fs_lookup_extent_tree':
fs/f2fs/extent_cache.c:358:32: error: suggest braces around empty body in an 'else' statement [-Werror=empty-body]
   stat_inc_rbtree_node_hit(sbi);

A good way to avoid the warning and make the code more robust is to define
all no-op macros as 'do { } while (0)'.

Fixes: 54c2258cd6 ("f2fs: extract rb-tree operation infrastructure")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reivewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 13:13:22 -07:00
Jaegeuk Kim
d579324998 f2fs: assign allocation hint for warm/cold data
This patch gives slower device region to warm/cold data area more eagerly.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 13:06:53 -07:00
Jaegeuk Kim
d07efb5077 f2fs: fix _IOW usage
This patch fixes wrong _IOW usage.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 12:55:45 -07:00
Jaegeuk Kim
e066b83c9b f2fs: add ioctl to flush data from faster device to cold area
This patch adds an ioctl to flush data in faster device to cold area. User can
give device number and number of segments to move. It doesn't move it if there
is only one device.

The parameter looks like:

struct f2fs_flush_device {
	u32 dev_num;		/* device number to flush */
	u32 segments;		/* # of segments to flush */
};

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 12:55:41 -07:00
Jan Kara
61a929870d ext4: Improve comments in ext4_quota_{on|off}()
Improve comments in ext4_quota_{on|off}() to explain that returning
success despite ext4_journal_start() failing is deliberate.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-24 16:49:16 +02:00
Dan Carpenter
f4edce1afd fsnotify: remove a stray unlock
We recently shifted this code around, so we're no longer holding the
lock on this path.

Fixes: 755b5bc681 ("fsnotify: Remove indirection from mark list addition")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-24 16:41:28 +02:00
Fabian Frederick
5c26eac43a udf: use kmap_atomic for memcpy copying
Use temporary mapping for memory copying operations.

To avoid any sleeping problem,

mark_inode_dirty(inode) was moved after kunmap() in
udf_adinicb_readpage()

down_write(&iinfo->i_data_sem) set before kmap_atomic()
in udf_expand_file_adinicb()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-24 16:28:02 +02:00
Fabian Frederick
6ff6b2b329 udf: use octal for permissions
According to commit f90774e1fd ("checkpatch: look for symbolic
permissions and suggest octal instead")

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-24 16:27:52 +02:00
Linus Torvalds
9ea33c44fb This pull request contains fixes for issues in both UBI and UBIFS:
- More O_TMPFILE fallout
 - RENAME_WHITEOUT regression due to a mis-merge
 - Memory leak in ubifs_mknod()
 - Power-cut problem in UBI's update volume feature
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJY/HTyAAoJEEtJtSqsAOnW7bYP/jYNCRKUls/FhWcF02h/tuPz
 gYemObWxoEvPqy8laNwyYEkzct4ADsz4C2RCGxygJks2qYCzuXy+IQd/gQlJNIhM
 B9sNa1se1ntt71vwdwgRUgVRMGjknJ6rY8ZP2actuuzztdJE9fGNKhEarMVyQQ61
 f7+AvmFc0CHTEBi5hcHX/Rtdqvzt4OOJl5pLo+VT2RRegxb3oHgl1UTD+gqkdkMg
 WOxvdriSQPi8HHz1clCN36lG2H05sKLd7YEhSWKf6Pt0YYJmF8ckxLERzRclpV4D
 XXqUEwhrovJC+oRQbXS+DaJfjsML9y8GB28izC5c+H8/teGKv2n5NXtjsyWNCfnx
 gmZT1S24TnTsofbmd1GoemMMD4Vlt9EDq2+QVFCaPgK9eWkDQ/xlhjrfPMsuz+YY
 uCWgyzsJyvRewiOlkN1jcTua6f1nHt0SP7alNy0XUQF08jwYMaaf5IJAbEHP8Ink
 seIXNObiGgHl+bBsTQ4SU9PUIJIKIar0TSQf8HXV+/Go3XDz4ZTSjc1KVXs5Hg97
 irUPllIKmjb5Doiln12A5zePmgT2uaChFG21+6SwEWkFYrD3XIxBORxkEF6E7FJU
 asRk2oBzNYezz+qpjn3NeZnIV7Uc9B8bBZpqFrsys2KRAJIZRtdhpAF+jIbrS+7N
 JrakPr2I/FKnsNU/0mZh
 =ER2q
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.11-rc7' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS fixes from Richard Weinberger:
 "This contains fixes for issues in both UBI and UBIFS:

   - more O_TMPFILE fallout

   - RENAME_WHITEOUT regression due to a mis-merge

   - memory leak in ubifs_mknod()

   - power-cut problem in UBI's update volume feature"

* tag 'upstream-4.11-rc7' of git://git.infradead.org/linux-ubifs:
  ubifs: Fix O_TMPFILE corner case in ubifs_link()
  ubifs: Fix RENAME_WHITEOUT support
  ubifs: Fix debug messages for an invalid filename in ubifs_dump_inode
  ubifs: Fix debug messages for an invalid filename in ubifs_dump_node
  ubifs: Remove filename from debug messages in ubifs_readdir
  ubifs: Fix memory leak in error path in ubifs_mknod
  ubi/upd: Always flush after prepared for an update
2017-04-23 16:49:16 -07:00
Ingo Molnar
58d30c36d4 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Documentation updates.

 - Miscellaneous fixes.

 - Parallelize SRCU callback handling (plus overlapping patches).

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-23 11:12:44 +02:00
David S. Miller
fb796707d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Both conflict were simple overlapping changes.

In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the
conversion of the driver in net-next to use in-netdev stats.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 20:23:53 -07:00
Linus Torvalds
94836ecf1e Fix a 4.11 regression that triggers a BUG() on an attempt to use an
unsupported NFSv4 compound op.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJY+mlFAAoJECebzXlCjuG+eCMP/2r/lXdx91Kibg4/D/9OB+LW
 YmcyUH83iRq/F/V88yEMz5XMUTZ/UkZ3iSGSDObjhBK7oswAtZhC1KiWOiuI9ry4
 tcqZM3se7FQNpbwm2wvhtRmbjd2YOZfS3o32Uorhc/beMZzIzZhUb15pp5o3hOIy
 4aoxeUs6//B51fil4Ur0oeLI1EqUd3t+1z0ytyRRzSfORHF/QEiTTXSmmKK+K2XA
 t35XXYJE3SCLxQpWW/CHT/IODv6nsaM0rgN/kkmZUxAyaw2GXT7U6+mzkz3tSL8s
 AEqNUptCn2M3iAv/e6wgpO3BS49HR+Gdg+JxPilCcs+GniWu4zH3sE0HFjTuqQ23
 7u/NB4lkDqFs9pO1x4D5Y/69yXbs6jUHn+NMnhTZS85pTwU5o3RNhzwegoOfIp/3
 CY22R92ZLHEqb1/VTc7rpeuj3d9r7Swg3U8X7kDnn7XtAj2HlMqaHZfw2zjTIYBO
 wqKnikTDMHuzZ7MtegLld4oayNZO4hN7Vi01Npzgr3nkwvpFpz8Jbvsn8mMHvUq8
 ZYWoMfz5rbH20N7cnFUAtB5fm67U9smjb01uzx3m6ylh1SqTWxAoZPWluGJTv0IA
 JKEM4g4PSNOizhib6KI2PW0SngLkCWKLy463fDS2Mx/M7DBZf0j4697TDK21R5NB
 Pny/pwJT4mjLwePc/qt2
 =mSBF
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.11-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfix from Bruce Fields:
 "Fix a 4.11 regression that triggers a BUG() on an attempt to use an
  unsupported NFSv4 compound op"

* tag 'nfsd-4.11-2' of git://linux-nfs.org/~bfields/linux:
  nfsd: fix oops on unsupported operation
2017-04-21 16:37:48 -07:00
Ilya Dryomov
19b7ccf865 block: get rid of blk_integrity_revalidate()
Commit 25520d55cd ("block: Inline blk_integrity in struct gendisk")
introduced blk_integrity_revalidate(), which seems to assume ownership
of the stable pages flag and unilaterally clears it if no blk_integrity
profile is registered:

    if (bi->profile)
            disk->queue->backing_dev_info->capabilities |=
                    BDI_CAP_STABLE_WRITES;
    else
            disk->queue->backing_dev_info->capabilities &=
                    ~BDI_CAP_STABLE_WRITES;

It's called from revalidate_disk() and rescan_partitions(), making it
impossible to enable stable pages for drivers that support partitions
and don't use blk_integrity: while the call in revalidate_disk() can be
trivially worked around (see zram, which doesn't support partitions and
hence gets away with zram_revalidate_disk()), rescan_partitions() can
be triggered from userspace at any time.  This breaks rbd, where the
ceph messenger is responsible for generating/verifying CRCs.

Since blk_integrity_{un,}register() "must" be used for (un)registering
the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
setting there.  This way drivers that call blk_integrity_register() and
use integrity infrastructure won't interfere with drivers that don't
but still want stable pages.

Fixes: 25520d55cd ("block: Inline blk_integrity in struct gendisk")
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.4+, needs backporting
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-21 14:17:27 -06:00
Al Viro
4f757f3cbf make sure that mntns_install() doesn't end up with referral for root
new flag: LOOKUP_DOWN.  If the starting point is overmounted, cross
into whatever's mounted on top, triggering referrals et.al.

Use that instead of follow_down_one() loop in mntns_install(), handle
errors properly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-21 14:05:36 -04:00
Al Viro
93893862fb path_init(): don't bother with checking MAY_EXEC for LOOKUP_ROOT
we'll hit that check in link_path_walk() anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-21 14:05:35 -04:00
Al Viro
159b095628 make sure that fchdir() won't accept referral points, etc.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-21 14:05:35 -04:00
Al Viro
c63ed807d1 orangefs: use iov_iter_revert()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-21 13:57:32 -04:00
Benjamin Coddington
f30cb757f6 NFS: Always wait for I/O completion before unlock
NFS attempts to wait for read and write completion before unlocking in
order to ensure that the data returned was protected by the lock.  When
this waiting is interrupted by a signal, the unlock may be skipped, and
messages similar to the following are seen in the kernel ring buffer:

[20.167876] Leaked locks on dev=0x0:0x2b ino=0x8dd4c3:
[20.168286] POSIX: fl_owner=ffff880078b06940 fl_flags=0x1 fl_type=0x0 fl_pid=20183
[20.168727] POSIX: fl_owner=ffff880078b06680 fl_flags=0x1 fl_type=0x0 fl_pid=20185

For NFSv3, the missing unlock will cause the server to refuse conflicting
locks indefinitely.  For NFSv4, the leftover lock will be removed by the
server after the lease timeout.

This patch fixes this issue by skipping the usual wait in
nfs_iocounter_wait if the FL_CLOSE flag is set when signaled.  Instead, the
wait happens in the unlock RPC task on the NFS UOC rpc_waitqueue.

For NFSv3, use lockd's new nlmclnt_operations along with
nfs_async_iocounter_wait to defer NLM's unlock task until the lock
context's iocounter reaches zero.

For NFSv4, call nfs_async_iocounter_wait() directly from unlock's
current rpc_call_prepare.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21 10:45:01 -04:00
Benjamin Coddington
b1ece737f4 lockd: Introduce nlmclnt_operations
NFS would enjoy the ability to modify the behavior of the NLM client's
unlock RPC task in order to delay the transmission of the unlock until IO
that was submitted under that lock has completed.  This ability can ensure
that the NLM client will always complete the transmission of an unlock even
if the waiting caller has been interrupted with fatal signal.

For this purpose, a pointer to a struct nlmclnt_operations can be assigned
in a nfs_module's nfs_rpc_ops that will install those nlmclnt_operations on
the nlm_host.  The struct nlmclnt_operations defines three callback
operations that will be used in a following patch:

nlmclnt_alloc_call - used to call back after a successful allocation of
	a struct nlm_rqst in nlmclnt_proc().

nlmclnt_unlock_prepare - used to call back during NLM unlock's
	rpc_call_prepare.  The NLM client defers calling rpc_call_start()
	until this callback returns false.

nlmclnt_release_call - used to call back when the NLM client's struct
	nlm_rqst is freed.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21 10:45:01 -04:00
Benjamin Coddington
7d6ddf88c4 NFS: Add an iocounter wait function for async RPC tasks
By sleeping on a new NFS Unlock-On-Close waitqueue, rpc tasks may wait for
a lock context's iocounter to reach zero.  The rpc waitqueue is only woken
when the open_context has the NFS_CONTEXT_UNLOCK flag set in order to
mitigate spurious wake-ups for any iocounter reaching zero.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21 10:45:01 -04:00
Benjamin Coddington
50f2112cf7 locks: Set FL_CLOSE when removing flock locks on close()
Set FL_CLOSE in fl_flags as in locks_remove_posix() when clearing locks.
NFS will check for this flag to ensure an unlock is sent in a following
patch.

Fuse handles flock and posix locks differently for FL_CLOSE, and so
requires a fixup to retain the existing behavior for flock.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21 10:45:01 -04:00