Null kernfs nodes could be found at cgroups during construction.
It seems safer to handle these null pointers right in kernfs in
the same way as printf prints "(null)" for null pointer string.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The handling of the might_cancel queueing is not properly protected, so
parallel operations on the file descriptor can race with each other and
lead to list corruptions or use after free.
Protect the context for these operations with a seperate lock.
The wait queue lock cannot be reused for this because that would create a
lock inversion scenario vs. the cancel lock. Replacing might_cancel with an
atomic (atomic_t or atomic bit) does not help either because it still can
race vs. the actual list operation.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "linux-fsdevel@vger.kernel.org"
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701311521430.3457@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Avoid using stripe_width for sbi->s_stripe value if it is not actually
set. It prevents using the stride for sbi->s_stripe.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When a filesystem is created using:
mkfs.ext4 -b 4096 -E stride=512 <dev>
and we try to allocate 64MB extent, we will end up directly in
ext4_mb_complex_scan_group(). This is because the request is detected
as power-of-two allocation (so we start in ext4_mb_regular_allocator()
with ac_criteria == 0) however the check before
ext4_mb_simple_scan_group() refuses the direct buddy scan because the
allocation request is too large. Since cr == 0, the check whether we
should use ext4_mb_scan_aligned() fails as well and we fall back to
ext4_mb_complex_scan_group().
Fix the problem by checking for upper limit on power-of-two requests
directly when detecting them.
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch incorrectly attempted nested mnt_want_write, and incorrectly
disabled nfsd's owner override for truncate. We'll fix those problems
and make another attempt soon, for the moment I think the safest is to
revert.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We'll OOPS in ramoops_get_next_prz() if the platform didn't ask for any
ftrace zones (i.e., cxt->fprzs will be NULL). Let's just skip this
entire FTRACE section if there's no 'fprzs'.
Regression seen on a coreboot/depthcharge-based Chromebook.
Fixes: 2fbea82bbb ("pstore: Merge per-CPU ftrace records into one")
Cc: Joel Fernandes <joelaf@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
The deamon through which the kernel module communicates with the userspace
part of Orangefs, the "client-core", sends initialization data to the
kernel module with ioctl. The initialization data was built by the
client-core in a 2k buffer and copy_from_user'd into a 1k buffer
in the kernel module. When more than 1k of initialization data needed
to be sent, some was lost, reducing the usability of the control by which
debug levels are set. This patch sets the kernel side buffer to 2K to
match the userspace side...
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This patch is simlar to one Dan Carpenter sent me, cleans
up some return codes and whitespace errors. There was one
place where he thought inserting an error message into
the ring buffer might be too chatty, I hope I convinced him
othewise. As a consolation <g> I changed a truly chatty
error message in another location into a debug message,
system-admins had already yelled at me about that one...
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Instead we submit the discard requests and use another workqueue to
release the extents from the extent busy list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Sort busy extents by the full block number instead of just the AGNO so
that we can issue consecutive discard requests that the block layer could
merge (although we'll need additional block layer fixes for fast devices).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Set the timeout for TCP connections to be 1 lease period to ensure
that we don't lose our lease due to a faulty TCP connection.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently we force the log and simply try again if we hit a busy extent,
but especially with online discard enabled it might take a while after
the log force for the busy extents to disappear, and we might have
already completed our second pass.
So instead we add a new waitqueue and a generation counter to the pag
structure so that we can do wakeups once we've removed busy extents,
and we replace the single retry with an unconditional one - after
all we hold the AGF buffer lock, so no other allocations or frees
can be racing with us in this AG.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We don't just need the structure to track busy extents which can be
avoided with a synchronous transaction, but also to keep track of
pending discard.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If pag cannot be allocated, the current error exit path will trip
a null pointer deference error when calling xfs_buf_hash_destroy
with a null pag. Fix this by adding a new error exit labels and
jumping to those accordingly, avoiding the hash destroy and
unnecessary kmem_free on pag.
Up to three things need to be properly unwound:
1) pag memory allocation
2) xfs_buf_hash_init
3) radix_tree_insert
For any given iteration through the loop, any of the above which
succeed must be unwound for /this/ pag, and then all prior
initialized pags must be unwound.
Addresses-Coverity-Id: 1397628 ("Dereference after null check")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We're changing both metadata and data, so we need to update the
timestamps for clone operations. Dedupe on the other hand does
not change file data, and only changes invisible metadata so the
timestamps should not be updated.
This follows existing btrfs behavior.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: remove redundant is_dedupe test]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we exit because the file access check failed, we currently
leak the struct nfs4_state. We need to attach it to the
open context before returning.
Fixes: 3efb972247 ("NFSv4: Refactor _nfs4_open_and_get_state..")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Record flush/channel/content entries is useless, remove them.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Since commit 4f52b6bb ("NFS: Don't call COMMIT in ->releasepage()"),
no tasks wait on PagePrivate, so the wake introduced in commit 95905446
("NFS: avoid deadlocks with loop-back mounted NFS filesystems.") can
be removed.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
An interrupted rename will leave the old dentry behind if the rename
succeeds. Fix this by moving the final local work of the rename to
rpc_call_done so that the results of the RENAME can always be handled,
even if the original process has already returned with -ERESTARTSYS.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Make sure all callers follow the same locking protocol, given that DAX
transparantly replaced the normal buffered I/O path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Unlike O_DIRECT DAX is not an optional opt-in feature selected by the
application, so we'll have to provide the traditional synchronіzation
of overlapping writes as we do for buffered writes.
This was broken historically for DAX, but got fixed for ext2 and XFS
as part of the iomap conversion. Fix up ext4 as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Commit 4c63c2454e incorrectly assumed that returning -ENOIOCTLCMD would
cause the native ioctl to be called. The ->compat_ioctl callback is
expected to handle all ioctls, not just compat variants. As a result,
when using 32-bit userspace on 64-bit kernels, everything except those
three ioctls would return -ENOTTY.
Fixes: 4c63c2454e ("btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl")
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The conflict was an interaction between a bug fix in the
netvsc driver in 'net' and an optimization of the RX path
in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 6326fec112 ("mm: Use owner_priv bit for PageSwapCache, valid
when PageSwapBacked") aliased PG_swapcache to PG_owner_priv_1 (and
depending on PageSwapBacked being true).
As a result, the KPF_SWAPCACHE bit in '/proc/kpageflags' should now be
synthesized, instead of being shown on unrelated pages which just happen
to have PG_owner_priv_1 set.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If overlay was mounted by root then quota set for upper layer does not work
because overlay now always use mounter's credentials for operations.
Also overlay might deplete reserved space and inodes in ext4.
This patch drops capability SYS_RESOURCE from saved credentials.
This affects creation new files, whiteouts, and copy-up operations.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 1175b6b8d9 ("ovl: do operations on underlying file system in mounter's context")
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
overlayfs syncs all inode pages on sync_filesystem(), but it also
needs to call s_op->sync_fs() of upper fs for metadata sync.
This fixes correctness of syncfs(2) as demonstrated by following
xfs specific test:
xfs_sync_stats()
{
echo $1
echo -n "xfs_log_force = "
grep log /proc/fs/xfs/stat | awk '{ print $5 }'
}
xfs_sync_stats "before touch"
touch x
xfs_sync_stats "after touch"
xfs_io -c syncfs .
xfs_sync_stats "after syncfs"
xfs_io -c fsync x
xfs_sync_stats "after fsync"
xfs_io -c fsync x
xfs_sync_stats "after fsync #2"
When this test is run in overlay mount over xfs, log force
count does not increase with syncfs command.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Now that copy up of regular file is done using O_TMPFILE,
we don't need to hold rename_lock throughout copy up.
Use the copy up waitqueue to synchronize concurrent copy up
of the same file. Different regular files can be copied up
concurrently.
The upper dir inode_lock is taken instead of rename_lock,
because it is needed for lookup and later for linking the
temp file, but it is released while copying up data.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The overlay sb 'copyup_wq' and overlay inode 'copying' condition
variable are about to replace the upper sb rename_lock, as finer
grained synchronization objects for concurrent copy up.
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In preparation for concurrent copy up, implement copy up
of regular file as O_TMPFILE that is linked to upperdir
instead of a file in workdir that is moved to upperdir.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
As preparation to implementing copy up with O_TMPFILE,
name the variable for dentry before final rename 'temp' and
assign it to 'newdentry' only after rename.
Also lookup upper dentry before looking up temp dentry and
move ovl_set_timestamps() into ovl_copy_up_locked(), because
that is going to be more convenient for upcoming change.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is needed for choosing between concurrent copyup
using O_TMPFILE and legacy copyup using workdir+rename.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Before calling write f_ops, call file_start_write() instead
of sb_start_write().
Replace {sb,file}_start_write() for {copy,clone}_file_range() and
for fallocate().
Beyond correct semantics, this avoids freeze protection to sb when
operating on special inodes, such as fallocate() on a blockdev.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
There is no in-tree file system that implements copy_file_range()
for non regular files.
Deny an attempt to copy_file_range() a directory with EISDIR
and any other non regualr file with EINVAL to conform with
behavior of vfs_{clone,dedup}_file_range().
This change is needed prior to converting sb_start_write()
to file_start_write() in the vfs helper.
Cc: linux-api@vger.kernel.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
There was an obscure use case of fallocate of directory inode
in the vfs helper with the comment:
"Let individual file system decide if it supports preallocation
for directories or not."
But there is no in-tree file system that implements fallocate
for directory operations.
Deny an attempt to fallocate a directory with EISDIR error.
This change is needed prior to converting sb_start_write()
to file_start_write(), so freeze protection is correctly
handled for cases of fallocate file and blockdev.
Cc: linux-api@vger.kernel.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Factor out some common vfs bits from do_tmpfile()
to be used by overlayfs for concurrent copy up.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When a completion is declared on-stack we have to use
COMPLETION_INITIALIZER_ONSTACK().
Fixes: 0b81d07790 ("fs crypto: move per-file encryption from f2fs
tree to fs/crypto")
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Previously, each filesystem configured without encryption support would
define all the public fscrypt functions to their notsupp_* stubs. This
list of #defines had to be updated in every filesystem whenever a change
was made to the public fscrypt functions. To make things more
maintainable now that we have three filesystems using fscrypt, split the
old header fscrypto.h into several new headers. fscrypt_supp.h contains
the real declarations and is included by filesystems when configured
with encryption support, whereas fscrypt_notsupp.h contains the inline
stubs and is included by filesystems when configured without encryption
support. fscrypt_common.h contains common declarations needed by both.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
res is assigned to sizeof(ctx), however, this is unused and res
is updated later on without that assigned value to res ever being
used. Remove this redundant assignment.
Fixes CoverityScan CID#1395546 "Unused value"
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Instead of preallocating all the required COW blocks in the high-level
write code do it inside the iomap code, like we do for all other I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we allocate COW fork blocks for direct I/O writes we currently first
create a delayed allocation, and then convert it to a real allocation
once we've got the delayed one.
As there is no good reason for that this patch instead makes use call
xfs_bmapi_write from the COW allocation path. The only interesting bits
are a few tweaks the low-level allocator to allow for this, most notably
the need to remove the call to xfs_bmap_extsize_align for the cowextsize
in xfs_bmap_btalloc - for the existing convert case it's a no-op, but
for the direct allocation case it would blow up our block reservation
way beyond what we reserved for the transaction.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We'll need it for the direct I/O code. Also rename the function to
xfs_reflink_convert_cow_extent to describe it a bit better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Factor a helper to calculate the extent-size aligned block out of the
iomap code, so that it can be reused by the upcoming reflink dio code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We currently fall back from direct to buffered writes if we detect a
remaining shared extent in the iomap_begin callback. But by the time
iomap_begin is called for the potentially unaligned end block we might
have already written most of the data to disk, which we'd now write
again using buffered I/O. To avoid this reject all writes to reflinked
files before starting I/O so that we are guaranteed to only write the
data once.
The alternative would be to unshare the unaligned start and/or end block
before doing the I/O. I think that's doable, and will actually be
required to support reflinks on DAX file system. But it will take a
little more time and I'd rather get rid of the double write ASAP.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If you change the set of filesystems that are exported, then
the contents of various directories in the NFSv4 pseudo-root
is likely to change. However the change-id of those
directories is currently tied to the underlying directory,
so the client may not see the changes in a timely fashion.
This patch changes the change-id number to be derived from the
"flush_time" of the export cache. Whenever any changes are
made to the set of exported filesystems, this flush_time is
updated. The result is that clients see changes to the set
of exported filesystems much more quickly, often immediately.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We are currently using one bit in s_resize_flags; rename it in order
to allow more of the bits in that unsigned long for other purposes.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the file system requires journal recovery, and the device is
read-ony, return EROFS to the mount system call. This allows xfstests
generic/050 to pass.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
If the journal is aborted, the needs_recovery feature flag should not
be removed. Otherwise, it's the journal might not get replayed and
this could lead to more data getting lost.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
If the journal has been aborted, we shouldn't mark the underlying
buffer head as dirty, since that will cause the metadata block to get
modified. And if the journal has been aborted, we shouldn't allow
this since it will almost certainly lead to a corrupted file system.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
The write_end() function must always unlock the page and drop its ref
count, even on an error.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
After successful IO or permanent error, b_first_retry_time also
needs to be cleared, else the invalid first retry time will be
used by the next retry check.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion. He has tracked
this down to the following path
__alloc_pages_nodemask+0x436/0x4d0
alloc_pages_current+0x97/0x1b0
__page_cache_alloc+0x15d/0x1a0 mm/filemap.c:728
pagecache_get_page+0x5a/0x2b0 mm/filemap.c:1331
grab_cache_page_write_begin+0x23/0x40 mm/filemap.c:2773
iomap_write_begin+0x50/0xd0 fs/iomap.c:118
iomap_write_actor+0xb5/0x1a0 fs/iomap.c:190
? iomap_write_end+0x80/0x80 fs/iomap.c:150
iomap_apply+0xb3/0x130 fs/iomap.c:79
iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243
? iomap_write_end+0x80/0x80
xfs_file_buffered_aio_write+0x132/0x390 [xfs]
? remove_wait_queue+0x59/0x60
xfs_file_write_iter+0x90/0x130 [xfs]
__vfs_write+0xe5/0x140
vfs_write+0xc7/0x1f0
? syscall_trace_enter+0x1d0/0x380
SyS_write+0x58/0xc0
do_syscall_64+0x6c/0x200
entry_SYSCALL64_slow_path+0x25/0x25
the oom victim has access to all memory reserves to make a forward
progress to exit easier. But iomap_file_buffered_write and other
callers of iomap_apply loop to complete the full request. We need to
check for fatal signals and back off with a short write instead.
As the iomap_apply delegates all the work down to the actor we have to
hook into those. All callers that work with the page cache are calling
iomap_write_begin so we will check for signals there. dax_iomap_actor
has to handle the situation explicitly because it copies data to the
userspace directly. Other callers like iomap_page_mkwrite work on a
single page or iomap_fiemap_actor do not allocate memory based on the
given len.
Fixes: 68a9f5e700 ("xfs: implement iomap based buffered write path")
Link: http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is not used anywhere.
CC: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
The issue here is that in orangefs_bufmap_alloc() we do:
bufmap->buffer_index_array =
kzalloc(DIV_ROUND_UP(bufmap->desc_count, BITS_PER_LONG), GFP_KERNEL);
If we choose a bufmap->desc_count like -31 then it means the
DIV_ROUND_UP ends up having an integer overflow. The result is that
kzalloc() returns the ZERO_SIZE_PTR and there is a static checker
warning.
But this bug is harmless because on the next lines we use ->desc_count
to do a kcalloc(). That has integer overflow checking built in so the
kcalloc() fails and we return an error code.
Anyway, it doesn't make sense to talk about negative sizes and blocking
them silences the static checker warning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Currently, lsattr for instance in udf directory gives
"udf: Invalid argument While reading flags on ..."
This patch returns -ENOIOCTLCMD
when command is unknown to have more accurate message like this:
"Inappropriate ioctl for device While reading flags on ..."
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
It only gets called from aops.c and doesn't appear in any headers.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Ever since mount propagation was introduced in cases where a mount in
propagated to parent mount mountpoint pair that is already in use the
code has placed the new mount behind the old mount in the mount hash
table.
This implementation detail is problematic as it allows creating
arbitrary length mount hash chains.
Furthermore it invalidates the constraint maintained elsewhere in the
mount code that a parent mount and a mountpoint pair will have exactly
one mount upon them. Making it hard to deal with and to talk about
this special case in the mount code.
Modify mount propagation to notice when there is already a mount at
the parent mount and mountpoint where a new mount is propagating to
and place that preexisting mount on top of the new mount.
Modify unmount propagation to notice when a mount that is being
unmounted has another mount on top of it (and no other children), and
to replace the unmounted mount with the mount on top of it.
Move the MNT_UMUONT test from __lookup_mnt_last into
__propagate_umount as that is the only call of __lookup_mnt_last where
MNT_UMOUNT may be set on any mount visible in the mount hash table.
These modifications allow:
- __lookup_mnt_last to be removed.
- attach_shadows to be renamed __attach_mnt and its shadow
handling to be removed.
- commit_tree to be simplified
- copy_tree to be simplified
The result is an easier to understand tree of mounts that does not
allow creation of arbitrary length hash chains in the mount hash table.
The result is also a very slight userspace visible difference in semantics.
The following two cases now behave identically, where before order
mattered:
case 1: (explicit user action)
B is a slave of A
mount something on A/a , it will propagate to B/a
and than mount something on B/a
case 2: (tucked mount)
B is a slave of A
mount something on B/a
and than mount something on A/a
Histroically umount A/a would fail in case 1 and succeed in case 2.
Now umount A/a succeeds in both configurations.
This very small change in semantics appears if anything to be a bug
fix to me and my survey of userspace leads me to believe that no programs
will notice or care of this subtle semantic change.
v2: Updated to mnt_change_mountpoint to not call dput or mntput
and instead to decrement the counts directly. It is guaranteed
that there will be other references when mnt_change_mountpoint is
called so this is safe.
v3: Moved put_mountpoint under mount_lock in attach_recursive_mnt
As the locking in fs/namespace.c changed between v2 and v3.
v4: Reworked the logic in propagate_mount_busy and __propagate_umount
that detects when a mount completely covers another mount.
v5: Removed unnecessary tests whose result is alwasy true in
find_topper and attach_recursive_mnt.
v6: Document the user space visible semantic difference.
Cc: stable@vger.kernel.org
Fixes: b90fa9ae8f ("[PATCH] shared mount handling: bind and rbind")
Tested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Michael Kerrisk <<mtk.manpages@gmail.com> writes:
I would like to write code that discovers the namespace setup on a live
system. The NS_GET_PARENT and NS_GET_USERNS ioctl() operations added in
Linux 4.9 provide much of what I want, but there are still a couple of
small pieces missing. Those pieces are added with this patch series.
Here's an example program that makes use of the new ioctl() operations.
8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
/* ns_capable.c
(C) 2016 Michael Kerrisk, <mtk.manpages@gmail.com>
Licensed under the GNU General Public License v2 or later.
Test whether a process (identified by PID) might (subject to LSM checks)
have capabilities in a namespace (identified by a /proc/PID/ns/xxx file).
*/
} while (0)
exit(EXIT_FAILURE); } while (0)
/* Display capabilities sets of process with specified PID */
static void
show_cap(pid_t pid)
{
cap_t caps;
char *cap_string;
caps = cap_get_pid(pid);
if (caps == NULL)
errExit("cap_get_proc");
cap_string = cap_to_text(caps, NULL);
if (cap_string == NULL)
errExit("cap_to_text");
printf("Capabilities: %s\n", cap_string);
}
/* Obtain the effective UID pf the process 'pid' by
scanning its /proc/PID/file */
static uid_t
get_euid_of_process(pid_t pid)
{
char path[PATH_MAX];
char line[1024];
int uid;
snprintf(path, sizeof(path), "/proc/%ld/status", (long) pid);
FILE *fp;
fp = fopen(path, "r");
if (fp == NULL)
errExit("fopen-/proc/PID/status");
for (;;) {
if (fgets(line, sizeof(line), fp) == NULL) {
/* Should never happen... */
fprintf(stderr, "Failure scanning %s\n", path);
exit(EXIT_FAILURE);
}
if (strstr(line, "Uid:") == line) {
sscanf(line, "Uid: %*d %d %*d %*d", &uid);
return uid;
}
}
}
int
main(int argc, char *argv[])
{
int ns_fd, userns_fd, pid_userns_fd;
int nstype;
int next_fd;
struct stat pid_stat;
struct stat target_stat;
char *pid_str;
pid_t pid;
char path[PATH_MAX];
if (argc < 2) {
fprintf(stderr, "Usage: %s PID [ns-file]\n", argv[0]);
fprintf(stderr, "\t'ns-file' is a /proc/PID/ns/xxxx file; "
"if omitted, use the namespace\n"
"\treferred to by standard input "
"(file descriptor 0)\n");
exit(EXIT_FAILURE);
}
pid_str = argv[1];
pid = atoi(pid_str);
if (argc <= 2) {
ns_fd = STDIN_FILENO;
} else {
ns_fd = open(argv[2], O_RDONLY);
if (ns_fd == -1)
errExit("open-ns-file");
}
/* Get the relevant user namespace FD, which is 'ns_fd' if 'ns_fd' refers
to a user namespace, otherwise the user namespace that owns 'ns_fd' */
nstype = ioctl(ns_fd, NS_GET_NSTYPE);
if (nstype == -1)
errExit("ioctl-NS_GET_NSTYPE");
if (nstype == CLONE_NEWUSER) {
userns_fd = ns_fd;
} else {
userns_fd = ioctl(ns_fd, NS_GET_USERNS);
if (userns_fd == -1)
errExit("ioctl-NS_GET_USERNS");
}
/* Obtain 'stat' info for the user namespace of the specified PID */
snprintf(path, sizeof(path), "/proc/%s/ns/user", pid_str);
pid_userns_fd = open(path, O_RDONLY);
if (pid_userns_fd == -1)
errExit("open-PID");
if (fstat(pid_userns_fd, &pid_stat) == -1)
errExit("fstat-PID");
/* Get 'stat' info for the target user namesapce */
if (fstat(userns_fd, &target_stat) == -1)
errExit("fstat-PID");
/* If the PID is in the target user namespace, then it has
whatever capabilities are in its sets. */
if (pid_stat.st_dev == target_stat.st_dev &&
pid_stat.st_ino == target_stat.st_ino) {
printf("PID is in target namespace\n");
printf("Subject to LSM checks, it has the following capabilities\n");
show_cap(pid);
exit(EXIT_SUCCESS);
}
/* Otherwise, we need to walk through the ancestors of the target
user namespace to see if PID is in an ancestor namespace */
for (;;) {
int f;
next_fd = ioctl(userns_fd, NS_GET_PARENT);
if (next_fd == -1) {
/* The error here should be EPERM... */
if (errno != EPERM)
errExit("ioctl-NS_GET_PARENT");
printf("PID is not in an ancestor namespace\n");
printf("It has no capabilities in the target namespace\n");
exit(EXIT_SUCCESS);
}
if (fstat(next_fd, &target_stat) == -1)
errExit("fstat-PID");
/* If the 'stat' info for this user namespace matches the 'stat'
* info for 'next_fd', then the PID is in an ancestor namespace */
if (pid_stat.st_dev == target_stat.st_dev &&
pid_stat.st_ino == target_stat.st_ino)
break;
/* Next time round, get the next parent */
f = userns_fd;
userns_fd = next_fd;
close(f);
}
/* At this point, we found that PID is in an ancestor of the target
user namespace, and 'userns_fd' refers to the immediate descendant
user namespace of PID in the chain of user namespaces from PID to
the target user namespace. If the effective UID of PID matches the
owner UID of descendant user namespace, then PID has all
capabilities in the descendant namespace(s); otherwise, it just has
the capabilities that are in its sets. */
uid_t owner_uid, uid;
if (ioctl(userns_fd, NS_GET_OWNER_UID, &owner_uid) == -1) {
perror("ioctl-NS_GET_OWNER_UID");
exit(EXIT_FAILURE);
}
uid = get_euid_of_process(pid);
printf("PID is in an ancestor namespace\n");
if (owner_uid == uid) {
printf("And its effective UID matches the owner "
"of the namespace\n");
printf("Subject to LSM checks, PID has all capabilities in "
"that namespace!\n");
} else {
printf("But its effective UID does not match the owner "
"of the namespace\n");
printf("Subject to LSM checks, it has the following capabilities\n");
show_cap(pid);
}
exit(EXIT_SUCCESS);
}
8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
Michael Kerrisk (2):
nsfs: Add an ioctl() to return the namespace type
nsfs: Add an ioctl() to return owner UID of a userns
fs/nsfs.c | 13 +++++++++++++
include/uapi/linux/nsfs.h | 9 +++++++--
2 files changed, 20 insertions(+), 2 deletions(-)
I'd like to write code that discovers the user namespace hierarchy on a
running system, and also shows who owns the various user namespaces.
Currently, there is no way of getting the owner UID of a user namespace.
Therefore, this patch adds a new NS_GET_CREATOR_UID ioctl() that fetches
the UID (as seen in the user namespace of the caller) of the creator of
the user namespace referred to by the specified file descriptor.
If the supplied file descriptor does not refer to a user namespace,
the operation fails with the error EINVAL. If the owner UID does
not have a mapping in the caller's user namespace return the
overflow UID as that appears easier to deal with in practice
in user-space applications.
-- EWB Changed the handling of unmapped UIDs from -EOVERFLOW
back to the overflow uid. Per conversation with
Michael Kerrisk after examining his test code.
Acked-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Michael Kerrisk <mtk-manpages@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Christoph Hellwig pointed out that there's a potentially nasty race when
performing simultaneous nearby directio cow writes:
"Thread 1 writes a range from B to c
" B --------- C
p
"a little later thread 2 writes from A to B
" A --------- B
p
[editor's note: the 'p' denote cowextsize boundaries, which I added to
make this more clear]
"but the code preallocates beyond B into the range where thread
"1 has just written, but ->end_io hasn't been called yet.
"But once ->end_io is called thread 2 has already allocated
"up to the extent size hint into the write range of thread 1,
"so the end_io handler will splice the unintialized blocks from
"that preallocation back into the file right after B."
We can avoid this race by ensuring that thread 1 cannot accidentally
remap the blocks that thread 2 allocated (as part of speculative
preallocation) as part of t2's write preparation in t1's end_io handler.
The way we make this happen is by taking advantage of the unwritten
extent flag as an intermediate step.
Recall that when we begin the process of writing data to shared blocks,
we create a delayed allocation extent in the CoW fork:
D: --RRRRRRSSSRRRRRRRR---
C: ------DDDDDDD---------
When a thread prepares to CoW some dirty data out to disk, it will now
convert the delalloc reservation into an /unwritten/ allocated extent in
the cow fork. The da conversion code tries to opportunistically
allocate as much of a (speculatively prealloc'd) extent as possible, so
we may end up allocating a larger extent than we're actually writing
out:
D: --RRRRRRSSSRRRRRRRR---
U: ------UUUUUUU---------
Next, we convert only the part of the extent that we're actively
planning to write to normal (i.e. not unwritten) status:
D: --RRRRRRSSSRRRRRRRR---
U: ------UURRUUU---------
If the write succeeds, the end_cow function will now scan the relevant
range of the CoW fork for real extents and remap only the real extents
into the data fork:
D: --RRRRRRRRSRRRRRRRR---
U: ------UU--UUU---------
This ensures that we never obliterate valid data fork extents with
unwritten blocks from the CoW fork.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In the data fork, we only allow extents to perform the following state
transitions:
delay -> real <-> unwritten
There's no way to move directly from a delalloc reservation to an
/unwritten/ allocated extent. However, for the CoW fork we want to be
able to do the following to each extent:
delalloc -> unwritten -> written -> remapped to data fork
This will help us to avoid a race in the speculative CoW preallocation
code between a first thread that is allocating a CoW extent and a second
thread that is remapping part of a file after a write. In order to do
this, however, we need two things: first, we have to be able to
transition from da to unwritten, and second the function that converts
between real and unwritten has to be made aware of the cow fork. Do
both of those things.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Perform basic sanity checking of the directory free block header
fields so that we avoid hanging the system on invalid data.
(Granted that just means that now we shutdown on directory write,
but that seems better than hanging...)
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We can't handle a bmbt that's taller than BTREE_MAXLEVELS, and there's
no such thing as a zero-level bmbt (for that we have extents format),
so if we see this, send back an error code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Don't let anybody load an obviously bad btree pointer. Since the values
come from disk, we must return an error, not just ASSERT.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
When we open a directory, we try to readahead block 0 of the directory
on the assumption that we're going to need it soon. If the bmbt is
corrupt, the directory will never be usable and the readahead fails
immediately, so we might as well prevent the directory from being opened
at all. This prevents a subsequent read or modify operation from
hitting it and taking the fs offline.
NOTE: We're only checking for early failures in the block mapping, not
the readahead directory block itself.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We use di_format and if_flags to decide whether we're grabbing the ilock
in btree mode (btree extents not loaded) or shared mode (anything else),
but the state of those fields can be changed by other threads that are
also trying to load the btree extents -- IFEXTENTS gets set before the
_bmap_read_extents call and cleared if it fails.
We don't actually need to have IFEXTENTS set until after the bmbt
records are successfully loaded and validated, which will fix the race
between multiple threads trying to read the same directory. The next
patch strengthens directory bmbt validation by refusing to open the
directory if reading the bmbt to start directory readahead fails.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYk5bGAAoJECebzXlCjuG+vGcP/j2Sw74vbJX4ifkFQvUomOp/
83IHinaax5RDIqryf903iabvKX1SuXpVuxgpxtaSCz+oS9fV5wv/28NdIrMwfICJ
DC3I2xQ/osLUGHw1Td8BZ+Fv+P0Th+OGKwWCydp3Xejg/X+XUKrLs/Ex1/rHwTzJ
y/BIAV7BU94HlDFxif7sKbv6mi/gpETlvJK4AAbSbvpd4f5JuBXPQBtJBxUPtwi1
zM0pQzMoBgY5XzAZBWmQEYCJjTje5wqiueDHtteh8/X7FoLWtAcHsOmleglleKQn
927eRCTrHt9ioToyZJM4GkH8eW/nOMpaKKvjXCTdoKsuEZyAnlPk0VZUAikfDvql
dr9FznI21Geq/c9IeU3+1xSbe1I9eDO0L9qWp8prUikIZwI4kdEBN5z/0oMMcU4q
IXw2lb46w8JD11GIIAGZIhZb63FdO75Ck4Z2GX6UUFqt246s26Go9yJIZEDfu5sL
8FLmwgOYNMhFrSPAR9JmnBors5gNT9owNwieUB8IFvgMv1ajz2CWG2yvNO9Sq/SK
a/HJJ7A1YvX0uSzsKsvO/j5S4cY73l2kWKX4NRqMFXIYzMzNGHvIvIB238tXAzZe
Z5YaMycsjuRKe9VkP2lZQtVzl9qfnvkd5o6Tg3RkMZXkVOHMB/j2yratWB2XTl3h
I2xAGjQFJ/Pn66pJWNe0
=yQ4e
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.10-2' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Three more miscellaneous nfsd bugfixes"
* tag 'nfsd-4.10-2' of git://linux-nfs.org/~bfields/linux:
svcrpc: fix oops in absence of krb5 module
nfsd: special case truncates some more
NFSD: Fix a null reference case in find_or_create_lock_stateid()
We don't always have easy access to the dentry of a file or directory we
created in debugfs. Add a helper which allows us to get a dentry we
previously created.
The motivation for this change is a problem with blktrace and the blk-mq
debugfs entries introduced in 07e4fead45 ("blk-mq: create debugfs
directory tree"). Namely, in some cases, the directory that blktrace
needs to create may already exist, but in other cases, it may not. We
_could_ rely on a bunch of implied knowledge to decide whether to create
the directory or not, but it's much cleaner on our end to just look it
up.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
The "half md4" transform should not be used by any new code. And
fortunately, it's only used now by ext4. Since ext4 supports several
hashing methods, at some point it might be desirable to move to
something like SipHash. As an intermediate step, remove half md4 from
cryptohash.h and lib, and make it just a local function in ext4's
hash.c. There's precedent for doing this; the other function ext can use
for its hashes -- TEA -- is also implemented in the same place. Also, by
being a local function, this might allow gcc to perform some additional
optimizations.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
blk_get_backing_dev_info() is now a simple dereference. Remove that
function and simplify some code around that.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currenly blk_get_backing_dev_info() is not safe to be called when the
block device is not open as bdev->bd_disk is NULL in that case. However
inode_to_bdi() uses this function and may be call called from flusher
worker or other writeback related functions without bdev being open
which leads to crashes such as:
[113031.075540] Unable to handle kernel paging request for data at address 0x00000000
[113031.075614] Faulting instruction address: 0xc0000000003692e0
0:mon> t
[c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
[c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
[c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
[c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
[c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
[c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
[c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
[c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
Signed-off-by: Jens Axboe <axboe@fb.com>
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, block device inodes stay around after corresponding gendisk
hash died until memory reclaim finds them and frees them. Since we will
make block device inode pin the bdi, we want to free the block device
inode as soon as the device goes away so that bdi does not stay around
unnecessarily. Furthermore we need to avoid issues when new device with
the same major,minor pair gets created since reusing the bdi structure
would be rather difficult in this case.
Unhashing block device inode on gendisk destruction nicely deals with
these problems. Once last block device inode reference is dropped (which
may be directly in del_gendisk()), the inode gets evicted. Furthermore if
the major,minor pair gets reallocated, we are guaranteed to get new
block device inode even if old block device inode is not yet evicted and
thus we avoid issues with possible reuse of bdi.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
In the case where the child's encryption context was inconsistent with
its parent directory, we were using inode->i_sb and inode->i_ino after
the inode had already been iput(). Fix this by doing the iput() in the
correct places.
Note: only ext4 had this bug, not f2fs and ubifs.
Fixes: d9cdc90331 ("ext4 crypto: enforce context consistency")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Below is the synchronization issue between unmount and kjournald2
contexts, which results into use after free issue in kjournald2().
Fix this issue by using journal->j_state_lock to synchronize the
wait_event() done in journal_kill_thread() and the wake_up() done
in kjournald2().
TASK 1:
umount cmd:
|--jbd2_journal_destroy() {
|--journal_kill_thread() {
write_lock(&journal->j_state_lock);
journal->j_flags |= JBD2_UNMOUNT;
...
write_unlock(&journal->j_state_lock);
wake_up(&journal->j_wait_commit); TASK 2 wakes up here:
kjournald2() {
...
checks JBD2_UNMOUNT flag and calls goto end-loop;
...
end_loop:
write_unlock(&journal->j_state_lock);
journal->j_task = NULL; --> If this thread gets
pre-empted here, then TASK 1 wait_event will
exit even before this thread is completely
done.
wait_event(journal->j_wait_done_commit, journal->j_task == NULL);
...
write_lock(&journal->j_state_lock);
write_unlock(&journal->j_state_lock);
}
|--kfree(journal);
}
}
wake_up(&journal->j_wait_done_commit); --> this step
now results into use after free issue.
}
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Allow to decrypt transformed packets that are bigger than the big
buffer size. In particular it is used for read responses that can
only exceed the big buffer size.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Since we have two different types of reads (pagecache and direct)
we need to process such responses differently after decryption of
a packet. The change allows to specify a callback that copies a read
payload data into preallocated pages.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
We need to process read responses differently because the data
should go directly into preallocated pages. This can be done
by specifying a mid handle callback.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
We need to recognize and parse transformed packets in demultiplex
thread to find a corresponsing mid and process it further.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
This change allows to encrypt packets if it is required by a server
for SMB sessions or tree connections.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
In order to allow encryption on SMB connection we need to exchange
a session key and generate encryption and decryption keys.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
This will allow us to do protocol specific tranformations of packets
before sending to the server. For SMB3 it can be used to support
encryption.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Allocate and initialize SMB2 read request without RFC1001 length
field to directly call cifs_send_recv() rather than SendReceive2()
in a read codepath.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Do not process RFC1001 length in smb2_hdr_assemble() because
it is not a part of SMB2 header. This allows to cleanup the code
and adds a possibility combine several SMB2 packets into one
for compounding.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
In order to simplify further encryption support we need to separate
RFC1001 length and SMB2 header when sending a request. Put the length
field in iov[0] and the rest of the packet into following iovs.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Now SendReceive2 frees the first iov and returns a response buffer
in it that increases a code complexity. Simplify this by making
a caller responsible for freeing request buffer itself and returning
a response buffer in a separate iov.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
In order to support compounding and encryption we need to separate
RFC1001 length field and SMB2 header structure because the protocol
treats them differently. This change will allow to simplify parsing
of such complex SMB2 packets further.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Currently we call copy_page_to_iter() for uncached reading into a pipe.
This is wrong because it treats pages as VFS cache pages and copies references
rather than actual data. When we are trying to read from the pipe we end up
calling page_cache_pipe_buf_confirm() which returns -ENODATA. This error
is translated into 0 which is returned to a user.
This issue is reproduced by running xfs-tests suite (generic test #249)
against mount points with "cache=none". Fix it by mapping pages manually
and calling copy_to_iter() that copies data into the pipe.
Cc: Stable <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
List soft dependencies of cifs so that mkinitrd and dracut can include
the required helper modules.
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Steve French <sfrench@samba.org>
The sha256 and cmac crypto modules are only needed for SMB2+, so move
the select statements to config CIFS_SMB2. Also select CRYPTO_AES
there as SMB2+ needs it.
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Steve French <sfrench@samba.org>
* CIFS_SMB2 depends on CIFS, which depends on INET and selects NLS. So
these dependencies do not need to be repeated for CIFS_SMB2.
* CIFS_SMB311 depends on CIFS_SMB2, which depends on INET. So this
dependency doesn't need to be repeated for CIFS_SMB311.
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Steve French <sfrench@samba.org>
Pull fscache fixes from Al Viro.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fscache: Fix dead object requeue
fscache: Clear outstanding writes when disabling a cookie
FS-Cache: Initialise stores_lock in netfs cookie
To support unprivileged users mounting filesystems two permission
checks have to be performed: a test to see if the user allowed to
create a mount in the mount namespace, and a test to see if
the user is allowed to access the specified filesystem.
The automount case is special in that mounting the original filesystem
grants permission to mount the sub-filesystems, to any user who
happens to stumble across the their mountpoint and satisfies the
ordinary filesystem permission checks.
Attempting to handle the automount case by using override_creds
almost works. It preserves the idea that permission to mount
the original filesystem is permission to mount the sub-filesystem.
Unfortunately using override_creds messes up the filesystems
ordinary permission checks.
Solve this by being explicit that a mount is a submount by introducing
vfs_submount, and using it where appropriate.
vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let
sget and friends know that a mount is a submount so they can take appropriate
action.
sget and sget_userns are modified to not perform any permission checks
on submounts.
follow_automount is modified to stop using override_creds as that
has proven problemantic.
do_mount is modified to always remove the new MS_SUBMOUNT flag so
that we know userspace will never by able to specify it.
autofs4 is modified to stop using current_real_cred that was put in
there to handle the previous version of submount permission checking.
cifs is modified to pass the mountpoint all of the way down to vfs_submount.
debugfs is modified to pass the mountpoint all of the way down to
trace_automount by adding a new parameter. To make this change easier
a new typedef debugfs_automount_t is introduced to capture the type of
the debugfs automount function.
Cc: stable@vger.kernel.org
Fixes: 069d5ac9ae ("autofs: Fix automounts by using current_real_cred()->uid")
Fixes: aeaa4a79ff ("fs: Call d_automount with the filesystems creds")
Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This way we don't need to deal with cputime_t details from the core code.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-32-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-12-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that most cputime readers use the transition API which return the
task cputime in old style cputime_t, we can safely store the cputime in
nsecs. This will eventually make cputime statistics less opaque and more
granular. Back and forth convertions between cputime_t and nsecs in order
to deal with cputime_t random granularity won't be needed anymore.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This API returns a task's cputime in cputime_t in order to ease the
conversion of cputime internals to use nsecs units instead. Blindly
converting all cputime readers to use this API now will later let us
convert more smoothly and step by step all these places to use the
new nsec based cputime.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-7-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cputime_t is being obsolete and replaced by nsecs units in order to make
internal timestamps less opaque and more granular.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kernel CPU stats are stored in cputime_t which is an architecture
defined type, and hence a bit opaque and requiring accessors and mutators
for any operation.
Converting them to nsecs simplifies the code and is one step toward
the removal of cputime_t in the core code.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
may_create() rejects creation of inodes with ids which lack a
mapping into s_user_ns. However for O_CREAT may_o_create() is
is used instead. Add a similar check there.
Fixes: 036d523641 ("vfs: Don't create inodes with a uid or gid unknown to the vfs")
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Instead of keeping two levels of indirection for requests types, fold it
all into the operations. The little caveat here is that previously
cmd_type only applied to struct request, while the request and bio op
fields were set to plain REQ_OP_READ/WRITE even for passthrough
operations.
Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
private requests, althought it has to add two for each so that we
can communicate the data in/out nature of the request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Under some circumstances, an fscache object can become queued such that it
fscache_object_work_func() can be called once the object is in the
OBJECT_DEAD state. This results in the kernel oopsing when it tries to
invoke the handler for the state (which is hard coded to 0x2).
The way this comes about is something like the following:
(1) The object dispatcher is processing a work state for an object. This
is done in workqueue context.
(2) An out-of-band event comes in that isn't masked, causing the object to
be queued, say EV_KILL.
(3) The object dispatcher finishes processing the current work state on
that object and then sees there's another event to process, so,
without returning to the workqueue core, it processes that event too.
It then follows the chain of events that initiates until we reach
OBJECT_DEAD without going through a wait state (such as
WAIT_FOR_CLEARANCE).
At this point, object->events may be 0, object->event_mask will be 0
and oob_event_mask will be 0.
(4) The object dispatcher returns to the workqueue processor, and in due
course, this sees that the object's work item is still queued and
invokes it again.
(5) The current state is a work state (OBJECT_DEAD), so the dispatcher
jumps to it - resulting in an OOPS.
When I'm seeing this, the work state in (1) appears to have been either
LOOK_UP_OBJECT or CREATE_OBJECT (object->oob_table is
fscache_osm_lookup_oob).
The window for (2) is very small:
(A) object->event_mask is cleared whilst the event dispatch process is
underway - though there's no memory barrier to force this to the top
of the function.
The window, therefore is from the time the object was selected by the
workqueue processor and made requeueable to the time the mask was
cleared.
(B) fscache_raise_event() will only queue the object if it manages to set
the event bit and the corresponding event_mask bit was set.
The enqueuement is then deferred slightly whilst we get a ref on the
object and get the per-CPU variable for workqueue congestion. This
slight deferral slightly increases the probability by allowing extra
time for the workqueue to make the item requeueable.
Handle this by giving the dead state a processor function and checking the
for the dead state address rather than seeing if the processor function is
address 0x2. The dead state processor function can then set a flag to
indicate that it's occurred and give a warning if it occurs more than once
per object.
If this race occurs, an oops similar to the following is seen (note the RIP
value):
BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
IP: [<0000000000000002>] 0x1
PGD 0
Oops: 0010 [#1] SMP
Modules linked in: ...
CPU: 17 PID: 16077 Comm: kworker/u48:9 Not tainted 3.10.0-327.18.2.el7.x86_64 #1
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
Workqueue: fscache_object fscache_object_work_func [fscache]
task: ffff880302b63980 ti: ffff880717544000 task.ti: ffff880717544000
RIP: 0010:[<0000000000000002>] [<0000000000000002>] 0x1
RSP: 0018:ffff880717547df8 EFLAGS: 00010202
RAX: ffffffffa0368640 RBX: ffff880edf7a4480 RCX: dead000000200200
RDX: 0000000000000002 RSI: 00000000ffffffff RDI: ffff880edf7a4480
RBP: ffff880717547e18 R08: 0000000000000000 R09: dfc40a25cb3a4510
R10: dfc40a25cb3a4510 R11: 0000000000000400 R12: 0000000000000000
R13: ffff880edf7a4510 R14: ffff8817f6153400 R15: 0000000000000600
FS: 0000000000000000(0000) GS:ffff88181f420000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000002 CR3: 000000000194a000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffffffffa0363695 ffff880edf7a4510 ffff88093f16f900 ffff8817faa4ec00
ffff880717547e60 ffffffff8109d5db 00000000faa4ec18 0000000000000000
ffff8817faa4ec18 ffff88093f16f930 ffff880302b63980 ffff88093f16f900
Call Trace:
[<ffffffffa0363695>] ? fscache_object_work_func+0xa5/0x200 [fscache]
[<ffffffff8109d5db>] process_one_work+0x17b/0x470
[<ffffffff8109e4ac>] worker_thread+0x21c/0x400
[<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
[<ffffffff810a5acf>] kthread+0xcf/0xe0
[<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[<ffffffff816460d8>] ret_from_fork+0x58/0x90
[<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeremy McNicoll <jeremymc@redhat.com>
Tested-by: Frank Sorenson <sorenson@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fscache_disable_cookie() needs to clear the outstanding writes on the
cookie it's disabling because they cannot be completed after.
Without this, fscache_nfs_open_file() gets stuck because it disables the
cookie when the file is opened for writing but can't uncache the pages till
afterwards - otherwise there's a race between the open routine and anyone
who already has it open R/O and is still reading from it.
Looking in /proc/pid/stack of the offending process shows:
[<ffffffffa0142883>] __fscache_wait_on_page_write+0x82/0x9b [fscache]
[<ffffffffa014336e>] __fscache_uncache_all_inode_pages+0x91/0xe1 [fscache]
[<ffffffffa01740fa>] nfs_fscache_open_file+0x59/0x9e [nfs]
[<ffffffffa01ccf41>] nfs4_file_open+0x17f/0x1b8 [nfsv4]
[<ffffffff8117350e>] do_dentry_open+0x16d/0x2b7
[<ffffffff811743ac>] vfs_open+0x5c/0x65
[<ffffffff81184185>] path_openat+0x785/0x8fb
[<ffffffff81184343>] do_filp_open+0x48/0x9e
[<ffffffff81174710>] do_sys_open+0x13b/0x1cb
[<ffffffff811747b9>] SyS_open+0x19/0x1b
[<ffffffff81001c44>] do_syscall_64+0x80/0x17a
[<ffffffff8165c2da>] return_from_SYSCALL_64+0x0/0x7a
[<ffffffffffffffff>] 0xffffffffffffffff
Reported-by: Jianhong Yin <jiyin@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Initialise the stores_lock in fscache netfs cookies. Technically, it
shouldn't be necessary, since the netfs cookie is an index and stores no
data, but initialising it anyway adds insignificant overhead.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We only need this code to support scsi, ide, cciss and virtio. And at
least for virtio it's a deprecated feature to start with.
This should shrink the kernel size for embedded device that only use,
say eMMC a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently turning on NFSv4.2 results in 4.2 clients suddenly seeing the
individual file labels as they're set on the server. This is not what
they've previously seen, and not appropriate in may cases. (In
particular, if clients have heterogenous security policies then one
client's labels may not even make sense to another.) Labeled NFS should
be opted in only in those cases when the administrator knows it makes
sense.
It's helpful to be able to turn 4.2 on by default, and otherwise the
protocol upgrade seems free of regressions. So, default labeled NFS to
off and provide an export flag to reenable it.
Users wanting labeled NFS support on an export will henceforth need to:
- make sure 4.2 support is enabled on client and server (as
before), and
- upgrade the server nfs-utils to a version supporting the new
"security_label" export flag.
- set that "security_label" flag on the export.
This is commit may be seen as a regression to anyone currently depending
on security labels. We believe those cases are currently rare.
Reported-by: tibbs@math.uh.edu
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
I noticed this was missing when I was testing with link local addresses.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
I noticed this was missing when I was testing with link local addresses.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This is just cleanup, no change in functionality.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This is just cleanup, no change in functionality.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
After fae5096ad2 "nfsd: assume writeable exportabled filesystems have
f_sync" we no longer modify this argument.
This is just cleanup, no change in functionality.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Writing to /proc/fs/nfsd/versions allows individual major versions
and NFSv4 minor versions to be enabled or disabled.
However NFSv4.0 cannot currently be disabled, thought there is no good reason.
Also the minor number is parsed as a 'long' but used as an 'int'
so '4294967297' will be incorrectly treated as '1'.
This patch removes the test on 'minor == 0' and switches to kstrtouint()
to get correct range checking.
When reading from /proc/fs/nfsd/versions, 4.0 is current not reported.
To allow the disabling for v4.0 to be visible, while maintaining
backward compatibility, change code to report "-4.0" if appropriate, but
not "+4.0".
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Both the NFS protocols and the Linux VFS use a setattr operation with a
bitmap of attributs to set to set various file attributes including the
file size and the uid/gid.
The Linux syscalls never mixes size updates with unrelated updates like
the uid/gid, and some file systems like XFS and GFS2 rely on the fact
that truncates might not update random other attributes, and many other
file systems handle the case but do not update the different attributes
in the same transaction. NFSD on the other hand passes the attributes
it gets on the wire more or less directly through to the VFS, leading to
updates the file systems don't expect. XFS at least has an assert on
the allowed attributes, which caught an unusual NFS client setting the
size and group at the same time.
To handle this issue properly this switches nfsd to call vfs_truncate
for size changes, and then handle all other attributes through
notify_change. As a side effect this also means less boilerplace code
around the size change as we can now reuse the VFS code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
nfsd assigns the nfs4_free_lock_stateid to .sc_free in init_lock_stateid().
If nfsd doesn't go through init_lock_stateid() and put stateid at end,
there is a NULL reference to .sc_free when calling nfs4_put_stid(ns).
This patch let the nfs4_stid.sc_free assignment to nfs4_alloc_stid().
Cc: stable@vger.kernel.org
Fixes: 356a95ece7 "nfsd: clean up races in lock stateid searching..."
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The "full" argument was used only by the fiemap formatter,
which is now gone with the iomap updates.
Remove the unused arg.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
It's possible for post-eof blocks to end up being used for direct I/O
writes. dio write performs an upfront unwritten extent allocation, sends
the dio and then updates the inode size (if necessary) on write
completion. If a file release occurs while a file extending dio write is
in flight, it is possible to mistake the post-eof blocks for speculative
preallocation and incorrectly truncate them from the inode. This means
that the resulting dio write completion can discover a hole and allocate
new blocks rather than perform unwritten extent conversion.
This requires a strange mix of I/O and is thus not likely to reproduce
in real world workloads. It is intermittently reproduced by generic/299.
The error manifests as an assert failure due to transaction overrun
because the aforementioned write completion transaction has only
reserved enough blocks for btree operations:
XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, \
file: fs/xfs//xfs_trans.c, line: 309
The root cause is that xfs_free_eofblocks() uses i_size to truncate
post-eof blocks from the inode, but async, file extending direct writes
do not update i_size until write completion, long after inode locks are
dropped. Therefore, xfs_free_eofblocks() effectively truncates the inode
to the incorrect size.
Update xfs_free_eofblocks() to serialize against dio similar to how
extending writes are serialized against i_size updates before post-eof
block zeroing. Specifically, wait on dio while under the iolock. This
ensures that dio write completions have updated i_size before post-eof
blocks are processed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The xfs_eofblocks.eof_scan_owner field is an internal field to
facilitate invoking eofb scans from the kernel while under the iolock.
This is necessary because the eofb scan acquires the iolock of each
inode. Synchronous scans are invoked on certain buffered write failures
while under iolock. In such cases, the scan owner indicates that the
context for the scan already owns the particular iolock and prevents a
double lock deadlock.
eofblocks scans while under iolock are still livelock prone in the event
of multiple parallel scans, however. If multiple buffered writes to
different inodes fail and invoke eofblocks scans at the same time, each
scan avoids a deadlock with its own inode by virtue of the
eof_scan_owner field, but will never be able to acquire the iolock of
the inode from the parallel scan. Because the low free space scans are
invoked with SYNC_WAIT, the scan will not return until it has processed
every tagged inode and thus both scans will spin indefinitely on the
iolock being held across the opposite scan. This problem can be
reproduced reliably by generic/224 on systems with higher cpu counts
(x16).
To avoid this problem, simplify the semantics of eofblocks scans to
never invoke a scan while under iolock. This means that the buffered
write context must drop the iolock before the scan. It must reacquire
the lock before the write retry and also repeat the initial write
checks, as the original state might no longer be valid once the iolock
was dropped.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_free_eofblocks() requires the IOLOCK_EXCL lock, but is called from
different contexts where the lock may or may not be held. The
need_iolock parameter exists for this reason, to indicate whether
xfs_free_eofblocks() must acquire the iolock itself before it can
proceed.
This is ugly and confusing. Simplify the semantics of
xfs_free_eofblocks() to require the caller to acquire the iolock
appropriately and kill the need_iolock parameter. While here, the mp
param can be removed as well as the xfs_mount is accessible from the
xfs_inode structure. This patch does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
After scratching my head looking for "xfs_busy_extent" I realized
it's not used; it's xfs_extent_busy, and the declaration for the
other name is bogus. Remove that and a few others as well.
(struct xfs_log_callback is used, but the 2nd declaration is
unnecessary).
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that xfs_btree_init_block_int is able to determine crc
status from the passed-in mp, we can determine the proper
magic as well if we are given a btree number, rather than
an explicit magic value.
Change xfs_btree_init_block[_int] callers to pass in the
btree number, and let xfs_btree_init_block_int use the
xfs_magics array via the xfs_btree_magic macro to determine
which magic value is needed. This makes all of the
if (crc) / else stanzas identical, and the if/else can be
removed, leading to a single, common init_block call.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Right now the xfs_btree_magic() define takes only a cursor;
change this to take crc and btnum args to make it more generically
useful, and move to a function.
This will allow xfs_btree_init_block_int callers which don't
have a cursor to make use of the xfs_magics array, which will
happen in the next patch.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_btree_init_block_int() can determine whether crcs are
in effect without the passed-in XFS_BTREE_CRC_BLOCKS flag;
the mp argument allows us to determine this from the
superblock. Remove the flag from callers, and use
xfs_sb_version_hascrc(&mp->m_sb) internally instead.
This removes one difference between the if & else cases
in the callers.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Fixes the following sparse warning:
fs/nfs/nfs4state.c:862:60: warning: Using plain integer as NULL pointer
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Fixes the following sparse warning:
fs/nfs/flexfilelayout/flexfilelayout.c:2114:34: warning:
symbol 'layoutreturn_ops' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This function doesn't add much, since all it does is access the server's
nfs_client variable.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There is no need for a goto just to return an error code without any
cleanup. Returning the error directly helps to clean up the code.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This tracepoint displays information about the slot that was chosen for
the RPC, in addition to session information. This could be useful
information for debugging, and we can set the session id hash to 0 to
indicate that there is no session.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This creates a single place for all the work to happen, using the
presence of a session to determine if extra values need to be set.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This puts the check in a single place, rather than needing to implement
it twice for v4.0 and v4.1.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The inline ifdef lets us put everything in a single place, rather than
having two (very similar) versions of this function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This does the right thing depending on if we have a session, rather than
needing to handle this manually in multiple places.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
I want to have all callers use this function, rather than calling the
NFS v4.0 and v4.1 versions directly. This includes pNFS, which only has
access to the nfs_client structure in some places.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
pNFS only has access to the nfs_client structure, and not the
nfs_server, so we need to make this change so the function can be used
by pNFS as well.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This puts session related functions together in the same space. I only
keep one version of this function, since this variable will always be
NULL when using NFS v4.0.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This function is a bit clumsy, incorrectly producing
",mountproto=" if mountd_protocol is 0 and !showdefaults,
and duplicating the code for reporting "auto".
Tidy it up so that it only makes a single seq_printf() call,
and more obviously does the right thing.
Fixes: ee671b016f ("NFS: convert proto= option to use netids rather than a protoname")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Allow line continuations to work properly with KERN_CONT.
Signed-off-by: Joe Perches <joe@perches.com>
[Anna: Add fallback dprintk_cont() for when CONFIG_SUNRPC_DEBUG=n]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This patch modifies functions gfs2_trans_add_meta and _data so that
they check whether the buffer_head is already in a transaction,
and if so, avoid taking the gfs2_log_lock.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch relaxes async discard commands to avoid waiting its end_io during
checkpoint.
Instead of waiting them during checkpoint, it will be done when actually reusing
them.
Test on initial partition of nvme drive.
# time fstrim /mnt/test
Before : 6.158s
After : 4.822s
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
A test program gets the SEEK_DATA with two values between
a new created file and the exist file on f2fs filesystem.
F2FS filesystem, (the first "test1" is a new file)
SEEK_DATA size != 0 (offset = 8192)
SEEK_DATA size != 0 (offset = 4096)
PNFS filesystem, (the first "test1" is a new file)
SEEK_DATA size != 0 (offset = 4096)
SEEK_DATA size != 0 (offset = 4096)
int main(int argc, char **argv)
{
char *filename = argv[1];
int offset = 1, i = 0, fd = -1;
if (argc < 2) {
printf("Usage: %s f2fsfilename\n", argv[0]);
return -1;
}
/*
if (!access(filename, F_OK) || errno != ENOENT) {
printf("Needs a new file for test, %m\n");
return -1;
}*/
fd = open(filename, O_RDWR | O_CREAT, 0777);
if (fd < 0) {
printf("Create test file %s failed, %m\n", filename);
return -1;
}
for (i = 0; i < 20; i++) {
offset = 1 << i;
ftruncate(fd, 0);
lseek(fd, offset, SEEK_SET);
write(fd, "test", 5);
/* Get the alloc size by seek data equal zero*/
if (lseek(fd, 0, SEEK_DATA)) {
printf("SEEK_DATA size != 0 (offset = %d)\n", offset);
break;
}
}
close(fd);
return 0;
}
Reported-and-Tested-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes the renaming bug on encrypted filenames, which was pointed by
(ext4: don't allow encrypted operations without keys)
Cc: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to show the max number of atomic operations which are
conducting concurrently.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch implements IO alignment by filling dummy blocks in DATA and NODE
write bios. If we can guarantee, for example, 32KB or 64KB for such the IOs,
we can eliminate underlying dummy page problem which FTL conducts in order to
close MLC or TLC partial written pages.
Note that,
- it requires "-o mode=lfs".
- IO size should be power of 2, not exceed BIO_MAX_PAGES, 256.
- read IO is still 4KB.
- do checkpoint at fsync, if dummy NODE page was written.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If userspace issue a fstrim with a range not involve prefree segments,
it will reuse these segments without discard. This patch fix it.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If the range we write cover the whole valid data in the last page,
we do not need to read it.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
[Jaegeuk Kim: nullify the remaining area (fix: xfstests/f2fs/001)]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fix a problem of using memory after free
in function __try_merge_extent_node.
Fixes: 0f825ee6e8 ("f2fs: add new interfaces for extent tree")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We checked that "inode" is not an error pointer earlier so there is
no need to check again here.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we run out of memory, in cache_nat_entry, it's better to avoid loop
for allocating memory to cache nat entry, so in low memory scenario, for
read path of node block, I expect this can avoid unneeded latency.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch remove unused values in function recover_fsync_data
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable patches:
- NFSv4.1: Fix a deadlock in layoutget
- NFSv4 must not bump sequence ids on NFS4ERR_MOVED errors
- NFSv4 Fix a regression with OPEN EXCLUSIVE4 mode
- Fix a memory leak when removing the SUNRPC module
Bugfixes:
- Fix a reference leak in _pnfs_return_layout
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYjM6UAAoJEGcL54qWCgDy+WoP/2NdwqzOa3xxn8piMhLh/TLJ
lxWfVmUZJ8UNVSVl45NDslt2V63P0gpITUiaVo27rPfjKV8pv8XTU7HkxCuXzj0m
uo11jEErsYxbXELNDL2y1fmsPTygYvg4MAV3NK0inKv/ciu6TTMddoga+QQXGJPz
QqQ7znTHyjDSTqZM1D5+f8EhbY7AIAzJbkgVr90WPPjsT6YtdlTQpMb1O9tlDtfj
oYidIOxRphqLOYTdoD8NW4x19dlgTNIYyCgOFomCbpUTFKKK5RFjVy+bpml3NxV1
oUldw8lhiUE8x/5J/CfbCMt1bPZ3yJeFWNghINOIF7cgGnvKEjw0uO2o8Lk6fMzi
yF2aNoZcNQnyoHHCjsqK8t5dB51abwxlzeElrFB1u73B56CcXB91xEP0GsX1vFt1
BIGCnfsOvtYPZxp/OX+B3AweAJ2DRuIvdze9cS2IeHBW9c5XdVdDmMK7CFeL9dc3
YEsWmaWNWtDt41erxCvQ0ezkr3lPHXOcjwS440LIiVpr38/rzjp81+Tf+i2D1GzB
XTsy9IHzaC2zLb0Vz5vOR3S10k4OiFtXT6ikoZLwgpTfI2gt146e0YHx3g90AiFh
QYBRNLJrjSYVedjv2GGFLG1WWsHbhbP6dpmT3BwNmhZTzjPoqZ76f5QFIQUya4jt
4Y2ksyhx7pOeOJy++6SN
=S5qH
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.10-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Stable patches:
- NFSv4.1: Fix a deadlock in layoutget
- NFSv4 must not bump sequence ids on NFS4ERR_MOVED errors
- NFSv4 Fix a regression with OPEN EXCLUSIVE4 mode
- Fix a memory leak when removing the SUNRPC module
Bugfixes:
- Fix a reference leak in _pnfs_return_layout"
* tag 'nfs-for-4.10-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
pNFS: Fix a reference leak in _pnfs_return_layout
nfs: Fix "Don't increment lock sequence ID after NFS4ERR_MOVED"
SUNRPC: cleanup ida information when removing sunrpc module
NFSv4.0: always send mode in SETATTR after EXCLUSIVE4
nfs: Don't increment lock sequence ID after NFS4ERR_MOVED
NFSv4.1: Fix a deadlock in layoutget
And require all drivers that want to support BLOCK_PC to allocate it
as the first thing of their private data. To support this the legacy
IDE and BSG code is switched to set cmd_size on their queues to let
the block layer allocate the additional space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
When CONFIG_POSIX_TIMERS is disabled, it is preferable to remove related
structures from struct task_struct and struct signal_struct as they
won't contain anything useful and shouldn't be relied upon by mistake.
Code still referencing those structures is also disabled here.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
- Fix race conditions in the CoW code
- Fix some incorrect input validation checks
- Avoid crashing fs by running out of space when freeing inodes
- Fix toctou race wrt whether or not an inode has an attr
- Fix build error on arm
- Fix page refcount corruption when readahead fails
- Don't corrupt userspace in the bmap ioctl
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJYi4RJAAoJEPh/dxk0SrTrsU4QAIZBUUSpvpwggyfbTcOb6QWb
F/vBoj50f+cxZB2jxLrch4aRsvlhW6IkRnsKiciG4cQaDbjPbjhSMH4JEUUqrPvG
TRTcBvT6rEbxnB3adIH2DVrDAaEEVpRSkPaV/vLjYEfJp8Nyv8yXb6U8zH//NeZQ
Pwrwe0RX0bJRyAEBIRwBnTayMP6xIccCE9Ml7+sMG/UVnb0Fa8t5zg/9igfd0MrB
xf3MdkW0fpFCcp1Bbby8cnDmTjjxEtB6OApL82UnSZ7l2/U5AHhiA0NgHreYuzt8
47ezqEQfk+IWK5LY1c6V/vARKhVvM738jS2dG1tsFhnbTbq9yXA2yiCMdA+sKB7+
wlRIuTq7tuhN4Lk/9eheXHR4xHKDbOKY+zWEWi/AlFRaWmld0otMykVC6wbp6soo
1gYgbaCjJcJcResKYAdby92jqvIRONqknpUF2L0jOiGIPgz8rmjA6BIvymjaXEuO
4MLfSjeVP4Ip2tDcaa0R3dSQ40lP778UQNiuqcKb1WODx1AyljJB+gemK0jE8kwN
OEY7IBSs+wP/UBYN+XbYhoIGlJ4ckwyhIctl4bMvVOduQ40uASlyQS6hmqng5Df/
NIFd+fCwuBCa45mwUJ2LPTzx3WndMyLv30z/ladtshV+WlUbu60yTzT4bIiQDcpZ
CYALhDjBiCHzrs6rIhxW
=Co7t
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-4.10-rc6-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs uodates from Darrick Wong:
"I have some more fixes this week: better input validation, corruption
avoidance, build fixes, memory leak fixes, and a couple from Christoph
to avoid an ENOSPC failure.
Summary:
- Fix race conditions in the CoW code
- Fix some incorrect input validation checks
- Avoid crashing fs by running out of space when freeing inodes
- Fix toctou race wrt whether or not an inode has an attr
- Fix build error on arm
- Fix page refcount corruption when readahead fails
- Don't corrupt userspace in the bmap ioctl"
* tag 'xfs-for-linus-4.10-rc6-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: prevent quotacheck from overloading inode lru
xfs: fix bmv_count confusion w/ shared extents
xfs: clear _XBF_PAGES from buffers when readahead page
xfs: extsize hints are not unlikely in xfs_bmap_btalloc
xfs: remove racy hasattr check from attr ops
xfs: use per-AG reservations for the finobt
xfs: only update mount/resv fields on success in __xfs_ag_resv_init
xfs: verify dirblocklog correctly
xfs: fix COW writeback race
Pull btrfs updates from Chris Mason:
"Some fixes that we've collected from the list.
We still have one more pending to nail down a regression in lzo
compression, but I wanted to get this batch out the door"
* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: remove ->{get, set}_acl() from btrfs_dir_ro_inode_operations
Btrfs: disable xattr operations on subvolume directories
Btrfs: remove old tree_root case in btrfs_read_locked_inode()
Btrfs: fix truncate down when no_holes feature is enabled
Btrfs: Fix deadlock between direct IO and fast fsync
btrfs: fix false enospc error when truncating heavily reflinked file
Pull block fixes from Jens Axboe:
"A set of fixes for this series. This contains:
- Set of fixes for the nvme target code
- A revert of patch from this merge window, causing a regression with
WRITE_SAME on iSCSI targets at least.
- A fix for a use-after-free in the new O_DIRECT bdev code.
- Two fixes for the xen-blkfront driver"
* 'for-linus' of git://git.kernel.dk/linux-block:
Revert "sd: remove __data_len hack for WRITE SAME"
nvme-fc: use blk_rq_nr_phys_segments
nvmet-rdma: Fix missing dma sync to nvme data structures
nvmet: Call fatal_error from keep-alive timout expiration
nvmet: cancel fatal error and flush async work before free controller
nvmet: delete controllers deletion upon subsystem release
nvmet_fc: correct logic in disconnect queue LS handling
block: fix use after free in __blkdev_direct_IO
xen-blkfront: correct maximum segment accounting
xen-blkfront: feature flags handling adjustments
ext4_journalled_write_end() did not propely handle all the cases when
generic_perform_write() did not copy all the data into the target page
and could mark buffers with uninitialized contents as uptodate and dirty
leading to possible data corruption (which would be quickly fixed by
generic_perform_write() retrying the write but still). Fix the problem
by carefully handling the case when the page that is written to is not
uptodate.
CC: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If filesystem groups are artifically small (using parameter -g to
mkfs.ext4), ext4_mb_normalize_request() can result in a request that is
larger than a block group. Trim the request size to not confuse
allocation code.
Reported-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Quotacheck runs at mount time in situations where quota accounting must
be recalculated. In doing so, it uses bulkstat to visit every inode in
the filesystem. Historically, every inode processed during quotacheck
was released and immediately tagged for reclaim because quotacheck runs
before the superblock is marked active by the VFS. In other words,
the final iput() lead to an immediate ->destroy_inode() call, which
allowed the XFS background reclaim worker to start reclaiming inodes.
Commit 17c12bcd3 ("xfs: when replaying bmap operations, don't let
unlinked inodes get reaped") marks the XFS superblock active sooner as
part of the mount process to support caching inodes processed during log
recovery. This occurs before quotacheck and thus means all inodes
processed by quotacheck are inserted to the LRU on release. The
s_umount lock is held until the mount has completed and thus prevents
the shrinkers from operating on the sb. This means that quotacheck can
excessively populate the inode LRU and lead to OOM conditions on systems
without sufficient RAM.
Update the quotacheck bulkstat handler to set XFS_IGET_DONTCACHE on
inodes processed by quotacheck. This causes ->drop_inode() to return 1
and in turn causes iput_final() to evict the inode. This preserves the
original quotacheck behavior and prevents it from overloading the LRU
and running out of memory.
CC: stable@vger.kernel.org # v4.9
Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This patch simply combines function meta_lo_add with its only
caller, trans_add_meta. This makes the code easier to read and
will make it easier to reduce contention on gfs2_log_lock.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch eliminates the int variable tr_touched in favor of a
new flag in the transaction. This is a step toward reducing contention
on the gfs2_log_lock spin_lock.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When you snapshot a subvolume containing a subvolume, you get a
placeholder directory where the subvolume would be. These directory
inodes have ->i_ops set to btrfs_dir_ro_inode_operations. Previously,
these i_ops didn't include the xattr operation callbacks. The conversion
to xattr_handlers missed this case, leading to bogus attempts to set
xattrs on these inodes. This manifested itself as failures when running
delayed inodes.
To fix this, clear IOP_XATTR in ->i_opflags on these inodes.
Fixes: 6c6ef9f26e ("xattr: Stop calling {get,set,remove}xattr inode operations")
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Reported-by: Chris Murphy <lists@colorremedies.com>
Tested-by: Chris Murphy <lists@colorremedies.com>
Cc: <stable@vger.kernel.org> # 4.9.x
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
As Jeff explained in c2951f32d3 ("btrfs: remove old tree_root dirent
processing in btrfs_real_readdir()"), supporting this old format is no
longer necessary since the Btrfs magic number has been updated since we
changed to the current format. There are other places where we still
handle this old format, but since this is part of a fix that is going to
stable, I'm only removing this one for now.
Cc: <stable@vger.kernel.org> # 4.9.x
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
IF NFS_LAYOUT_RETURN_REQUESTED is not set, then we currently exit
without freeing the list of invalidated layout segments, leading
to a reference leak.
Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes: 24408f5282 ("pNFS: Fix bugs in _pnfs_return_layout")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Lock sequence IDs are bumped in decode_lock by calling
nfs_increment_seqid(). nfs_increment_sequid() does not use the
seqid_mutating_err() function fixed in commit 059aa73482 ("Don't
increment lock sequence ID after NFS4ERR_MOVED").
Fixes: 059aa73482 ("Don't increment lock sequence ID after ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Xuan Qi <xuan.qi@oracle.com>
Cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In a bmapx call, bmv_count is the total size of the array, including the
zeroth element that userspace uses to supply the search key. The output
array starts at offset 1 so that we can set up the user for the next
invocation. Since we now can split an extent into multiple bmap records
due to shared/unshared status, we have to be careful that we don't
overflow the output array.
In the original patch f86f403794 ("xfs: teach get_bmapx about shared
extents and the CoW fork") I used cur_ext (the output index) to check
for overflows, albeit with an off-by-one error. Since nexleft no longer
describes the number of unfilled slots in the output, we can rip all
that out and use cur_ext for the overflow check directly.
Failure to do this causes heap corruption in bmapx callers such as
xfs_io and xfs_scrub. xfs/328 can reproduce this problem.
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we try to allocate memory pages to back an xfs_buf that we're trying
to read, it's possible that we'll be so short on memory that the page
allocation fails. For a blocking read we'll just wait, but for
readahead we simply dump all the pages we've collected so far.
Unfortunately, after dumping the pages we neglect to clear the
_XBF_PAGES state, which means that the subsequent call to xfs_buf_free
thinks that b_pages still points to pages we own. It then double-frees
the b_pages pages.
This results in screaming about negative page refcounts from the memory
manager, which xfs oughtn't be triggering. To reproduce this case,
mount a filesystem where the size of the inodes far outweighs the
availalble memory (a ~500M inode filesystem on a VM with 300MB memory
did the trick here) and run bulkstat in parallel with other memory
eating processes to put a huge load on the system. The "check summary"
phase of xfs_scrub also works for this purpose.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
With COW files they are the hotpath, just like for files with the
extent size hint attribute. We really shouldn't micro-manage anything
but failure cases with unlikely.
Additionally Arnd Bergmann recently reported that one of these two
unlikely annotations causes link failures together with an upcoming
kernel instrumentation patch, so let's get rid of it ASAP.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_attr_[get|remove]() have unlocked attribute fork checks to optimize
away a lock cycle in cases where the fork does not exist or is otherwise
empty. This check is not safe, however, because an attribute fork short
form to extent format conversion includes a transient state that causes
the xfs_inode_hasattr() check to fail. Specifically,
xfs_attr_shortform_to_leaf() creates an empty extent format attribute
fork and then adds the existing shortform attributes to it.
This means that lookup of an existing xattr can spuriously return
-ENOATTR when racing against a setxattr that causes the associated
format conversion. This was originally reproduced by an untar on a
particularly configured glusterfs volume, but can also be reproduced on
demand with properly crafted xattr requests.
The format conversion occurs under the exclusive ilock. xfs_attr_get()
and xfs_attr_remove() already have the proper locking and checks further
down in the functions to handle this situation correctly. Drop the
unlocked checks to avoid the spurious failure and rely on the existing
logic.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently we try to rely on the global reserved block pool for block
allocations for the free inode btree, but I have customer reports
(fairly complex workload, need to find an easier reproducer) where that
is not enough as the AG where we free an inode that requires a new
finobt block is entirely full. This causes us to cancel a dirty
transaction and thus a file system shutdown.
I think the right way to guard against this is to treat the finot the same
way as the refcount btree and have a per-AG reservations for the possible
worst case size of it, and the patch below implements that.
Note that this could increase mount times with large finobt trees. In
an ideal world we would have added a field for the number of finobt
fields to the AGI, similar to what we did for the refcount blocks.
We should do add it next time we rev the AGI or AGF format by adding
new fields.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Try to reserve the blocks first and only then update the fields in
or hanging off the mount structure. This way we can call __xfs_ag_resv_init
again after a previous failure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Linux 4.9 added two ioctl() operations that can be used to discover:
* the parental relationships for hierarchical namespaces (user and PID)
[NS_GET_PARENT]
* the user namespaces that owns a specified non-user-namespace
[NS_GET_USERNS]
For no good reason that I can glean, NS_GET_USERNS was made synonymous
with NS_GET_PARENT for user namespaces. It might have been better if
NS_GET_USERNS had returned an error if the supplied file descriptor
referred to a user namespace, since it suggests that the caller may be
confused. More particularly, if it had generated an error, then I wouldn't
need the new ioctl() operation proposed here. (On the other hand, what
I propose here may be more generally useful.)
I would like to write code that discovers namespace relationships for
the purpose of understanding the namespace setup on a running system.
In particular, given a file descriptor (or pathname) for a namespace,
N, I'd like to obtain the corresponding user namespace. Namespace N
might be a user namespace (in which case my code would just use N) or
a non-user namespace (in which case my code will use NS_GET_USERNS to
get the user namespace associated with N). The problem is that there
is no way to tell the difference by looking at the file descriptor
(and if I try to use NS_GET_USERNS on an N that is a user namespace, I
get the parent user namespace of N, which is not what I want).
This patch therefore adds a new ioctl(), NS_GET_NSTYPE, which, given
a file descriptor that refers to a user namespace, returns the
namespace type (one of the CLONE_NEW* constants).
Signed-off-by: Michael Kerrisk <mtk-manpages@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Commit 8a59f5d252 ("fs/romfs: return f_fsid for statfs(2)") generates
a 64bit id from sb->s_bdev->bd_dev. This is only correct when romfs is
defined with CONFIG_ROMFS_ON_BLOCK. If romfs is only defined with
CONFIG_ROMFS_ON_MTD, sb->s_bdev is NULL, referencing sb->s_bdev->bd_dev
will triger an oops.
Richard Weinberger points out that when CONFIG_ROMFS_BACKED_BY_BOTH=y,
both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD are defined.
Therefore when calling huge_encode_dev() to generate a 64bit id, I use
the follow order to choose parameter,
- CONFIG_ROMFS_ON_BLOCK defined
use sb->s_bdev->bd_dev
- CONFIG_ROMFS_ON_BLOCK undefined and CONFIG_ROMFS_ON_MTD defined
use sb->s_dev when,
- both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD undefined
leave id as 0
When CONFIG_ROMFS_ON_MTD is defined and sb->s_mtd is not NULL, sb->s_dev
is set to a device ID generated by MTD_BLOCK_MAJOR and mtd index,
otherwise sb->s_dev is 0.
This is a try-best effort to generate a uniq file system ID, if all the
above conditions are not meet, f_fsid of this romfs instance will be 0.
Generally only one romfs can be built on single MTD block device, this
method is enough to identify multiple romfs instances in a computer.
Link: http://lkml.kernel.org/r/1482928596-115155-1-git-send-email-colyli@suse.de
Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Nong Li <nongli1031@gmail.com>
Tested-by: Nong Li <nongli1031@gmail.com>
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have seen proc_pid_readdir() invocations holding cpu for more than 50
ms. Add a cond_resched() to be gentle with other tasks.
[akpm@linux-foundation.org: coding style fix]
Link: http://lkml.kernel.org/r/1484238380.15816.42.camel@edumazet-glaptop3.roam.corp.google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() despite the UFFDIO_COPY wasn't completed.
This seems caused by rwsem waking the thread blocked in
handle_userfault() and we can't run up_read() before the wait_event
sequence is complete.
Keeping the wait_even sequence identical to the first one, would require
running userfaultfd_must_wait() again to know if the loop should be
repeated, and it would also require retaking the rwsem and revalidating
the whole vma status.
It seems simpler to wait the targeted wakeup so that if false wakeups
materialize we still wait for our specific wakeup event, unless of
course there are signals or the uffd was released.
Debug code collecting the stack trace of the wakeup showed this:
$ ./userfaultfd 100 99999
nr_pages: 25600, nr_pages_per_cpu: 800
bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
bounces: 99997, mode: rnd ver poll, Bus error (core dumped)
save_stack_trace+0x2b/0x50
try_to_wake_up+0x2a6/0x580
wake_up_q+0x32/0x70
rwsem_wake+0xe0/0x120
call_rwsem_wake+0x1b/0x30
up_write+0x3b/0x40
vm_mmap_pgoff+0x9c/0xc0
SyS_mmap_pgoff+0x1a9/0x240
SyS_mmap+0x22/0x30
entry_SYSCALL_64_fastpath+0x1f/0xbd
0xffffffffffffffff
FAULT_FLAG_ALLOW_RETRY missing 70
CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G W 4.8.0+ #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0xb8/0x112
handle_userfault+0x572/0x650
handle_mm_fault+0x12cb/0x1520
__do_page_fault+0x175/0x500
trace_do_page_fault+0x61/0x270
do_async_page_fault+0x19/0x90
async_page_fault+0x25/0x30
This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking mmap_sem
down_write()) to allocate the thread stack of the background threads,
while locking/userfault threads already run at full throttle and are
susceptible to false wakeups that may cause handle_userfault() to return
before than expected (which results in graceful SIGBUS at the next
attempt).
This was reproduced only with >=32 CPUs because the loop to start the
thread where clone() is too quick with fewer CPUs, while with 32 CPUs
there's already significant activity on ~32 locking and userfault
threads when the last background threads are started with clone().
This >=32 CPUs SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates if
compared to real apps.
We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a
patch floating around that provides it also hidden this problem but in
reality only is successfully at hiding the problem.
False wakeups could still happen again the second time
handle_userfault() is invoked, even if it's a so rare race condition
that getting false wakeups twice in a row is impossible to reproduce.
This full fix is needed for correctness, the only alternative would be
to allow VM_FAULT_RETRY to be returned infinitely. With this fix the WP
support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
of returning it infinite times to avoid the SIGBUS).
Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As reported by Arnd:
https://lkml.org/lkml/2017/1/10/756
Compiling with the following configuration:
# CONFIG_EXT2_FS is not set
# CONFIG_EXT4_FS is not set
# CONFIG_XFS_FS is not set
# CONFIG_FS_IOMAP depends on the above filesystems, as is not set
CONFIG_FS_DAX=y
generates build warnings about unused functions in fs/dax.c:
fs/dax.c:878:12: warning: `dax_insert_mapping' defined but not used [-Wunused-function]
static int dax_insert_mapping(struct address_space *mapping,
^~~~~~~~~~~~~~~~~~
fs/dax.c:572:12: warning: `copy_user_dax' defined but not used [-Wunused-function]
static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size,
^~~~~~~~~~~~~
fs/dax.c:542:12: warning: `dax_load_hole' defined but not used [-Wunused-function]
static int dax_load_hole(struct address_space *mapping, void **entry,
^~~~~~~~~~~~~
fs/dax.c:312:14: warning: `grab_mapping_entry' defined but not used [-Wunused-function]
static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
^~~~~~~~~~~~~~~~~~
Now that the struct buffer_head based DAX fault paths and I/O path have
been removed we really depend on iomap support being present for DAX.
Make this explicit by selecting FS_IOMAP if we compile in DAX support.
This allows us to remove conditional selections of FS_IOMAP when FS_DAX
was present for ext2 and ext4, and to remove an #ifdef in fs/dax.c.
Link: http://lkml.kernel.org/r/1484087383-29478-1-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sb_dirblklog is added to sb_blocklog to compute the directory block size
in bytes. Therefore, we must compare the sum of both those values
against XFS_MAX_BLOCKSIZE_LOG, not just dirblklog.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Some nfsv4.0 servers may return a mode for the verifier following an open
with EXCLUSIVE4 createmode, but this does not mean the client should skip
setting the mode in the following SETATTR. It should only do that for
EXCLUSIVE4_1 or UNGAURDED createmode.
Fixes: 5334c5bdac ("NFS: Send attributes in OPEN request for NFS4_CREATE_EXCLUSIVE4_1")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org # v4.3+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We can't dereference the dio structure after submitting the last bio for
this request, as I/O completion might have happened before the code is
run. Introduce a local is_sync variable instead.
Fixes: 542ff7bf ("block: new direct I/O implementation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Matias Bjørling <m@bjorling.me>
Tested-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
We cannot call nfs4_handle_exception() without first ensuring that the
slot has been freed. If not, we end up deadlocking with the process
waiting for recovery to complete, and recovery waiting for the slot
table to drain.
Fixes: 2e80dbe7ac ("NFSv4.1: Close callback races for OPEN, LAYOUTGET...")
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead of making the files owned by the GLOBAL_ROOT_USER. Make
non-dumpable files whose mm has always lived in a user namespace owned
by the user namespace root. This allows the container root to have
things work as expected in a container.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
With previous changes every location that tests for
LSM_UNSAFE_PTRACE_CAP also tests for LSM_UNSAFE_PTRACE making the
LSM_UNSAFE_PTRACE_CAP redundant, so remove it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This patchset converts inotify to using the newly introduced
per-userns sysctl infrastructure.
Currently the inotify instances/watches are being accounted in the
user_struct structure. This means that in setups where multiple
users in unprivileged containers map to the same underlying
real user (i.e. pointing to the same user_struct) the inotify limits
are going to be shared as well, allowing one user(or application) to exhaust
all others limits.
Fix this by switching the inotify sysctls to using the
per-namespace/per-user limits. This will allow the server admin to
set sensible global limits, which can further be tuned inside every
individual user namespace. Additionally, in order to preserve the
sysctl ABI make the existing inotify instances/watches sysctls
modify the values of the initial user namespace.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Due to the way how xfs_iomap_write_allocate tries to convert the whole
found extents from delalloc to real space we can run into a race
condition with multiple threads doing writes to this same extent.
For the non-COW case that is harmless as the only thing that can happen
is that we call xfs_bmapi_write on an extent that has already been
converted to a real allocation. For COW writes where we move the extent
from the COW to the data fork after I/O completion the race is, however,
not quite as harmless. In the worst case we are now calling
xfs_bmapi_write on a region that contains hole in the COW work, which
will trip up an assert in debug builds or lead to file system corruption
in non-debug builds. This seems to be reproducible with workloads of
small O_DSYNC write, although so far I've not managed to come up with
a with an isolated reproducer.
The fix for the issue is relatively simple: tell xfs_bmapi_write
that we are only asked to convert delayed allocations and skip holes
in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The last BUG_ON in mb_find_extent() is apparently triggering in some
rare cases. Most of the time it indicates a bug in the buddy bitmap
algorithms, but there are some weird cases where it can trigger when
buddy bitmap is still in memory, but the block bitmap has to be read
from disk, and there is disk or memory corruption such that the block
bitmap and the buddy bitmap are out of sync.
Google-Bug-Id: #33702157
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
marked for stable) and two fixups for this merge window's patches.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJYghs8AAoJEEp/3jgCEfOLOz0IAI/xNUMO121S57GEhzkKDdWC
5PCHjg9itU+2eMCCZ2Nyuikj2NVwEFh9HLpMz5jtFa3oWCIhljh9wT8zlKDgpn5R
Q1GCT4LkHGhV+HA2sM04aynKBmC90ZVAHfDt/BTs5mLzW7neSpxFOQEPdS4FG6Zg
NxUGcI/GhqmfpcLnm5IqXxI1cc0bXf6BmEzlGrPAkvzJBhHXWKCVpr1Q/nBW96Q5
ko1EpP16wZoeRvsr1ztXmBTNURUrCi7S6PyK4M5MAro381U3a7zwQuFq9uuREahO
nJtCjWD3bd6U3ENDe/Gacz3czXQyjOjE2/w42jL1dA84UMQbz+wv1SyNCkQgiyI=
=1LTx
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.10-rc5' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Three filesystem endianness fixes (one goes back to the 2.6 era, all
marked for stable) and two fixups for this merge window's patches"
* tag 'ceph-for-4.10-rc5' of git://github.com/ceph/ceph-client:
ceph: fix bad endianness handling in parse_reply_info_extra
ceph: fix endianness bug in frag_tree_split_cmp
ceph: fix endianness of getattr mask in ceph_d_revalidate
libceph: make sure ceph_aes_crypt() IV is aligned
ceph: fix ceph_get_caps() interruption
Pull overlayfs fix from Miklos Szeredi:
"This fixes a regression introduced in this cycle"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix possible use after free on redirect dir lookup
Pull fuse fixes from Miklos Szeredi:
"Fix two regressions, one introduced in 4.9 and a less recent one in
4.2"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix time_to_jiffies nsec sanity check
fuse: clear FR_PENDING flag when moving requests out of pending queue
udf_fill_super() used udf_parse_options() to flag UDF_FLAG_BLOCKSIZE_SET
when blocksize was specified otherwise used 512 bytes
(bdev_logical_block_size) and 2048 bytes (UDF_DEFAULT_BLOCKSIZE)
IOW both 1024 and 4096 specifications were required or resulted in
"mount: wrong fs type, bad option, bad superblock on /dev/loop1"
This patch loops through different block values but also updates
udf_load_vrs() to return -EINVAL instead of 0 when udf_check_vsd()
fails (and uopt->novrs = 0).
The later being the reason for the RFC; we have that case when mounting
a 4kb blocksize against other values but maybe VRS is not mandatory
there ?
Tested with 512, 1024, 2048 and 4096 blocksize
Reported-by: Jan Kara <jack@suse.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
- Inode i_mode sanitization
- Prevent overflows in getnextquota
- Minor build fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJYf9VQAAoJEPh/dxk0SrTr6CMQAJRUULHRS2SJe22pJSRzkjsl
H838pav3bmTDGafWx5SgaazDHsVd205DHLU9ovi9zyqQGviG1kgIl9a7TvXtEU6R
DXbjnxRTb9tgDbFKYstvl0lDXiWUFoB1nMOkNa8BDhBUe1sGadMXY5gh+CnpoKeg
qEwNiE3yUJpnPVog4d+SRQSMTUbD3VQBtkh4IcucR+zqWsOiT7jiepqJvee5ZXYj
Wz2Eo0QFhSGq7lk9IRl7C3ter77QMz6f4Ba3oZMjM2WD317JmKk5vDU1B7mHgRIZ
cnwOayICECKs9GQWmgD5ew+GEaCRHuSzeoiP72O2bjwiLgJbG/eHQJ/8T7t3sgV0
mAIYZJXf8jjmTjzjV3n7aEt4iTKYssaLvmMQfg6AYwDXapzm7xYkcSDQa1pDFnEd
DKDBsd17Xx399dtD/pjSw4X+/Z2ILEhcLvVKOO68jvM/wXIVQENKRt6/5QEUdBkw
HjuLccA5e5gpjZ00S7cZo44TRgbJPG+hw611Fy3reJPW1sHQJEbpspOMIRB3vpff
HxoTQm864ZNxR7pXv5sX9gxHp4JLAZ2wfxa+iLe/0rAelq01OHlz7fygc27jXilU
v6URno92PbARzvtyeVAZ+FJLRBpQy90Y4ZRM87GDebCGCDo8xqtxxhiG1xIqriON
PYLuQNeF2MELGUm3nlJO
=4Mxu
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linux-4.10-rc5-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"I have a few more patches this week -- one to make the behavior of a
quota id ioctl consistent with the other filesystems, and the rest
improve validation of i_mode & i_size values coming into xfs so that
we don't read off the ends of arrays or crash when handed garbage disk
data.
Summary:
- inode i_mode sanitization
- prevent overflows in getnextquota
- minor build fixes"
* tag 'xfs-for-linux-4.10-rc5-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix xfs_mode_to_ftype() prototype
xfs: don't wrap ID in xfs_dq_get_next_id
xfs: sanity check inode di_mode
xfs: sanity check inode mode when creating new dentry
xfs: replace xfs_mode_to_ftype table with switch statement
xfs: add missing include dependencies to xfs_dir2.h
xfs: sanity check directory inode di_size
xfs: make the ASSERT() condition likely
For such a file mapping,
[0-4k][hole][8k-12k]
In NO_HOLES mode, we don't have the [hole] extent any more.
Commit c1aa45759e ("Btrfs: fix shrinking truncate when the no_holes feature is enabled")
fixed disk isize not being updated in NO_HOLES mode when data is not flushed.
However, even if data has been flushed, we can still have trouble
in updating disk isize since we updated disk isize to 'start' of
the last evicted extent.
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following deadlock is seen when executing generic/113 test,
---------------------------------------------------------+----------------------------------------------------
Direct I/O task Fast fsync task
---------------------------------------------------------+----------------------------------------------------
btrfs_direct_IO
__blockdev_direct_IO
do_blockdev_direct_IO
do_direct_IO
btrfs_get_blocks_direct
while (blocks needs to written)
get_more_blocks (first iteration)
btrfs_get_blocks_direct
btrfs_create_dio_extent
down_read(&BTRFS_I(inode) >dio_sem)
Create and add extent map and ordered extent
up_read(&BTRFS_I(inode) >dio_sem)
btrfs_sync_file
btrfs_log_dentry_safe
btrfs_log_inode_parent
btrfs_log_inode
btrfs_log_changed_extents
down_write(&BTRFS_I(inode) >dio_sem)
Collect new extent maps and ordered extents
wait for ordered extent completion
get_more_blocks (second iteration)
btrfs_get_blocks_direct
btrfs_create_dio_extent
down_read(&BTRFS_I(inode) >dio_sem)
--------------------------------------------------------------------------------------------------------------
In the above description, Btrfs direct I/O code path has not yet started
submitting bios for file range covered by the initial ordered
extent. Meanwhile, The fast fsync task obtains the write semaphore and
waits for I/O on the ordered extent to get completed. However, the
Direct I/O task is now blocked on obtaining the read semaphore.
To resolve the deadlock, this commit modifies the Direct I/O code path
to obtain the read semaphore before invoking
__blockdev_direct_IO(). The semaphore is then given up after
__blockdev_direct_IO() returns. This allows the Direct I/O code to
complete I/O on all the ordered extents it creates.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Below test script can reveal this bug:
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
dev=$(losetup --show -f fs.img)
mkdir -p /mnt/mntpoint
mkfs.btrfs -f $dev
mount $dev /mnt/mntpoint
cd /mnt/mntpoint
echo "workdir is: /mnt/mntpoint"
blocksize=$((128 * 1024))
dd if=/dev/zero of=testfile bs=$blocksize count=1
sync
count=$((17*1024*1024*1024/blocksize))
echo "file size is:" $((count*blocksize))
for ((i = 1; i <= $count; i++)); do
dst_offset=$((blocksize * i))
xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
testfile > /dev/null
done
sync
truncate --size 0 testfile
The last truncate operation will fail for ENOSPC reason, but indeed
it should not fail.
In btrfs_truncate(), we use a temporary block_rsv to do truncate
operation. With every btrfs_truncate_inode_items() call, we migrate space
to this block_rsv, but forget to cleanup previous reservation, which
will make this block_rsv's reserved bytes keep growing, and this reserved
space will only be released in the end of btrfs_truncate(), this metadata
leak will impact other's metadata reservation. In this case, it's
"btrfs_start_transaction(root, 2);" fails for enospc error, which make
this truncate operation fail.
Call btrfs_block_rsv_release() to fix this bug.
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are a number of usermode helper binaries that are "hard coded" in
the kernel today, so mark them as "const" to make it harder for someone
to change where the variables point to.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Sailer <t.sailer@alumni.ethz.ch>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Johan Hovold <johan@kernel.org>
Cc: Alex Elder <elder@kernel.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A harmless warning just got introduced:
fs/xfs/libxfs/xfs_dir2.h:40:8: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers]
Removing the 'const' modifier avoids the warning and has no
other effect.
Fixes: 1fc4d33fed ("xfs: replace xfs_mode_to_ftype table with switch statement")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
sparse says:
fs/ceph/mds_client.c:291:23: warning: restricted __le32 degrades to integer
fs/ceph/mds_client.c:293:28: warning: restricted __le32 degrades to integer
fs/ceph/mds_client.c:294:28: warning: restricted __le32 degrades to integer
fs/ceph/mds_client.c:296:28: warning: restricted __le32 degrades to integer
The op value is __le32, so we need to convert it before comparing it.
Cc: stable@vger.kernel.org # needs backporting for < 3.14
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
sparse says:
fs/ceph/inode.c:308:36: warning: incorrect type in argument 1 (different base types)
fs/ceph/inode.c:308:36: expected unsigned int [unsigned] [usertype] a
fs/ceph/inode.c:308:36: got restricted __le32 [usertype] frag
fs/ceph/inode.c:308:46: warning: incorrect type in argument 2 (different base types)
fs/ceph/inode.c:308:46: expected unsigned int [unsigned] [usertype] b
fs/ceph/inode.c:308:46: got restricted __le32 [usertype] frag
We need to convert these values to host-endian before calling the
comparator.
Fixes: a407846ef7 ("ceph: don't assume frag tree splits in mds reply are sorted")
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Commit 5c341ee328 ("ceph: fix scheduler warning due to nested
blocking") causes infinite loop when process is interrupted. Fix it.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ovl_lookup_layer() iterates on path elements of d->name.name
but also frees and allocates a new pointer for d->name.name.
For the case of lookup in upper layer, the initial d->name.name
pointer is stable (dentry->d_name), but for lower layers, the
initial d->name.name can be d->redirect, which can be freed during
iteration.
[SzM]
Keep the count of remaining characters in the redirect path and calculate
the current position from that. This works becuase only the prefix is
modified, the ending always stays the same.
Fixes: 02b69b284c ("ovl: lookup redirects")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The GETNEXTQOTA ioctl takes whatever ID is sent in,
and looks for the next active quota for an user
equal or higher to that ID.
But if we are at the maximum ID and then ask for the "next"
one, we may wrap back to zero. In this case, userspace
may loop forever, because it will start querying again
at zero.
We'll fix this in userspace as well, but for the kernel,
return -ENOENT if we ask for the next quota ID
past UINT_MAX so the caller knows to stop.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Check for invalid file type in xfs_dinode_verify()
and fail to load the inode structure from disk.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The helper xfs_dentry_to_name() is used by 2 different
classes of callers: Callers that pass zero mode and don't care
about the returned name.type field and Callers that pass
non zero mode and do care about the name.type field.
Change xfs_dentry_to_name() to not take the mode argument and
change the call sites of the first class to not pass the mode
argument.
Create a new helper xfs_dentry_mode_to_name() which does pass
the mode argument and returns -EFSCORRUPTED if mode is invalid.
Callers that translate non zero mode to on-disk file type now
check the return value and will export the error to user instead
of staging an invalid file type to be written to directory entry.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The size of the xfs_mode_to_ftype[] conversion table
was too small to handle an invalid value of mode=S_IFMT.
Instead of fixing the table size, replace the conversion table
with a conversion helper that uses a switch statement.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_dir2.h dereferences some data types in inline functions
and fails to include those type definitions, e.g.:
xfs_dir2_data_aoff_t, struct xfs_da_geometry.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This changes fixes an assertion hit when fuzzing on-disk
i_mode values.
The easy case to fix is when changing an empty file
i_mode to S_IFDIR. In this case, xfs_dinode_verify()
detects an illegal zero size for directory and fails
to load the inode structure from disk.
For the case of non empty file whose i_mode is changed
to S_IFDIR, the ASSERT() statement in xfs_dir2_isblock()
is replaced with return -EFSCORRUPTED, to avoid interacting
with corrupted jusk also when XFS_DEBUG is disabled.
Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The ASSERT() condition is the normal case, not the exception,
so testing the condition should be likely(), not unlikely().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When replaying the journal it can happen that a journal entry points to
a garbage collected node.
This is the case when a power-cut occurred between a garbage collect run
and a commit. In such a case nodes have to be read using the failable
read functions to detect whether the found node matches what we expect.
One corner case was forgotten, when the journal contains an entry to
remove an inode all xattrs have to be removed too. UBIFS models xattr
like directory entries, so the TNC code iterates over
all xattrs of the inode and removes them too. This code re-uses the
functions for walking directories and calls ubifs_tnc_next_ent().
ubifs_tnc_next_ent() expects to be used only after the journal and
aborts when a node does not match the expected result. This behavior can
render an UBIFS volume unmountable after a power-cut when xattrs are
used.
Fix this issue by using failable read functions in ubifs_tnc_next_ent()
too when replaying the journal.
Cc: stable@vger.kernel.org
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Reported-by: Rock Lee <rockdotlee@gmail.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
In several places, ubifs checked for an encryption key before creating a
file in an encrypted directory. This was redundant with
fscrypt_setup_filename() or ubifs_new_inode(), and in the case of
ubifs_link() it broke linking to special files. So remove the extra
checks.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
The ubifs encryption ioctls did not work when called by a 32-bit program
on a 64-bit kernel. Since 'struct fscrypt_policy' is not affected by
the word size, ubifs just needs to allow these ioctls through, like what
ext4 and f2fs do.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
This came up during the v4.10 merge window:
warning: (UBIFS_FS_ENCRYPTION) selects FS_ENCRYPTION which has unmet direct dependencies (BLOCK)
fs/crypto/crypto.c: In function 'fscrypt_zeroout_range':
fs/crypto/crypto.c:355:9: error: implicit declaration of function 'bio_alloc';did you mean 'd_alloc'? [-Werror=implicit-function-declaration]
bio = bio_alloc(GFP_NOWAIT, 1);
The easiest way out is to limit UBIFS_FS_ENCRYPTION to configurations
that also enable BLOCK.
Fixes: d475a50745 ("ubifs: Add skeleton for fscrypto")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
err is no longer being set on a successful return path, causing
a garbage value being returned. Fix this by setting err to zero
for the successful return path.
Found with static analysis by CoverityScan, CID 1389473
Fixes: 7799953b34 ("ubifs: Implement encrypt/decrypt for all IO")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Bugfixes:
- Fix invalid fget()/fput() calls when doing file locking
- Fix multiple directory cache invalidation issues due to the client failing
to recognise that the directory wasn't changed.
- Fix client recovery when server reboots multiple times
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYfSXZAAoJEGcL54qWCgDy3K8P/0YGi1J5VoWkcMD5+Ljh2H3O
AV1oZ3EWLXH2jlJZmX7A1A9ReOEmsYmCX+8QM2ZWbBZI+vIzEJEWd7fU+nhtgfiq
pcw29CSpeBhfcqSswlRE6ouxKpmufnzhwbuGcy9WUTyGYaz1k05/4YHTunyxiZr6
QRHL2Y0YkU0cWD1FHwoOkQr8ft7kdQ4iXHru9utttdYsb4vJnX9ShLpaLW5MDBqS
hPd1ivJHPlWiOVEzLSlTHBvTvw8j5PjhsRQ/q5UwiDvQQDoWL7/qnP4nxZee44mh
MQ+61/0WxS+ohMkIYrZ6hPezk2QX/i7JRQvBPNU07U/+TqtsWX/F5+OerCeQ4/6W
RhP7XDjJV8TpeeJcUU58r1aoEpaWRt9cmtwBWMKBu9tTHCFyRXHsaIh1ZR+Hy85e
HUiiaD4rPldr+ZbCcTwR5bSSkF96WaRF9B3L8Nvgys9v4wwO6LuTnHv5uY8H7ct5
I66nL14drrA+jC01D+SuMl0AnFhrCp7mJyJbrZDvXBJ9hxzzgEj/Jz3IC0mc0SxI
ZjFhQoG2NjV2oJ0PTu9RMUw+Fex0yz6PsXoHLKr45VXkkwQL3Uldq6SVWfkkzqUk
SPFTYD49i9TqCHhGZEzm7kGNM6ASUDKQtAloiUA2QvL6v178RLwG2sjYxy2kEApz
s208kzh8He79iMLEkphR
=K8F7
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.10-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- fix invalid fget()/fput() calls when doing file locking
- fix multiple directory cache invalidation issues due to the client
failing to recognise that the directory wasn't changed
- fix client recovery when server reboots multiple times
* tag 'nfs-for-4.10-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Fix client recovery when server reboots multiple times
NFSv4: update_changeattr should update the attribute timestamp
NFSv4: Don't call update_changeattr() unless the unlink is successful
NFSv4: Don't apply change_info4 twice on rename within a directory
NFSv4: Call update_changeattr() from _nfs4_proc_open only if a file was created
nfs: Don't take a reference on fl->fl_file for LOCK operation
The bulk readpages support introduced a harmless warning:
fs/afs/file.c: In function 'afs_readpages_page_done':
fs/afs/file.c:270:20: error: unused variable 'vnode' [-Werror=unused-variable]
This adds an #ifdef to match the user of that variable. The user of the
variable has to be conditional because it accesses a member of a struct
that is also conditional.
Fixes: 91b467e0a3 ("afs: Make afs_readpages() fetch data in bulk")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bugs.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYfOk6AAoJECebzXlCjuG+Lj4QALaLKRRbIdrz6nmg7gUmpTWc
CdW8NMbzwSCXmYoivsTHBlhXZKsi5vVjnFXMCM/P85ddmipXdcTFCDLmmNoKUQ0M
jODlLX90ctaZKCDBVSaH4htAz2gkFv7z5IllX0YDQqHyiuzh/9KoV+AFCgPZPTpL
O1XRmfWz+yJDydz4hb3i5f2JvMk9P/tCXLnheuxxTIMSl2/fIfgF81eWwDpFqcA2
27+PyWWjZehVnZ77ca/mWJj2n0+gBINiKafcfF39NK/Hv2q4aauB3k7c4blecc9Q
m/IT3mKifvHvdNCmvHD5s74h4OikEGYpqaSjonMptZnWgfM4/gtF7yTiQjsOMDx/
w6W/tfHlGrvegpzhjaIaoZZ50EZp7xwGNNZYgH4J44kytYpolrhsOR6NqCLTqpej
xG2Kd89ZtnAgc/7T7ET/1PqpZ8f9M9pyV3E8s36OvF4AYQUNrfzbWSTQcZy3WGBP
YuoUCzacIbNbGgu4m6Zx5l/vKW5yn45xbUMp7T9S4WoxYMx6a5vViU0NiF7KsQDu
pcDT92DZ57KJFtCw7Ig08ILKsSXmNApH5/4mIrkX3quZuH4j2XapEJ9u//fmfZBd
Q+Sgv8RXcGELUJIg9yfmoWgPDA/oYslc7ynBV0lXLNgBuod//dGSlZ+6KfFFJYr8
XVOxwPTiiBIlc9lvB9eA
=tb4L
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.10-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Miscellaneous nfsd bugfixes, one for a 4.10 regression, three for
older bugs"
* tag 'nfsd-4.10-1' of git://linux-nfs.org/~bfields/linux:
svcrdma: avoid duplicate dma unmapping during error recovery
sunrpc: don't call sleeping functions from the notifier block callbacks
svcrpc: don't leak contexts on PROC_DESTROY
nfsd: fix supported attributes for acl & labels
Pull namespace fixes from Eric Biederman:
"This tree contains 4 fixes.
The first is a fix for a race that can causes oopses under the right
circumstances, and that someone just recently encountered.
Past that are several small trivial correct fixes. A real issue that
was blocking development of an out of tree driver, but does not appear
to have caused any actual problems for in-tree code. A potential
deadlock that was reported by lockdep. And a deadlock people have
experienced and took the time to track down caused by a cleanup that
removed the code to drop a reference count"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
sysctl: Drop reference added by grab_header in proc_sys_readdir
pid: fix lockdep deadlock warning due to ucount_lock
libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount
mnt: Protect the mountpoint hashtable with mount_lock
Pull vfs fixes from Al Viro.
The most notable fix here is probably the fix for a splice regression
("fix a fencepost error in pipe_advance()") noticed by Alan Wylie.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix a fencepost error in pipe_advance()
coredump: Ensure proper size of sparse core files
aio: fix lock dep warning
tmpfs: clear S_ISGID when setting posix ACLs
Pull block fixes from Jens Axboe:
- the virtio_blk stack DMA corruption fix from Christoph, fixing and
issue with VMAP stacks.
- O_DIRECT blkbits calculation fix from Chandan.
- discard regression fix from Christoph.
- queue init error handling fixes for nbd and virtio_blk, from Omar and
Jeff.
- two small nvme fixes, from Christoph and Guilherme.
- rename of blk_queue_zone_size and bdev_zone_size to _sectors instead,
to more closely follow what we do in other places in the block layer.
This interface is new for this series, so let's get the naming right
before releasing a kernel with this feature. From Damien.
* 'for-linus' of git://git.kernel.dk/linux-block:
block: don't try to discard from __blkdev_issue_zeroout
sd: remove __data_len hack for WRITE SAME
nvme: use blk_rq_payload_bytes
scsi: use blk_rq_payload_bytes
block: add blk_rq_payload_bytes
block: Rename blk_queue_zone_size and bdev_zone_size
nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
nvme-rdma: fix nvme_rdma_queue_is_ready
virtio_blk: fix panic in initialization error path
nbd: blk_mq_init_queue returns an error code on failure, not NULL
virtio_blk: avoid DMA to stack for the sense buffer
do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned
If the last section of a core file ends with an unmapped or zero page,
the size of the file does not correspond with the last dump_skip() call.
gdb complains that the file is truncated and can be confusing to users.
After all of the vma sections are written, make sure that the file size
is no smaller than the current file position.
This problem can be demonstrated with gdb's bigcore testcase on the
sparc architecture.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@vger.kernel.org>
file_info_lock is not initalized in initiate_cifs_search(), leading to the
following splat after a simple "mount.cifs ... dir && ls dir/":
BUG: spinlock bad magic on CPU#0, ls/486
lock: 0xffff880009301110, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
CPU: 0 PID: 486 Comm: ls Not tainted 4.9.0 #27
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
ffffc900042f3db0 ffffffff81327533 0000000000000000 ffff880009301110
ffffc900042f3dd0 ffffffff810baf75 ffff880009301110 ffffffff817ae077
ffffc900042f3df0 ffffffff810baff6 ffff880009301110 ffff880008d69900
Call Trace:
[<ffffffff81327533>] dump_stack+0x65/0x92
[<ffffffff810baf75>] spin_dump+0x85/0xe0
[<ffffffff810baff6>] spin_bug+0x26/0x30
[<ffffffff810bb159>] do_raw_spin_lock+0xe9/0x130
[<ffffffff8159ad2f>] _raw_spin_lock+0x1f/0x30
[<ffffffff8127e50d>] cifs_closedir+0x4d/0x100
[<ffffffff81181cfd>] __fput+0x5d/0x160
[<ffffffff81181e3e>] ____fput+0xe/0x10
[<ffffffff8109410e>] task_work_run+0x7e/0xa0
[<ffffffff81002512>] exit_to_usermode_loop+0x92/0xa0
[<ffffffff810026f9>] syscall_return_slowpath+0x49/0x50
[<ffffffff8159b484>] entry_SYSCALL_64_fastpath+0xa7/0xa9
Fixes: 3afca265b5 ("Clarify locking of cifs file and tcon structures and make more granular")
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Since we need to change the implementation, stop exposing internals.
Provide kref_read() to read the current reference count; typically
used for debug messages.
Kills two anti-patterns:
atomic_read(&kref->refcount)
kref->refcount.counter
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we need to change the implementation, stop exposing internals.
Provide KREF_INIT() to allow static initialization of struct kref.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When an ext4 fs is bogged down by a lot of metadata IOs (in the
reported case, it was deletion of millions of files, but any massive
amount of journal writes would do), after the journal is filled up,
tasks which try to access the filesystem and aren't currently
performing the journal writes end up waiting in
__jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.
Because those mutex sleeps aren't marked as iowait, this condition can
lead to misleadingly low iowait and /proc/stat:procs_blocked. While
iowait propagation is far from strict, this condition can be triggered
fairly easily and annotating these sleeps correctly helps initial
diagnosis quite a bit.
Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that
these sleeps are properly marked as iowait.
Reported-by: Mingbo Wan <mingbo@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Link: http://lkml.kernel.org/r/1477673892-28940-5-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull btrfs fixes from Chris Mason:
"These are all over the place.
The tracepoint part of the pull fixes a crash and adds a little more
information to two tracepoints, while the rest are good old fashioned
fixes"
* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: make tracepoint format strings more compact
Btrfs: add truncated_len for ordered extent tracepoints
Btrfs: add 'inode' for extent map tracepoint
btrfs: fix crash when tracepoint arguments are freed by wq callbacks
Btrfs: adjust outstanding_extents counter properly when dio write is split
Btrfs: fix lockdep warning about log_mutex
Btrfs: use down_read_nested to make lockdep silent
btrfs: fix locking when we put back a delayed ref that's too new
btrfs: fix error handling when run_delayed_extent_op fails
btrfs: return the actual error value from from btrfs_uuid_tree_iterate
window.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJYeQymAAoJEEp/3jgCEfOLLVsH/28qRsjVPWr5JuL1SF86//kd
rAi7QUfbNgXHqbb10a9za9pNuLhHr3kImIfvQ04wYiYQY+IaAapiRXwQev8BsNAa
yENUc8XwNgydw4FU1ia5PkGOJLDtujtfgjWT2v+gf1HUzLaV6alBzqDwUZBt3xJz
mlYC82oFkXPa0BFmLUXtT/jJu/ZI8caO4KB34/UKi7LjBQk1ca7E2xVUoDtdQmEm
ciPE98akU4JiB99aOgGdwemBzkAMHEGQpImTzqHr/tbIUj0MqVAjH9FVOhRCbjMy
6MSR+U9yUzJkBzefS5enijAoExVc8cD/A0nIaKGVb6qWrIrk51/Opl6iILeVLUo=
=28cq
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.10-rc4' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Two small fixups for the filesystem changes that went into this merge
window"
* tag 'ceph-for-4.10-rc4' of git://github.com/ceph/ceph-client:
ceph: fix get_oldest_context()
ceph: fix mds cluster availability check
If the server reboots multiple times, the client should rely on the
server to tell it that it cannot reclaim state as per section 9.6.3.4
in RFC7530 and section 8.4.2.1 in RFC5661.
Currently, the client is being to conservative, and is assuming that
if the server reboots while state recovery is in progress, then it must
ignore state that was not recovered before the reboot.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Commit bcb6f6d2b9 ("fuse: use timespec64") introduced clamped nsec values
in time_to_jiffies but used the max of nsec and NSEC_PER_SEC - 1 instead of
the min. Because of this, dentries would stay in the cache longer than
requested and go stale in scenarios that relied on their timely eviction.
Fixes: bcb6f6d2b9 ("fuse: use timespec64")
Signed-off-by: David Sheets <dsheets@docker.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 4.9
fuse_abort_conn() moves requests from pending list to a temporary list
before canceling them. This operation races with request_wait_answer()
which also tries to remove the request after it gets a fatal signal. It
checks FR_PENDING flag to determine whether the request is still in the
pending list.
Make fuse_abort_conn() clear FR_PENDING flag so that request_wait_answer()
does not remove the request from temporary list.
This bug causes an Oops when trying to delete an already deleted list entry
in end_requests().
Fixes: ee314a870e ("fuse: abort: no fc->lock needed for request ending")
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 4.2+
Oops--in 916d2d844a I moved some constants into an array for
convenience, but here I'm accidentally writing to that array.
The effect is that if you ever encounter a filesystem lacking support
for ACLs or security labels, then all queries of supported attributes
will report that attribute as unsupported from then on.
Fixes: 916d2d844a "nfsd: clean up supported attribute handling"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If a file is renamed, but stays in the same directory, we will still receive
2 change_info4 structures describing the change to that directory, but we
only want to apply it once.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We don't want to invalidate the directory attribute and data cache unless we
know that a file was created, or the change attribute differs from the one
in our cache.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
- Fix free space request handling when low on disk space
- Remove redundant log failure error messages
- Free truncate dirty pages instead of letting them build up forever
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJYdnh5AAoJEPh/dxk0SrTrI5sP/2FYAxrj3Iw684e92PWeIe1c
pZir+0LEmlXgly4wprli7/BO6WZjESuMF3huBZZxtYLx3UUvGRavHup620z3Xa2e
YRdSiBnvchOe3F0UfFO9wT15vEOzwWS61FfX/g35t4mD6ItY0XISesTx4+XA3Cqi
8zf7OAjXI5WQietthwc9zmfhpyWgyw8CkeXVtqd5whNJN/6E80wh5IE6D6cJ1xkX
2qNicrrruTUACPyJxrTrR/0kkoxDoYgBapy3kqV0R1uK+ttNrGlGjfdapKs0tqMb
ezksj0sy/rx+HjfGEHx52mTiYRRTHEsGt/yMa1pT8o0p2wvR0nxnvzkvV8DeMoX3
0wYRxATsv/t8Oog+ug6khB/FtppSGJML+XxG3bV9itgCkAIFAbb7vZRrThnzVqOr
ChbQvBhchY4lYA1pef862QHLRraJF84HuY/ypyE6DX+nQWkGjDfEeHFPcm6eVSXG
xEplK8wjs09pSklH/OQD+GsO4eedBHc4hdzDuBTjCmt867/Lk8Uw4RIjGMYjbBx2
ovU3pTHfdlO6/qJmfSWnBAAZjKAjna/p47ZfDLzio23hb1hK68Qe/z8+tPnAzkHW
rowA6KfBi13tX6Njds/lceiYMrnMRPqNBA+Og1wCyljP2Q4shbADa6czrFDzPn6P
xZbfhKK2CTbYSvxi5S3G
=8bsr
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-4.10-rc4-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"As promised last week, here's some stability fixes from Christoph and
Jan Kara:
- fix free space request handling when low on disk space
- remove redundant log failure error messages
- free truncated dirty pages instead of letting them build up
forever"
* tag 'xfs-for-linus-4.10-rc4-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Timely free truncated dirty pages
xfs: don't print warnings when xfs_log_force fails
xfs: don't rely on ->total in xfs_alloc_space_available
xfs: adjust allocation length in xfs_alloc_space_available
xfs: fix bogus minleft manipulations
xfs: bump up reserved blocks in xfs_alloc_set_aside
For no snapshot case, we should use ci->truncate_{seq,size}.
Fixes: 5f743e4566 ("ceph: record truncate size/seq for snap data writeback")
Signed-off-by: Geng, Jichao <geng.jichao@h3c.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
We should apply the check after getting the initial mdsmap.
Fixes: e9e427f0a1 ("ceph: check availability of mds cluster on mount")
Link: http://tracker.ceph.com/issues/18161
Signed-off-by: Yan, Zheng <zyan@redhat.com>
I have reports of a crash that look like __fput() was called twice for
a NFSv4.0 file. It seems possible that the state manager could try to
reclaim a lock and take a reference on the fl->fl_file at the same time the
file is being released if, during the close(), a signal interrupts the wait
for outstanding IO while removing locks which then skips the removal
of that lock.
Since 83bfff23e9 ("nfs4: have do_vfs_lock take an inode pointer") has
removed the need to traverse fl->fl_file->f_inode in nfs4_lock_done(),
taking that reference is no longer necessary.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
All block device data fields and functions returning a number of 512B
sectors are by convention named xxx_sectors while names in the form
xxx_size are generally used for a number of bytes. The blk_queue_zone_size
and bdev_zone_size functions were not following this convention so rename
them.
No functional change is introduced by this patch.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Collapsed the two patches, they were nonsensically split and broke
bisection.
Signed-off-by: Jens Axboe <axboe@fb.com>
There is no need to call ext4_mark_inode_dirty while holding xattr_sem
or i_data_sem, so where it's easy to avoid it, move it out from the
critical region.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In order to test the inode extra isize expansion code, it is useful to
be able to easily create file systems that have inodes with extra
isize values smaller than the current desired value.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit 99579ccec4 "xfs: skip dirty pages in ->releasepage()" started
to skip dirty pages in xfs_vm_releasepage() which also has the effect
that if a dirty page is truncated, it does not get freed by
block_invalidatepage() and is lingering in LRU list waiting for reclaim.
So a simple loop like:
while true; do
dd if=/dev/zero of=file bs=1M count=100
rm file
done
will keep using more and more memory until we hit low watermarks and
start pagecache reclaim which will eventually reclaim also the truncate
pages. Keeping these truncated (and thus never usable) pages in memory
is just a waste of memory, is unnecessarily stressing page cache
reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
prematurely.
So instead of just skipping dirty pages in xfs_vm_releasepage(), return
to old behavior of skipping them only if they have delalloc or unwritten
buffers and fix the spurious warnings by warning only if the page is
clean.
CC: stable@vger.kernel.org
CC: Brian Foster <bfoster@redhat.com>
CC: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Petr Tůma <petr.tuma@d3s.mff.cuni.cz>
Fixes: 99579ccec4
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The crash happens rather often when we reset some cluster nodes while
nodes contend fiercely to do truncate and append.
The crash backtrace is below:
dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
ocfs2: Beginning quota recovery on device (253,18) for slot 2
ocfs2: Finishing quota recovery on device (253,18) for slot 2
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
------------[ cut here ]------------
kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
invalid opcode: 0000 [#1] SMP
Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: No, Unsupported modules are loaded
CPU: 1 PID: 30154 Comm: truncate Tainted: G OE N 4.4.21-69-default #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
RIP: 0010:[<ffffffffa05c8c30>] [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
RSP: 0018:ffff880074e6bd50 EFLAGS: 00010282
RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
FS: 00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
Call Trace:
ocfs2_setattr+0x698/0xa90 [ocfs2]
notify_change+0x1ae/0x380
do_truncate+0x5e/0x90
do_sys_ftruncate.constprop.11+0x108/0x160
entry_SYSCALL_64_fastpath+0x12/0x6d
Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
RIP [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
It's because ocfs2_inode_lock() get us stale LVB in which the i_size is
not equal to the disk i_size. We mistakenly trust the LVB because the
underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with
DLM_SBF_VALNOTVALID properly for us. But, why?
The current code tries to downconvert lock without DLM_LKF_VALBLK flag
to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even
if the lock resource type needs LVB. This is not the right way for
fsdlm.
The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on
DLM_LKF_VALBLK to decide if we care about the LVB in the LKB. If
DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from
this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node
failure happens.
The following diagram briefly illustrates how this crash happens:
RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
The 1st round:
Node1 Node2
RSB1: PR
RSB1(master): NULL->EX
ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
ocfs2_dlm_lock(no DLM_LKF_VALBLK)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
dlm_lock(no DLM_LKF_VALBLK)
convert_lock(overwrite lkb->lkb_exflags
with no DLM_LKF_VALBLK)
RSB1: NULL RSB1: EX
reset Node2
dlm_recover_rsbs()
recover_lvb()
/* The LVB is not trustable if the node with EX fails and
* no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
*/
if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
return; * to invalid the LVB here.
*/
The 2nd round:
Node 1 Node2
RSB1(become master from recovery)
ocfs2_setattr()
ocfs2_inode_lock(NULL->EX)
/* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
ocfs2_truncate_file()
mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */
The fix is quite straightforward. We keep to set DLM_LKF_VALBLK flag
for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
is uesed.
Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently dax_mapping_entry_mkclean() fails to clean and write protect
the pmd_t of a DAX PMD entry during an *sync operation. This can result
in data loss in the following sequence:
1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
pmd_t dirty and writeable
2) fsync, flushing out PMD data and cleaning the radix tree entry. We
currently fail to mark the pmd_t as clean and write protected.
3) more mmap writes to the PMD. These don't cause any page faults since
the pmd_t is dirty and writeable. The radix tree entry remains clean.
4) fsync, which fails to flush the dirty PMD data because the radix tree
entry was clean.
5) crash - dirty data that should have been fsync'd as part of 4) could
still have been in the processor cache, and is lost.
Fix this by marking the pmd_t clean and write protected in
dax_mapping_entry_mkclean(), which is called as part of the fsync
operation 2). This will cause the writes in step 3) above to generate
page faults where we'll re-dirty the PMD radix tree entry, resulting in
flushes in the fsync that happens in step 4).
Fixes: 4b4bb46d00 ("dax: clear dirty entry tags on cache flush")
Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The code currently uses sdio->blkbits to compute the number of blocks to
be cleaned. However sdio->blkbits is derived from the logical block size
of the underlying block device (Refer to the definition of
do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
fail when executed on an ext4 filesystem with 64k as the block size and
when using a virtio based disk (having 512 byte as the logical block
size) inside a kvm guest.
This commit fixes the bug by using inode->i_blkbits to compute the
number of blocks to be cleaned.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixed up by Jeff Moyer to only use/evaluate inode->i_blkbits once,
to avoid issues with block size changes with IO in flight.
Signed-off-by: Jens Axboe <axboe@fb.com>
We were checking block number without checking partition.
sbi->s_partmaps[iloc->partitionReferenceNum] could lead to
bad memory access. See udf_nfs_get_inode() path for instance.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
Move all module attributes at the end of one file like other FS.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
udf_update_extent_cache() is only called from inode_bmap()
with 1 for next_epos
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
loc & 0x02 is empty since first git version in 2005 in
udf_add_extendedattr()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
Having struct kernel_long_ad laarr[EXTENT_MERGE_SIZE]
in all function arguments could be understood as by-value parameter.
Use kernel_long_ad pointer for functions depending on
inode_getblk()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
This change was missed the tmpfs modification in In CVE-2016-7097
commit 073931017b ("posix_acl: Clear SGID bit when setting
file permissions")
It can test by xfstest generic/375, which failed to clear
setgid bit in the following test case on tmpfs:
touch $testfile
chown 100:100 $testfile
chmod 2755 $testfile
_runas -u 100 -g 101 -- setfacl -m u::rwx,g::rwx,o::rwx $testfile
Signed-off-by: Gu Zheng <guzheng1@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In all but one case, the last two arguments are NULL and 0 resp.;
almost everyone just wants to switch nameidata to non-RCU mode.
The only exception is lookup_fast(), where we have a child dentry
we want to legitimize as well. Split these two cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add MS_KERNMOUNT to the flags that are passed.
Use sget_userns and force &init_user_ns instead of calling sget so that
even if called from a weird context the internal filesystem will be
considered to be in the intial user namespace.
Luis Ressel reported that the the failure to pass MS_KERNMOUNT into
mount_pseudo broke his in development graphics driver that uses the
generic drm infrastructure. I am not certain the deriver was bug
free in it's usage of that infrastructure but since
mount_pseudo_xattr can never be triggered by userspace it is clearer
and less error prone, and less problematic for the code to be explicit.
Reported-by: Luis Ressel <aranea@aixah.de>
Tested-by: Luis Ressel <aranea@aixah.de>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Protecting the mountpoint hashtable with namespace_sem was sufficient
until a call to umount_mnt was added to mntput_no_expire. At which
point it became possible for multiple calls of put_mountpoint on
the same hash chain to happen on the same time.
Kristen Johansen <kjlx@templeofstupid.com> reported:
> This can cause a panic when simultaneous callers of put_mountpoint
> attempt to free the same mountpoint. This occurs because some callers
> hold the mount_hash_lock, while others hold the namespace lock. Some
> even hold both.
>
> In this submitter's case, the panic manifested itself as a GP fault in
> put_mountpoint() when it called hlist_del() and attempted to dereference
> a m_hash.pprev that had been poisioned by another thread.
Al Viro observed that the simple fix is to switch from using the namespace_sem
to the mount_lock to protect the mountpoint hash table.
I have taken Al's suggested patch moved put_mountpoint in pivot_root
(instead of taking mount_lock an additional time), and have replaced
new_mountpoint with get_mountpoint a function that does the hash table
lookup and addition under the mount_lock. The introduction of get_mounptoint
ensures that only the mount_lock is needed to manipulate the mountpoint
hashtable.
d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
already set. This allows get_mountpoint to use the setting of
DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
happens exactly once.
Cc: stable@vger.kernel.org
Fixes: ce07d891a0 ("mnt: Honor MNT_LOCKED when detaching mounts")
Reported-by: Krister Johansen <kjlx@templeofstupid.com>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There are only two reasons for xfs_log_force / xfs_log_force_lsn to fail:
one is an I/O error, for which xlog_bdstrat already logs a warning, and
the second is an already shutdown log due to a previous I/O errors. In
the latter case we'll already have a previous indication for the actual
error, but the large stream of misleading warnings from xfs_log_force
will probably scroll it out of the message buffer.
Simply removing the warnings thus makes the XFS log reporting significantly
better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
->total is a bit of an odd parameter passed down to the low-level
allocator all the way from the high-level callers. It's supposed to
contain the maximum number of blocks to be allocated for the whole
transaction [1].
But in xfs_iomap_write_allocate we only convert existing delayed
allocations and thus only have a minimal block reservation for the
current transaction, so xfs_alloc_space_available can't use it for
the allocation decisions. Use the maximum of args->total and the
calculated block requirement to make a decision. We probably should
get rid of args->total eventually and instead apply ->minleft more
broadly, but that will require some extensive changes all over.
[1] which creates lots of confusion as most callers don't decrement it
once doing a first allocation. But that's for a separate series.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We must decide in xfs_alloc_fix_freelist if we can perform an
allocation from a given AG is possible or not based on the available
space, and should not fail the allocation past that point on a
healthy file system.
But currently we have two additional places that second-guess
xfs_alloc_fix_freelist: xfs_alloc_ag_vextent tries to adjust the
maxlen parameter to remove the reservation before doing the
allocation (but ignores the various minium freespace requirements),
and xfs_alloc_fix_minleft tries to fix up the allocated length
after we've found an extent, but ignores the reservations and also
doesn't take the AGFL into account (and thus fails allocations
for not matching minlen in some cases).
Remove all these later fixups and just correct the maxlen argument
inside xfs_alloc_fix_freelist once we have the AGF buffer locked.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We can't just set minleft to 0 when we're low on space - that's exactly
what we need minleft for: to protect space in the AG for btree block
allocations when we are low on free space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Setting aside 4 blocks globally for bmbt splits isn't all that useful,
as different threads can allocate space in parallel. Bump it to 4
blocks per AG to allow each thread that is currently doing an
allocation to dip into it separately. Without that we may no have
enough reserved blocks if there are enough parallel transactions
in an almost out space file system that all run into bmap btree
splits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWHNwyPSw1s6N8H32AQKoqw//Wi8fpY/7SlQ8UT0RcF4KlBtfKux4dhMh
c4P2ARqEi3hVHz0MAJSYwhJDiXmPT8FboXq7yQmXj7DpkwDUgEHJlOZyoZFrStWC
hE72lbwD/m57jYgTG694wJZnGvTtqBEEkoMMIiUTSpEkSxB8aGsL+8dP9E6Q5hBS
ixLUHINdjaubsu+uzlI3MZdDk7TWBwp5fNekf4Jbjlb9anoICEkJsjZJHTR9n3nM
d9QpEbh42+YHAn2EFL8gXN+Cb7o75QppT3K+b68Pz43yvPgMLd78Q4tSN0aCo190
9ynR1szpniiw3T/xW0dGanpRjKLs7HZubTujc1oQ+TD1Q1Uh+2/nZWb9PxWAAe3S
CW+ssn6slv9IS+KXyoIMbDtyPaJOu1pMxYcFVXlZOAPXnYGl8P0A610f8u9833jT
OEqVKQ/bHAPiiTl2X/ATzCePhATtoYUq7jIc71pP01WK+o054bzm0r9Wyjxgs7g6
iPi4cfueZFOJMilkE9ZWuIws43YDv5wIEOWtpTkRCIHKCmkeVXkDfdRnnXhJCUeF
6y3iW0staR/pnTqI6g8LEnGku2gbteBQNCueYoJA5jsxLyl6oJw1Bur7yGTzzPnJ
SP+9+RBlyGI5EzIcqQWsReOhGY4U/hOWDtltYR/gmlhlQ2o/iO4U1aiN0qa1AiaH
3ixixVygYOA=
=H/FD
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20170109' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
afs: Refcount afs_call struct
These patches provide some tracepoints for AFS and fix a potential leak by
adding refcounting to the afs_call struct.
The patches are:
(1) Add some tracepoints for logging incoming calls and monitoring
notifications from AF_RXRPC and data reception.
(2) Get rid of afs_wait_mode as it didn't turn out to be as useful as
initially expected. It can be brought back later if needed. This
clears some stuff out that I don't then need to fix up in (4).
(3) Allow listen(..., 0) to be used to disable listening. This makes
shutting down the AFS cache manager server in the kernel much easier
and the accounting simpler as we can then be sure that (a) all
preallocated afs_call structs are relesed and (b) no new incoming
calls are going to be started.
For the moment, listening cannot be reenabled.
(4) Add refcounting to the afs_call struct to fix a potential multiple
release detected by static checking and add a tracepoint to follow the
lifecycle of afs_call objects.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Processes can only alter their own security attributes via
/proc/pid/attr nodes. This is presently enforced by each individual
security module and is also imposed by the Linux credentials
implementation, which only allows a task to alter its own credentials.
Move the check enforcing this restriction from the individual
security modules to proc_pid_attr_write() before calling the security hook,
and drop the unnecessary task argument to the security hook since it can
only ever be the current task.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
A static checker warning occurs in the AFS filesystem:
fs/afs/cmservice.c:155 SRXAFSCB_CallBack()
error: dereferencing freed memory 'call'
due to the reply being sent before we access the server it points to. The
act of sending the reply causes the call to be freed if an error occurs
(but not if it doesn't).
On top of this, the lifetime handling of afs_call structs is fragile
because they get passed around through workqueues without any sort of
refcounting.
Deal with the issues by:
(1) Fix the maybe/maybe not nature of the reply sending functions with
regards to whether they release the call struct.
(2) Refcount the afs_call struct and sort out places that need to get/put
references.
(3) Pass a ref through the work queue and release (or pass on) that ref in
the work function. Care has to be taken because a work queue may
already own a ref to the call.
(4) Do the cleaning up in the put function only.
(5) Simplify module cleanup by always incrementing afs_outstanding_calls
whenever a call is allocated.
(6) Set the backlog to 0 with kernel_listen() at the beginning of the
process of closing the socket to prevent new incoming calls from
occurring and to remove the contribution of preallocated calls from
afs_outstanding_calls before we wait on it.
A tracepoint is also added to monitor the afs_call refcount and lifetime.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Fixes: 08e0e7c82e: "[AF_RXRPC]: Make the in-kernel AFS filesystem use AF_RXRPC."
The afs_wait_mode struct isn't really necessary. Client calls only use one
of a choice of two (synchronous or the asynchronous) and incoming calls
don't use the wait at all. Replace with a boolean parameter.
Signed-off-by: David Howells <dhowells@redhat.com>
'inode' is an important field for btrfs_get_extent, lets trace it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Enabling btrfs tracepoints leads to instant crash, as reported. The wq
callbacks could free the memory and the tracepoints started to
dereference the members to get to fs_info.
The proposed fix https://marc.info/?l=linux-btrfs&m=148172436722606&w=2
removed the tracepoints but we could preserve them by passing only the
required data in a safe way.
Fixes: bc074524e1 ("btrfs: prefix fsid to all trace events")
CC: stable@vger.kernel.org # 4.8+
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add three tracepoints to the AFS filesystem:
(1) The afs_recv_data tracepoint logs data segments that are extracted
from the data received from the peer through afs_extract_data().
(2) The afs_notify_call tracepoint logs notification from AF_RXRPC of data
coming in to an asynchronous call.
(3) The afs_cb_call tracepoint logs incoming calls that have had their
operation ID extracted and mapped into a supported cache manager
service call.
To make (3) work, the name strings in the afs_call_type struct objects have
to be annotated with __tracepoint_string. This is done with the CM_NAME()
macro.
Further, the AFS call state enum needs a name so that it can be used to
declare parameter types.
Signed-off-by: David Howells <dhowells@redhat.com>
Inside ext4_ext_shift_extents() function ext4_find_extent() is called
without EXT4_EX_NOCACHE flag, which should prevent cache population.
This leads to oudated offsets in the extents tree and wrong blocks
afterwards.
Patch fixes the problem providing EXT4_EX_NOCACHE flag for each
ext4_find_extents() call inside ext4_ext_shift_extents function.
Fixes: 331573febb
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: stable@vger.kernel.org
While doing 'insert range' start block should be also shifted right.
The bug can be easily reproduced by the following test:
ptr = malloc(4096);
assert(ptr);
fd = open("./ext4.file", O_CREAT | O_TRUNC | O_RDWR, 0600);
assert(fd >= 0);
rc = fallocate(fd, 0, 0, 8192);
assert(rc == 0);
for (i = 0; i < 2048; i++)
*((unsigned short *)ptr + i) = 0xbeef;
rc = pwrite(fd, ptr, 4096, 0);
assert(rc == 4096);
rc = pwrite(fd, ptr, 4096, 4096);
assert(rc == 4096);
for (block = 2; block < 1000; block++) {
rc = fallocate(fd, FALLOC_FL_INSERT_RANGE, 4096, 4096);
assert(rc == 0);
for (i = 0; i < 2048; i++)
*((unsigned short *)ptr + i) = block;
rc = pwrite(fd, ptr, 4096, 4096);
assert(rc == 4096);
}
Because start block is not included in the range the hole appears at
the wrong offset (just after the desired offset) and the following
pwrite() overwrites already existent block, keeping hole untouched.
Simple way to verify wrong behaviour is to check zeroed blocks after
the test:
$ hexdump ./ext4.file | grep '0000 0000'
The root cause of the bug is a wrong range (start, stop], where start
should be inclusive, i.e. [start, stop].
This patch fixes the problem by including start into the range. But
not to break left shift (range collapse) stop points to the beginning
of the a block, not to the end.
The other not obvious change is an iterator check on validness in a
main loop. Because iterator is unsigned the following corner case
should be considered with care: insert a block at 0 offset, when stop
variables overflows and never becomes less than start, which is 0.
To handle this special case iterator is set to NULL to indicate that
end of the loop is reached.
Fixes: 331573febb
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: stable@vger.kernel.org
There was an unnecessary amount of complexity around requesting the
filesystem-specific key prefix. It was unclear why; perhaps it was
envisioned that different instances of the same filesystem type could
use different key prefixes, or that key prefixes could be binary.
However, neither of those things were implemented or really make sense
at all. So simplify the code by making key_prefix a const char *.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
While we allow deletes without the key, the following should not be
permitted:
# cd /vdc/encrypted-dir-without-key
# ls -l
total 4
-rw-r--r-- 1 root root 0 Dec 27 22:35 6,LKNRJsp209FbXoSvJWzB
-rw-r--r-- 1 root root 286 Dec 27 22:35 uRJ5vJh9gE7vcomYMqTAyD
# mv uRJ5vJh9gE7vcomYMqTAyD 6,LKNRJsp209FbXoSvJWzB
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Before this patch, if a process called function gfs2_log_reserve to
reserve some journal blocks, but the journal not enough blocks were
free, it would call io_schedule. However, in the log flush daemon,
it woke up the waiters only if an gfs2_ail_flush was no longer
required. This resulted in situations where processes would wait
forever because the number of blocks required was so high that it
pushed the journal into a perpetual state of flush being required.
This patch changes the logd daemon so that it wakes up io waiters
every time the log is actually flushed.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Make afs_readpages() use afs_vnode_fetch_data()'s new ability to take a
list of pages and do a bulk fetch.
Signed-off-by: David Howells <dhowells@redhat.com>
Make afs_fs_fetch_data() take a list of pages for bulk data transfer. This
will allow afs_readpages() to be made more efficient.
Signed-off-by: David Howells <dhowells@redhat.com>
Pull audit fixes from Paul Moore:
"Two small fixes relating to audit's use of fsnotify.
The first patch plugs a leak and the second fixes some lock
shenanigans. The patches are small and I banged on this for an
afternoon with our testsuite and didn't see anything odd"
* 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
audit: Fix sleep in atomic
fsnotify: Remove fsnotify_duplicate_mark()
Before this patch, the logd daemon only tried to flush things when
the log blocks pinned exceeded a certain threshold. But when we're
deleting very large files, it may require a huge number of journal
blocks, and that, in turn, may exceed the threshold. This patch
factors that into account.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch limits the number of transaction blocks requested during
file truncates. If we have very large multi-terabyte files, and want
to delete or truncate them, they might span so many resource groups
that we overflow the journal blocks, and cause an assert failure.
By limiting the number of blocks in the transaction, we prevent this
overflow and give other running processes time to do transactions.
The limiting factor I chose is sd_log_thresh2 which is currently
set to 4/5ths of the journal. This same ratio is used in function
gfs2_ail_flush_reqd to determine when a log flush is required.
If we make the maximum value less than this, we can get into a
infinite hang whereby the log stops moving because the number of
used blocks is less than the threshold and the iterative loop
needs more, but since we're under the threshold, the log daemon
never starts any IO on the log.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
UDF encodes symlinks in a more complex fashion and thus i_size of a
symlink does not match the lenght of a string returned by readlink(2).
This confuses some applications (see bug 191241) and may be considered a
violation of POSIX. Fix the problem by reading the link into page cache
in response to stat(2) call and report the length of the decoded path.
Signed-off-by: Jan Kara <jack@suse.cz>
- Fixes for crashes and double-cleanup errors
- XFS maintainership handover
- Fix to prevent absurdly large block reservations
- Fix broken sysfs getter/setters
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJYbH10AAoJEPh/dxk0SrTrozkQAIo1ikGMKI0x52izCJyKA+HP
gUfqqFnFDOsz0pWBDhRMBBGXbQlgU6uPRMicTdkNSRq3BnKPyfC7EK8WkXlOGT8A
JGQhz96sfr4gShuo4lul2nhgfThOyL7M3Vu+xHL7wgrrOV1Y4Haz2m2FKYzNentj
2ca2WeNCcuEQLdWwtwkeOnsjnC2gV9cA5pRsx59rktr/t6fU1Q+AcBSjpwDX43op
cQ0uTqiBB51pWe2tbw+VHSsYzyakkjsNsiYCZNOghN+p/5g4QpgT7LgiO9r5CVvu
UhAXv6285K4pVsD7cIy2yHvSFbYNz1khM1Tvv26npg72NoQxMXhxAaPkke03u2lT
4nakgqLpxybrrnviVsJy2VbnmVYN5mcXSV2XTmRdAtxZmrW9C6H2PCs9pQYSV1Ji
H6UT7kqusqp4ceQUBb5dtFCaUndNGnLxKDrmNOz/It7PPpI0zSRMiv+IC+qrQ/ci
oEExMUtRLLah5zXOXQej2IchZmP2SBXm/a2JhmQywTIyLiSPoDwKsNET9zT4CiXw
CHQj3spGh8uE7rh0QbujWjD7odE1IWDvOZ9Zh5vXVrwqTIeVOlxjfBqO/9J0z2Ak
z7WpRrQ2IhWMwD6pHg0oDUD9oB02LxnW24telQs35wgqgdZm5TfG6BaiKutzA7KH
nVlR22SQ4m+eTD8+IFMO
=TgC5
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-4.10-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- fixes for crashes and double-cleanup errors
- XFS maintainership handover
- fix to prevent absurdly large block reservations
- fix broken sysfs getter/setters
* tag 'xfs-for-linus-4.10-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix max_retries _show and _store functions
xfs: update MAINTAINERS
xfs: fix crash and data corruption due to removal of busy COW extents
xfs: use the actual AG length when reserving blocks
xfs: fix double-cleanup when CUI recovery fails
Pull block layer fixes from Jens Axboe:
"A set of fixes for the current series, one fixing a regression with
block size < page cache size in the alias series from Jan. Outside of
that, two small cleanups for wbt from Bart, a nvme pull request from
Christoph, and a few small fixes of documentation updates"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix up io_poll documentation
block: Avoid that sparse complains about context imbalance in __wbt_wait()
block: Make wbt_wait() definition consistent with declaration
clean_bdev_aliases: Prevent cleaning blocks that are not in block range
genhd: remove dead and duplicated scsi code
block: add back plugging in __blkdev_direct_IO
nvmet/fcloop: remove some logically dead code performing redundant ret checks
nvmet: fix KATO offset in Set Features
nvme/fc: simplify error handling of nvme_fc_create_hw_io_queues
nvme/fc: correct some printk information
nvme/scsi: Remove START STOP emulation
nvme/pci: Delete misleading queue-wrap comment
nvme/pci: Fix whitespace problem
nvme: simplify stripe quirk
nvme: update maintainers information
max_retries _show and _store functions should test against cfg->max_retries,
not cfg->retry_timeout
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is a race window between write_cache_pages calling
clear_page_dirty_for_io and XFS calling set_page_writeback, in which
the mapping for an inode is tagged neither as dirty, nor as writeback.
If the COW shrinker hits in exactly that window we'll remove the delayed
COW extents and writepages trying to write it back, which in release
kernels will manifest as corruption of the bmap btree, and in debug
kernels will trip the ASSERT about now calling xfs_bmapi_write with the
COWFORK flag for holes. A complex customer load manages to hit this
window fairly reliably, probably by always having COW writeback in flight
while the cow shrinker runs.
This patch adds another check for having the I_DIRTY_PAGES flag set,
which is still set during this race window. While this fixes the problem
I'm still not overly happy about the way the COW shrinker works as it
still seems a bit fragile.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We need to use the actual AG length when making per-AG reservations,
since we could otherwise end up reserving more blocks out of the last
AG than there are actual blocks.
Complained-about-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dan Carpenter reported a double-free of rcur if _defer_finish fails
while we're recovering CUI items. Fix the error recovery to prevent
this.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently how btrfs dio deals with split dio write is not good
enough if dio write is split into several segments due to the
lack of contiguous space, a large dio write like 'dd bs=1G count=1'
can end up with incorrect outstanding_extents counter and endio
would complain loudly with an assertion.
This fixes the problem by compensating the outstanding_extents
counter in inode if a large dio write gets split.
Reported-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While checking INODE_REF/INODE_EXTREF for a corner case, we may acquire a
different inode's log_mutex with holding the current inode's log_mutex, and
lockdep has complained this with a possilble deadlock warning.
Fix this by using mutex_lock_nested() when processing the other inode's
log_mutex.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If @block_group is not @used_bg, it'll try to get @used_bg's lock without
droping @block_group 's lock and lockdep has throwed a scary deadlock warning
about it.
Fix it by using down_read_nested.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In __btrfs_run_delayed_refs, when we put back a delayed ref that's too
new, we have already dropped the lock on locked_ref when we set
->processing = 0.
This patch keeps the lock to cover that assignment.
Fixes: d7df2c796d (Btrfs: attach delayed ref updates to delayed ref heads)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In __btrfs_run_delayed_refs, the error path when run_delayed_extent_op
fails sets locked_ref->processing = 0 but doesn't re-increment
delayed_refs->num_heads_ready. As a result, we end up triggering
the WARN_ON in btrfs_select_ref_head.
Fixes: d7df2c796d (Btrfs: attach delayed ref updates to delayed ref heads)
Reported-by: Jon Nelson <jnelson-suse@jamponi.net>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
CURRENT_TIME is not y2038 safe.
CURRENT_TIME macro is also not appropriate for filesystems
as it doesn't use the right granularity for filesystem
timestamps.
Logical Volume Integrity format is described to have the
same timestamp format for "Recording Date and time" as
the other [a,c,m]timestamps.
The function udf_time_to_disk_format() does this conversion.
Hence the timestamp is passed directly to the function and
not truncated. This is as per Arnd's suggestion on the
thread.
This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jan Kara <jack@suse.cz>
crypto tree during the merge window.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlhrCIIACgkQ8vlZVpUN
gaP0rAf8DehnxAXTdGwCDKJ76Xgkd4C0vYwNYsWrwbEsD6dMXPmfhDVA40ZefFWY
4UQaPeoDSXQnIxw+6gi6LFCJeYs+dc9ZWHk++w5kEMclIUONomODDAQLMJbpG+5t
pkEwOzjTaKbIQ5n4r3rMJtlBlrZX+ZVJmMt3sYAMWhIq7Bf7dRy6AC7+vyM5VTce
AYvFpureLd7pJT0AcNvg5oPnXIFiPlKi6knlmAdJ32I4FQQO07aDA37mLPKdff4/
uKs4PGKTa9MCGw+blMDJ/208kBQPPn8JZ7yGQCdGw16CUaoSXLregqu6SNs2MKaQ
WjmBFyEUssScTeAq8rYJVlU7FYxwdQ==
=VVfb
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypt fixes from Ted Ts'o:
"Two fscrypt bug fixes, one of which was unmasked by an update to the
crypto tree during the merge window"
* tag 'fscrypt-for-stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
fscrypt: fix renaming and linking special files
fscrypt: fix the test_dummy_encryption mount option
Currently, the test_dummy_encryption ext4 mount option, which exists
only to test encrypted I/O paths with xfstests, overrides all
per-inode encryption keys with a fixed key.
This change minimizes test_dummy_encryption-specific code path changes
by supplying a fake context for directories which are not encrypted
for use when creating new directories, files, or symlinks. This
allows us to properly exercise the keyring lookup, derivation, and
context inheritance code paths.
Before mounting a file system using test_dummy_encryption, userspace
must execute the following shell commands:
mode='\x00\x00\x00\x00'
raw="$(printf ""\\\\x%02x"" $(seq 0 63))"
if lscpu | grep "Byte Order" | grep -q Little ; then
size='\x40\x00\x00\x00'
else
size='\x00\x00\x00\x40'
fi
key="${mode}${raw}${size}"
keyctl new_session
echo -n -e "${key}" | keyctl padd logon fscrypt:4242424242424242 @s
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The first block to be cleaned may start at a non-zero page offset. In
such a scenario clean_bdev_aliases() will end up cleaning blocks that
do not fall in the range of blocks to be cleaned. This commit fixes the
issue by skipping blocks that do not fall in valid block range.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
That way we can get rid of the direct dependency on CONFIG_BLOCK.
Fixes: d475a50745 ("ubifs: Add skeleton for fscrypto")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
It was possible for the ->get_context() operation to fail with a
specific error code, which was then not returned to the caller of
FS_IOC_SET_ENCRYPTION_POLICY or FS_IOC_GET_ENCRYPTION_POLICY. Make sure
to pass through these error codes. Also reorganize the code so that
->get_context() only needs to be called one time when setting an
encryption policy, and handle contexts of unrecognized sizes more
appropriately.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Several warning messages were not rate limited and were user-triggerable
from FS_IOC_SET_ENCRYPTION_POLICY. These shouldn't really have been
there in the first place, but either way they aren't as useful now that
the error codes have been improved. So just remove them.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As part of an effort to clean up fscrypt-related error codes, make
FS_IOC_SET_ENCRYPTION_POLICY fail with EEXIST when the file already uses
a different encryption policy. This is more descriptive than EINVAL,
which was ambiguous with some of the other error cases.
I am not aware of any users who might be relying on the previous error
code of EINVAL, which was never documented anywhere.
This failure case will be exercised by an xfstest.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As part of an effort to clean up fscrypt-related error codes, make
FS_IOC_SET_ENCRYPTION_POLICY fail with ENOTDIR when the file descriptor
does not refer to a directory. This is more descriptive than EINVAL,
which was ambiguous with some of the other error cases.
I am not aware of any users who might be relying on the previous error
code of EINVAL, which was never documented anywhere, and in some buggy
kernels did not exist at all as the S_ISDIR() check was missing.
This failure case will be exercised by an xfstest.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As part of an effort to clean up fscrypt-related error codes, make
attempting to create a file in an encrypted directory that hasn't been
"unlocked" fail with ENOKEY. Previously, several error codes were used
for this case, including ENOENT, EACCES, and EPERM, and they were not
consistent between and within filesystems. ENOKEY is a better choice
because it expresses that the failure is due to lacking the encryption
key. It also matches the error code returned when trying to open an
encrypted regular file without the key.
I am not aware of any users who might be relying on the previous
inconsistent error codes, which were never documented anywhere.
This failure case will be exercised by an xfstest.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Attempting to link a device node, named pipe, or socket file into an
encrypted directory through rename(2) or link(2) always failed with
EPERM. This happened because fscrypt_has_permitted_context() saw that
the file was unencrypted and forbid creating the link. This behavior
was unexpected because such files are never encrypted; only regular
files, directories, and symlinks can be encrypted.
To fix this, make fscrypt_has_permitted_context() always return true on
special files.
This will be covered by a test in my encryption xfstests patchset.
Fixes: 9bd8212f98 ("ext4 crypto: add encryption policy and password salt support")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit f1c131b454: "crypto: xts - Convert to skcipher" now fails
the setkey operation if the AES key is the same as the tweak key.
Previously this check was only done if FIPS mode is enabled. Now this
check is also done if weak key checking was requested. This is
reasonable, but since we were using the dummy key which was a constant
series of 0x42 bytes, it now caused dummy encrpyption test mode to
fail.
Fix this by using 0x42... and 0x24... for the two keys, so they are
different.
Fixes: f1c131b454
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add ->open/release() methods to kernfs_ops. ->open() is called when
the file is opened and ->release() when the file is either released or
severed. These callbacks can be used, for example, to manage
persistent caching objects over multiple seq_file iterations.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
More kernfs_open_file->mutex synchronized flags are planned to be
added. Convert ->mmapped to a bitfield in preparation.
While at it, make kernfs_fop_mmap() use "true" instead of "1" on
->mmapped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
Now that dax_iomap_fault() calls ->iomap_begin() without entry lock, we
can use transaction starting in ext4_iomap_begin() and thus simplify
ext4_dax_fault(). It also provides us proper retries in case of ENOSPC.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Currently ->iomap_begin() handler is called with entry lock held. If the
filesystem held any locks between ->iomap_begin() and ->iomap_end()
(such as ext4 which will want to hold transaction open), this would cause
lock inversion with the iomap_apply() from standard IO path which first
calls ->iomap_begin() and only then calls ->actor() callback which grabs
entry locks for DAX (if it faults when copying from/to user provided
buffers).
Fix the problem by nesting grabbing of entry lock inside ->iomap_begin()
- ->iomap_end() pair.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The only case when we do not finish the page fault completely is when we
are loading hole pages into a radix tree. Avoid this special case and
finish the fault in that case as well inside the DAX fault handler. It
will allow us for easier iomap handling.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Currently dax_iomap_rw() takes care of invalidating page tables and
evicting hole pages from the radix tree when write(2) to the file
happens. This invalidation is only necessary when there is some block
allocation resulting from write(2). Furthermore in current place the
invalidation is racy wrt page fault instantiating a hole page just after
we have invalidated it.
So perform the page invalidation inside dax_iomap_actor() where we can
do it only when really necessary and after blocks have been allocated so
nobody will be instantiating new hole pages anymore.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
just delete all exceptional radix tree entries they find. For DAX this
is not desirable as we track cache dirtiness in these entries and when
they are evicted, we may not flush caches although it is necessary. This
can for example manifest when we write to the same block both via mmap
and via write(2) (to different offsets) and fsync(2) then does not
properly flush CPU caches when modification via write(2) was the last
one.
Create appropriate DAX functions to handle invalidation of DAX entries
for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
wire them up into the corresponding mm functions.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
So far we did not return BH_New buffers from ext2_get_blocks() when we
allocated and zeroed-out a block for DAX inode to avoid racy zeroing in
DAX code. This zeroing is gone these days so we can remove the
workaround.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
No point in going through loops and hoops instead of just comparing the
values.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.
Get rid of the union and just keep ktime_t as simple typedef of type s64.
The conversion was done with coccinelle and some manual mopping up.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull cifs fixes from Steve French:
"This ncludes various cifs/smb3 bug fixes, mostly for stable as well.
In the next week I expect that Germano will have some reconnection
fixes, and also I expect to have the remaining pieces of the snapshot
enablement and SMB3 ACLs, but wanted to get this set of bug fixes in"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs_get_root shouldn't use path with tree name
Fix default behaviour for empty domains and add domainauto option
cifs: use %16phN for formatting md5 sum
cifs: Fix smbencrypt() to stop pointing a scatterlist at the stack
CIFS: Fix a possible double locking of mutex during reconnect
CIFS: Fix a possible memory corruption during reconnect
CIFS: Fix a possible memory corruption in push locks
CIFS: Fix missing nls unload in smb2_reconnect()
CIFS: Decrease verbosity of ioctl call
SMB3: parsing for new snapshot timestamp mount parm
There are only two calls sites of fsnotify_duplicate_mark(). Those are
in kernel/audit_tree.c and both are bogus. Vfsmount pointer is unused
for audit tree, inode pointer and group gets set in
fsnotify_add_mark_locked() later anyway, mask and free_mark are already
set in alloc_chunk(). In fact, calling fsnotify_duplicate_mark() is
actively harmful because following fsnotify_add_mark_locked() will leak
group reference by overwriting the group pointer. So just remove the two
calls to fsnotify_duplicate_mark() and the function.
Signed-off-by: Jan Kara <jack@suse.cz>
[PM: line wrapping to fit in 80 chars]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Pull final vfs updates from Al Viro:
"Assorted cleanups and fixes all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
sg_write()/bsg_write() is not fit to be called under KERNEL_DS
ufs: fix function declaration for ufs_truncate_blocks
fs: exec: apply CLOEXEC before changing dumpable task flags
seq_file: reset iterator to first record for zero offset
vfs: fix isize/pos/len checks for reflink & dedupe
[iov_iter] fix iterate_all_kinds() on empty iterators
move aio compat to fs/aio.c
reorganize do_make_slave()
clone_private_mount() doesn't need to touch namespace_sem
remove a bogus claim about namespace_sem being held by callers of mnt_alloc_id()
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJYW7vZAAoJEGu/nxmHO1GNeGUIAJil3Q4ZaeOaaj5uNs4h64kc
0BAfGSwzGNgreX5PWm+jQVeh6xbAqXnYtsWIDSibpxnXOhAZcXHbpzKLTwlMl4rh
qpXAAWhHcBsOKiNcg++RRmouubYtpgMoOKCgo/DzGp51mSV7/8K2mugzDRohPUsR
jUDqUa9qvt65uqI5xCuK1n3aLtCQ9m3RUzDfQbH4fK/yBXpNIE83xegU1SBJKZHj
uGPJpjHhc1vaba6Y8vDDBHuJR9IJxfeSnoJE0xMmGlIub40exw7P4Dek1Tc/3G+R
qiqT9aGAbegkFDerps5sqOLbU4Lm4Js8Ov78l3IN1FSVdYWsptzRibjIbUidPdc=
=zypk
-----END PGP SIGNATURE-----
Merge tag 'befs-v4.10-rc1' of git://github.com/luisbg/linux-befs
Pull befs updates from Luis de Bethencourt:
"A series of small fixes and adding NFS export support"
* tag 'befs-v4.10-rc1' of git://github.com/luisbg/linux-befs:
befs: add NFS export support
befs: remove trailing whitespaces
befs: remove signatures from comments
befs: fix style issues in header files
befs: fix style issues in linuxvfs.c
befs: fix typos in linuxvfs.c
befs: fix style issues in io.c
befs: fix style issues in inode.c
befs: fix style issues in debug.c
sparse says:
fs/ufs/inode.c:1195:6: warning: symbol 'ufs_truncate_blocks' was not declared. Should it be static?
Note that the forward declaration in the file is already marked static.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If you have a process that has set itself to be non-dumpable, and it
then undergoes exec(2), any CLOEXEC file descriptors it has open are
"exposed" during a race window between the dumpable flags of the process
being reset for exec(2) and CLOEXEC being applied to the file
descriptors. This can be exploited by a process by attempting to access
/proc/<pid>/fd/... during this window, without requiring CAP_SYS_PTRACE.
The race in question is after set_dumpable has been (for get_link,
though the trace is basically the same for readlink):
[vfs]
-> proc_pid_link_inode_operations.get_link
-> proc_pid_get_link
-> proc_fd_access_allowed
-> ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
Which will return 0, during the race window and CLOEXEC file descriptors
will still be open during this window because do_close_on_exec has not
been called yet. As a result, the ordering of these calls should be
reversed to avoid this race window.
This is of particular concern to container runtimes, where joining a
PID namespace with file descriptors referring to the host filesystem
can result in security issues (since PRCTL_SET_DUMPABLE doesn't protect
against access of CLOEXEC file descriptors -- file descriptors which may
reference filesystem objects the container shouldn't have access to).
Cc: dev@opencontainers.org
Cc: <stable@vger.kernel.org> # v3.2+
Reported-by: Michael Crosby <crosbymichael@gmail.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If kernfs file is empty on a first read, successive read operations
using the same file descriptor will return no data, even when data is
available. Default kernfs 'seq_next' implementation advances iterator
position even when next object is not there. Kernfs 'seq_start' for
following requests will not return iterator as position is already on
the second object.
This defect doesn't allow to monitor badblocks sysfs files from MD raid.
They are initially empty but if data appears at some stage, userspace is
not able to read it.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Strengthen the checking of pos/len vs. i_size, clarify the return values
for the clone prep function, and remove pointless code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and fix the minor buglet in compat io_submit() - native one
kills ioctx as cleanup when put_user() fails. Get rid of
bogus compat_... in !CONFIG_AIO case, while we are at it - they
should simply fail with ENOSYS, same as for native counterparts.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This allows sending larger than 1 MB requests to devices that support
large I/O sizes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Removing all trailing whitespaces in befs.
I was skeptic about tainting the history with this, but whitespace changes
can be ignored by using 'git blame -w' and 'git log -w'.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
No idea why some comments have signatures. These predate git. Removing them
since they add noise and no information.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Fixing checkpatch.pl issues in befs header files:
WARNING: Missing a blank line after declarations
+ befs_inode_addr iaddr;
+ iaddr.allocation_group = blockno >> BEFS_SB(sb)->ag_shift;
WARNING: space prohibited between function name and open parenthesis '('
+ return BEFS_SB(sb)->block_size / sizeof (befs_disk_inode_addr);
ERROR: "foo * bar" should be "foo *bar"
+ const char *key, befs_off_t * value);
ERROR: Macros with complex values should be enclosed in parentheses
+#define PACKED __attribute__ ((__packed__))
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Fix the following type of checkpatch.pl issues:
WARNING: line over 80 characters
+static struct dentry *befs_lookup(struct inode *, struct dentry *, unsigned int);
ERROR: code indent should use tabs where possible
+ if (!bi)$
WARNING: please, no spaces at the start of a line
+ if (!bi)$
WARNING: labels should not be indented
+ unacquire_bh:
WARNING: space prohibited between function name and open parenthesis '('
+ sizeof (struct befs_inode_info),
WARNING: braces {} are not necessary for single statement blocks
+ if (!*out) {
+ return -ENOMEM;
+ }
WARNING: Block comments use a trailing */ on a separate line
+ * in special cases */
WARNING: Missing a blank line after declarations
+ int token;
+ if (!*p)
ERROR: do not use assignment in if condition
+ if (!(bh = sb_bread(sb, sb_block))) {
ERROR: space prohibited after that open parenthesis '('
+ if( befs_sb->num_blocks > ~((sector_t)0) ) {
ERROR: space prohibited before that close parenthesis ')'
+ if( befs_sb->num_blocks > ~((sector_t)0) ) {
ERROR: space required before the open parenthesis '('
+ if( befs_sb->num_blocks > ~((sector_t)0) ) {
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Fixing the two following checkpatch.pl issues:
ERROR: trailing whitespace
+ * Based on portions of file.c and inode.c $
WARNING: labels should not be indented
+ error:
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Fixing the following checkpatch.pl errors and warning:
ERROR: trailing whitespace
+ * $
WARNING: Block comments use * on subsequent lines
+/*
+ Validates the correctness of the befs inode
ERROR: "foo * bar" should be "foo *bar"
+befs_check_inode(struct super_block *sb, befs_inode * raw_inode,
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Commit 8924feff66 ("splice: lift pipe_lock out of splice_to_pipe()")
caused a regression when there were no more readers left on a pipe that
was being spliced into: rather than the expected SIGPIPE and -EPIPE
return value, the writer would end up waiting forever for space to free
up (which obviously was not going to happen with no readers around).
Fixes: 8924feff66 ("splice: lift pipe_lock out of splice_to_pipe()")
Reported-and-tested-by: Andreas Schwab <schwab@linux-m68k.org>
Debugged-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org # v4.9
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Highlights include:
- Further attribute cache improvements to make revalidation more fine grained
- NFSv4 locking improvements
Bugfixes:
- nfs4_fl_prepare_ds must be careful about reporting success in files layout
- pNFS/flexfiles: Instead of marking a device inactive, remove it from the cache
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYWoWvAAoJEGcL54qWCgDydigQAMFTUZTjHGx34l3yqie3uV0D
P0Ws6qfQJwN6xwgbhLMt6pbECadN7d55KszIM71gu/Iqa1UbtHZht2Slf/Tw5wY8
vjptzLkvOie5g9sGIhhQqkgrXnG3A96aEuaknIj9JXWALJb01M94ZfC0STgUjmil
sp100cDmmOonqflk7Tgh/jmfiUucBw9HnyNRY59sUaR0lAmHRrpJuqftmarYKRrq
1rTCeGx6MyM2kVyqGsY7FP1prQcNcIJXwti/EXrOywXbHZxmjl+l0WLu6iUOEj/I
zvfWCeMFyMXeVsTrRLYeM1vQ/maUbSev6gms2RshbmDITE6Ir9nHfBTCfOmh95wV
paKnAzuEKsUR+LhEbsORk03efQYkeDQuJW3QqhdVuqGnEwhaboTiP4GRJLjEo287
dHy6sEGEbAcieUsolXhDIaxggoF39gGihOT3+m+WgIlXRd38W2mvlQC9oaq89ehw
Ipetm3swnkY8bTavsNYFW1x+V5zrGH9Dlu2/mpnqVAIwGT+ByolD8tYDAHf4zj3b
cy9XhFzdLDRy4V95l4hzZFDCdjy4dZZu7cJEO5zqUB3236qhiZbGLjL232ZEHFi3
wNHn13nADV+DzoOuB5HJwfhKyOOBQsQpBIvzAgt1n7muZwxa35nX/2waX1xt2HX5
J+s+e02xJnMStcr/2O5v
=8gLp
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull more NFS client updates from Trond Myklebust:
"Highlights include:
- further attribute cache improvements to make revalidation more fine
grained
- NFSv4 locking improvements
Bugfixes:
- nfs4_fl_prepare_ds must be careful about reporting success in files
layout
- pNFS/flexfiles: Instead of marking a device inactive, remove it
from the cache"
* tag 'nfs-for-4.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Retry the DELEGRETURN if the embedded GETATTR is rejected with EACCES
NFS: Retry the CLOSE if the embedded GETATTR is rejected with EACCES
NFSv4: Place the GETATTR operation before the CLOSE
NFSv4: Also ask for attributes when downgrading to a READ-only state
NFS: Don't abuse NFS_INO_REVAL_FORCED in nfs_post_op_update_inode_locked()
pNFS: Return RW layouts on OPEN_DOWNGRADE
NFSv4: Add encode/decode of the layoutreturn op in OPEN_DOWNGRADE
NFS: Don't disconnect open-owner on NFS4ERR_BAD_SEQID
NFSv4: ensure __nfs4_find_lock_state returns consistent result.
NFSv4.1: nfs4_fl_prepare_ds must be careful about reporting success.
pNFS/flexfiles: delete deviceid, don't mark inactive
NFS: Clean up nfs_attribute_timeout()
NFS: Remove unused function nfs_revalidate_inode_rcu()
NFS: Fix and clean up the access cache validity checking
NFS: Only look at the change attribute cache state in nfs_weak_revalidate()
NFS: Clean up cache validity checking
NFS: Don't revalidate the file on close if we hold a delegation
NFSv4: Don't discard the attributes returned by asynchronous DELEGRETURN
NFSv4: Update the attribute cache info in update_changeattr
If our DELEGRETURN RPC call is rejected with an EACCES call, then we should
remove the GETATTR call from the compound RPC and retry.
This could potentially happen when there is a conflict between an
ACL denying attribute reads and our use of SP4_MACH_CRED.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If our CLOSE RPC call is rejected with an EACCES call, then we should
remove the GETATTR call from the compound RPC and retry.
This could potentially happen when there is a conflict between an
ACL denying attribute reads and our use of SP4_MACH_CRED.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In order to benefit from the DENY share lock protection, we should
put the GETATTR operation before the CLOSE. Otherwise, we might race
with a Windows machine that thinks it is now safe to modify the file.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we're downgrading from a READ+WRITE mode to a READ-only mode, then
ask for cache consistency attributes so that we avoid the revalidation
in nfs_close_context()
Fixes: 3947b74d0f ("NFSv4: Don't request a GETATTR on open_downgrade.")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The NFS_INO_REVAL_FORCED flag now really only has meaning for the
case when we've just been handed a delegation for a file that was already
cached, and we're unsure about that cache.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the client holds no more writeable open state, and does not hold a
write delegation, then send a layoutreturn as part of the OPEN_DOWNGRADE.
We do this only for writes, since some layout drivers may require you to
also hold a read layout if you are doing a R/W workload.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
While we do not need to return the RW layout when downgrading from a
read/write open state to read-only, we might want to do so in order
to reduce the burden on the metadataserver so that it does not need
to check for changed data when responding to GETATTR requests.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When an NFS4ERR_BAD_SEQID is received the open-owner is removed from
the ->state_owners rbtree so that it will no longer be used.
If any stateids attached to this open-owner are still in use, and if a
request using one gets an NFS4ERR_BAD_STATEID reply, this can for bad.
The state is marked as needing recovery and the nfs4_state_manager()
is scheduled to clean up. nfs4_state_manager() finds states to be
recovered by walking the state_owners rbtree. As the open-owner is
not in the rbtree, the bad state is not found so nfs4_state_manager()
completes having done nothing. The request is then retried, with a
predicatable result (indefinite retries).
If the stateid is for a delegation, this open_owner will be used
to open files when the delegation is returned. For that to work,
a new open-owner needs to be presented to the server.
This patch changes NFS4ERR_BAD_SEQID handling to leave the open-owner
in the rbtree but updates the 'create_time' so it looks like a new
open-owner. With this the indefinite retries no longer happen.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If a file has both flock locks and OFD locks, then it is possible that
two different nfs4 lock states could apply to file accesses from a
single process.
It is not possible to know, efficiently, which one is "correct".
Presumably the state which represents a lock that covers the region
undergoing IO would be the "correct" one to use, but finding that has
a non-trivial cost and would provide miniscule value.
Currently we just return whichever is first in the list, which could
result in inconsistent behaviour if an application ever put it self in
this position. As consistent behaviour is preferable (when perfectly
correct behaviour is not available), change the search to return a
consistent result in this circumstance.
Specifically: if there is both a flock and OFD lock state, always return
the flock one.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Various places assume that if nfs4_fl_prepare_ds() turns a non-NULL 'ds',
then ds->ds_clp will also be non-NULL.
This is not necessasrily true in the case when the process received a fatal signal
while nfs4_pnfs_ds_connect is waiting in nfs4_wait_ds_connect().
In that case ->ds_clp may not be set, and the devid may not recently have been marked
unavailable.
So add a test for ds_clp == NULL and return NULL in that case.
Fixes: c23266d532 ("NFS4.1 Fix data server connection race")
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Olga Kornievskaia <aglo@umich.edu>
Acked-by: Adamson, Andy <William.Adamson@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead of marking a device inactive, remove it from the cache entirely.
Flexfiles has a way to report errors back to the server, so we don't want
to stop devices from being tried again for 120 seconds.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The access cache needs to check whether or not the mode bits, ownership,
or ACL has changed or the cache has timed out.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Just like in nfs_check_verifier(), we want to use
nfs_mapping_need_revalidate_inode() to check our knowledge of the
change attribute is up to date.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Consolidate the open-coded checking of NFS_I(inode)->cache_validity
into a couple of helper functions.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we're holding a delegation, we can skip sending the close-to-open
GETATTR until we're returning that delegation.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
DELEGRETURN will always carry a reference to the inode except when
the latter is being freed, so let's ensure that we always use that
inode information to ensure close-to-open cache consistency, even
when the DELEGRETURN call is asynchronous.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we successfully updated the change attribute, we should timestamp the
cache. While we do know that the other attributes are not completely up
to date, we have the NFS_INO_INVALID_ATTR flag that let us know that,
so it is valid to say that the cache has not timed out.
We can also clear NFS_INO_REVAL_PAGECACHE, since our change attribute
is now known to be valid.
Conversely, if the change attribute did not match, we should make sure to
also revalidate the access and ACL caches.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In function btrfs_uuid_tree_iterate(), errno is assigned to variable ret
on errors. However, it directly returns 0. It may be better to return
ret. This patch also removes the warning, because the caller already
prints a warning.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188731
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
[ edited subject ]
Signed-off-by: David Sterba <dsterba@suse.com>
Pull quota, fsnotify and ext2 updates from Jan Kara:
"Changes to locking of some quota operations from dedicated quota mutex
to s_umount semaphore, a fsnotify fix and a simple ext2 fix"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Fix bogus warning in dquot_disable()
fsnotify: Fix possible use-after-free in inode iteration on umount
ext2: reject inodes with negative size
quota: Remove dqonoff_mutex
ocfs2: Use s_umount for quota recovery protection
quota: Remove dqonoff_mutex from dquot_scan_active()
ocfs2: Protect periodic quota syncing with s_umount semaphore
quota: Use s_umount protection for quota operations
quota: Hold s_umount in exclusive mode when enabling / disabling quotas
fs: Provide function to get superblock with exclusive s_umount
dquot_disable() was warning when sb_has_quota_loaded() was true when
invalidating page cache for quota files. The thinking behind this
warning was that we must have raced with somebody else turning quotas on
and this should not happen because all places modifying quota state must
hold s_umount exclusively now. However sb_has_quota_loaded() can be also
true at this point when we are just suspending quotas on remount
read-only. Just restore the behavior to situation before commit
c3b004460d ("quota: Remove dqonoff_mutex") which introduced the
warning.
The code in dquot_disable() can be further simplified with the new
locking of quota state changes however let's leave that to a separate
commit that can get more testing exposure.
Fixes: c3b004460d
Signed-off-by: Jan Kara <jack@suse.cz>
Pull partial readlink cleanups from Miklos Szeredi.
This is the uncontroversial part of the readlink cleanup patch-set that
simplifies the default readlink handling.
Miklos and Al are still discussing the rest of the series.
* git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
vfs: make generic_readlink() static
vfs: remove ".readlink = generic_readlink" assignments
vfs: default to generic_readlink()
vfs: replace calling i_op->readlink with vfs_readlink()
proc/self: use generic_readlink
ecryptfs: use vfs_get_link()
bad_inode: add missing i_op initializers
Pull more vfs updates from Al Viro:
"In this pile:
- autofs-namespace series
- dedupe stuff
- more struct path constification"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
ocfs2: charge quota for reflinked blocks
ocfs2: fix bad pointer cast
ocfs2: always unlock when completing dio writes
ocfs2: don't eat io errors during _dio_end_io_write
ocfs2: budget for extent tree splits when adding refcount flag
ocfs2: prohibit refcounted swapfiles
ocfs2: add newlines to some error messages
ocfs2: convert inode refcount test to a helper
simple_write_end(): don't zero in short copy into uptodate
exofs: don't mess with simple_write_{begin,end}
9p: saner ->write_end() on failing copy into non-uptodate page
fix gfs2_stuffed_write_end() on short copies
fix ceph_write_end()
nfs_write_end(): fix handling of short copies
vfs: refactor clone/dedupe_file_range common functions
fs: try to clone files first in vfs_copy_file_range
vfs: misc struct path constification
namespace.c: constify struct path passed to a bunch of primitives
quota: constify struct path in quota_on
...
Make sure that clone_mnt() never returns a mount with MNT_SHARED in
flags, but without a valid ->mnt_group_id. That allows to demystify
do_make_slave() quite a bit, among other things.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
(myself). Also fixed a deadlock caused by a bogus allocation on the
writeback path and authorize reply verification.
- a fix for long stalls during fsync (Jeff Layton). The client now
has a way to force the MDS log flush, leading to ~100x speedups in
some synthetic tests.
- a new [no]require_active_mds mount option (Zheng Yan). On mount, we
will now check whether any of the MDSes are available and bail rather
than block if none are. This check can be avoided by specifying the
"no" option.
- a couple of MDS cap handling fixes and a few assorted patches
throughout.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJYVByGAAoJEEp/3jgCEfOLBqkH/A7nVf7ObSDYmLuYgg1gJ8zq
4zDDE42S4yZwayAVpn3UjbfPuez5J44lsdXitExdfiHOdIQZDa/WqAbSqQ48HCSg
7sG6ecRWg3G5zG0psPZnB+S5wGMvsLXmj2hvzV1lt2t0lI5bDLSlNRSnElbhilD/
8Z7+Ni2go8DMC9o49SJU32lBW7IByKl4p4flveItgwUvGkIFNd8OT3CyPBUqonQs
lRCeImRYU8Jghb+ifnRxWSbuDf7pZAPc9kL0vibpUUT/1bH6iHsedKp37WQKqc/w
KDSNnKiZcz0gY/hJeLqE3ymCIKO6SU+JkMQSaYNTouLO5fQsRr8/uWQXSe6S5oc=
=ypWx
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A varied set of changes:
- a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
(myself). Also fixed a deadlock caused by a bogus allocation on the
writeback path and authorize reply verification.
- a fix for long stalls during fsync (Jeff Layton). The client now
has a way to force the MDS log flush, leading to ~100x speedups in
some synthetic tests.
- a new [no]require_active_mds mount option (Zheng Yan).
On mount, we will now check whether any of the MDSes are available
and bail rather than block if none are. This check can be avoided
by specifying the "no" option.
- a couple of MDS cap handling fixes and a few assorted patches
throughout"
* tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client: (32 commits)
libceph: remove now unused finish_request() wrapper
libceph: always signal completion when done
ceph: avoid creating orphan object when checking pool permission
ceph: properly set issue_seq for cap release
ceph: add flags parameter to send_cap_msg
ceph: update cap message struct version to 10
ceph: define new argument structure for send_cap_msg
ceph: move xattr initialzation before the encoding past the ceph_mds_caps
ceph: fix minor typo in unsafe_request_wait
ceph: record truncate size/seq for snap data writeback
ceph: check availability of mds cluster on mount
ceph: fix splice read for no Fc capability case
ceph: try getting buffer capability for readahead/fadvise
ceph: fix scheduler warning due to nested blocking
ceph: fix printing wrong return variable in ceph_direct_read_write()
crush: include mapper.h in mapper.c
rbd: silence bogus -Wmaybe-uninitialized warning
libceph: no need to drop con->mutex for ->get_authorizer()
libceph: drop len argument of *verify_authorizer_reply()
libceph: verify authorize reply on connect
...
Pull overlayfs updates from Miklos Szeredi:
"This update contains:
- try to clone on copy-up
- allow renaming a directory
- split source into managable chunks
- misc cleanups and fixes
It does not contain the read-only fd data inconsistency fix, which Al
didn't like. I'll leave that to the next year..."
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (36 commits)
ovl: fix reStructuredText syntax errors in documentation
ovl: fix return value of ovl_fill_super
ovl: clean up kstat usage
ovl: fold ovl_copy_up_truncate() into ovl_copy_up()
ovl: create directories inside merged parent opaque
ovl: opaque cleanup
ovl: show redirect_dir mount option
ovl: allow setting max size of redirect
ovl: allow redirect_dir to default to "on"
ovl: check for emptiness of redirect dir
ovl: redirect on rename-dir
ovl: lookup redirects
ovl: consolidate lookup for underlying layers
ovl: fix nested overlayfs mount
ovl: check namelen
ovl: split super.c
ovl: use d_is_dir()
ovl: simplify lookup
ovl: check lower existence of rename target
ovl: rename: simplify handling of lower/merged directory
...
Pull btrfs updates from Chris Mason:
"Jeff Mahoney and Dave Sterba have a really nice set of cleanups in
here, and Christoph pitched in corrections/improvements to make btrfs
use proper helpers for bio walking instead of doing it by hand.
There are some key fixes as well, including some long standing bugs
that took forever to track down in btrfs_drop_extents and during
balance"
* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (77 commits)
btrfs: limit async_work allocation and worker func duration
Revert "Btrfs: adjust len of writes if following a preallocated extent"
Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors
btrfs: opencode chunk locking, remove helpers
btrfs: remove root parameter from transaction commit/end routines
btrfs: split btrfs_wait_marked_extents into normal and tree log functions
btrfs: take an fs_info directly when the root is not used otherwise
btrfs: simplify btrfs_wait_cache_io prototype
btrfs: convert extent-tree tracepoints to use fs_info
btrfs: root->fs_info cleanup, access fs_info->delayed_root directly
btrfs: root->fs_info cleanup, add fs_info convenience variables
btrfs: root->fs_info cleanup, update_block_group{,flags}
btrfs: root->fs_info cleanup, lock/unlock_chunks
btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
btrfs: pull node/sector/stripe sizes out of root and into fs_info
btrfs: root->fs_info cleanup, io_ctl_init
btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
btrfs: struct reada_control.root -> reada_control.fs_info
btrfs: struct btrfsic_state->root should be an fs_info
btrfs: alloc_reserved_file_extent trace point should use extent_root
...
that makes ACL inheritance a little more useful in environments that
default to restrictive umasks. Requires client-side support, also on
its way for 4.10.
Other than that, miscellaneous smaller fixes and cleanup, especially to
the server rdma code.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYVAEqAAoJECebzXlCjuG+VM0QAKaR+ibSM31Ahpnrgit5/wrb
n630KDFztO7iqEeuHfPQ4/n05T2QR0JWsLpjLMFvx88Gy4gyXYk9cuDPIrNKX1IS
3/nnhBo0+EVnjODjufommCrtbPZlqOSsS3N03vWkB7rTi8QYsWBOThh+XLRJYOXo
LZzJE1WmXNeCXV1kXPBsauryywql1fmwTXBzmIf1HbzoGAVROMEA2qqh4Z3nb7BP
sJuGchWx0STBOuAa278ighXQPUW2lUft9uzw2bssOtMwfNyOs/Pd6nx4F1Lg6WwD
1UQXoiR8K3PqelZfoeFJ05v0css/sbNKep+huWRdOXZj3Kjpa20lKBX8xHfat7sN
1OQ4FHx8ToigX3c+wwtlCqRMCcIxqUYkRjqzPHyeBiSSSp0rLrId44rI5x/K0yay
3bkGw7hFDSzc0Nq2uZgmtlbyTC71hLNhkWe7ThofcVG/pS0JtAqBiKIVwXJPh/e0
PLmVHYGU6Xowjag5edJlXY1tlIlxtWfqsWUarCXS5bfKUa3UjMVSjyuljsDqqJsn
96fEWu7DiUo4HeGYmf8MJoeZYV2y0DKSQGeguVkUKWp2DoTzinQHTfdKvrZVwNuu
hVE9/QeWzUvPY13HOUaKD2skozhbUChqv0NHESKUv8gxE3svTEpYZkXrE74WNqMk
l/WXAhw+RdKZof4+qdjU
=JANY
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"The one new feature is support for a new NFSv4.2 mode_umask attribute
that makes ACL inheritance a little more useful in environments that
default to restrictive umasks. Requires client-side support, also on
its way for 4.10.
Other than that, miscellaneous smaller fixes and cleanup, especially
to the server rdma code"
[ The client side of the umask attribute was merged yesterday ]
* tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux:
nfsd: add support for the umask attribute
sunrpc: use DEFINE_SPINLOCK()
svcrdma: Further clean-up of svc_rdma_get_inv_rkey()
svcrdma: Break up dprintk format in svc_rdma_accept()
svcrdma: Remove unused variable in rdma_copy_tail()
svcrdma: Remove unused variables in xprt_rdma_bc_allocate()
svcrdma: Remove svc_rdma_op_ctxt::wc_status
svcrdma: Remove DMA map accounting
svcrdma: Remove BH-disabled spin locking in svc_rdma_send()
svcrdma: Renovate sendto chunk list parsing
svcauth_gss: Close connection when dropping an incoming message
svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm
nfsd: constify reply_cache_stats_operations structure
nfsd: update workqueue creation
sunrpc: GFP_KERNEL should be GFP_NOFS in crypto code
nfsd: catch errors in decode_fattr earlier
nfsd: clean up supported attribute handling
nfsd: fix error handling for clients that fail to return the layout
nfsd: more robust allocation failure handling in nfsd_reply_cache_init
Pull vfs updates from Al Viro:
- more ->d_init() stuff (work.dcache)
- pathname resolution cleanups (work.namei)
- a few missing iov_iter primitives - copy_from_iter_full() and
friends. Either copy the full requested amount, advance the iterator
and return true, or fail, return false and do _not_ advance the
iterator. Quite a few open-coded callers converted (and became more
readable and harder to fuck up that way) (work.iov_iter)
- several assorted patches, the big one being logfs removal
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
logfs: remove from tree
vfs: fix put_compat_statfs64() does not handle errors
namei: fold should_follow_link() with the step into not-followed link
namei: pass both WALK_GET and WALK_MORE to should_follow_link()
namei: invert WALK_PUT logics
namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
namei: saner calling conventions for mountpoint_last()
namei.c: get rid of user_path_parent()
switch getfrag callbacks to ..._full() primitives
make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
[iov_iter] new primitives - copy_from_iter_full() and friends
don't open-code file_inode()
ceph: switch to use of ->d_init()
ceph: unify dentry_operations instances
lustre: switch to use of ->d_init()
If kcalloc() failed, the return value of ovl_fill_super() is -EINVAL,
not -ENOMEM. So this patch sets this value to -ENOMEM before calling
kcalloc(), and sets it back to -EINVAL after calling kcalloc().
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
FWIW, there's a bit of abuse of struct kstat in overlayfs object
creation paths - for one thing, it ends up with a very small subset
of struct kstat (mode + rdev), for another it also needs link in
case of symlinks and ends up passing it separately.
IMO it would be better to introduce a separate object for that.
In principle, we might even lift that thing into general API and switch
->mkdir()/->mknod()/->symlink() to identical calling conventions. Hell
knows, perhaps ->create() as well...
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The benefit of making directories opaque on creation is that lookups can
stop short when they reach the original created directory, instead of
continue lookup the entire depth of parent directory stack.
The best case is overlay with N layers, performing lookup for first level
directory, which exists only in upper. In that case, there will be only
one lookup instead of N.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
oe->opaque is set for
a) whiteouts
b) directories having the "trusted.overlay.opaque" xattr
Case b can be simplified, since setting the xattr always implies setting
oe->opaque. Also once set, the opaque flag is never cleared.
Don't need to set opaque flag for non-directories.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add a module option to allow tuning the max size of absolute redirects.
Default is 256.
Size of relative redirects is naturally limited by the the underlying
filesystem's max filename length (usually 255).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch introduces a kernel config option and a module param. Both can
be used independently to turn the default value of redirect_dir on or off.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Before introducing redirect_dir feature, the condition
!ovl_lower_positive(dentry) for a directory, implied that it is a pure
upper directory, which may be removed if empty.
Now that directory can be redirect, it is possible that upper does not
cover any lower (i.e. !ovl_lower_positive(dentry)), but the directory is a
merge (with redirected path) and maybe non empty.
Check for this case in ovl_remove_upper().
This change fixes the following test case from rename-pop-dir.py
of unionmount-testsuite:
"""Remove dir and rename old name"""
d = ctx.non_empty_dir()
d2 = ctx.no_dir()
ctx.rmdir(d, err=ENOTEMPTY)
ctx.rename(d, d2)
ctx.rmdir(d, err=ENOENT)
ctx.rmdir(d2, err=ENOTEMPTY)
./run --ov rename-pop-dir
/mnt/a/no_dir103: Expected error (Directory not empty) was not produced
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Current code returns EXDEV when a directory would need to be copied up to
move. We could copy up the directory tree in this case, but there's
another, simpler solution: point to old lower directory from moved upper
directory.
This is achieved with a "trusted.overlay.redirect" xattr storing the path
relative to the root of the overlay. After such attribute has been set,
the directory can be moved without further actions required.
This is a backward incompatible feature, old kernels won't be able to
correctly mount an overlay containing redirected directories.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If a directory has the "trusted.overlay.redirect" xattr, it means that the
value of the xattr should be used to find the underlying directory on the
next lower layer.
The redirect may be relative or absolute. Absolute redirects begin with a
slash.
A relative redirect means: instead of the current dentry's name use the
value of the redirect to find the directory in the next lower
layer. Relative redirects must not contain a slash.
An absolute redirect means: look up the directory relative to the root of
the overlay using the value of the redirect in the next lower layer.
Redirects work on lower layers as well.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use a common helper for lookup of upper and lower layers. This paves the
way for looking up directory redirects.
No functional change.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When the upper overlayfs checks "trusted.overlay.*" xattr on the underlying
overlayfs mount, it gets -EPERM, which confuses the upper overlayfs.
Fix this by returning -EOPNOTSUPP instead of -EPERM from
ovl_own_xattr_get() and ovl_own_xattr_set(). This behavior is consistent
with the behavior of ovl_listxattr(), which filters out the private
overlayfs xattrs.
Note: nested overlays are deprecated. But this change makes sense
regardless: these xattrs are private to the overlay and should always be
hidden. Hence getting and setting them should indicate this.
[SzMi: Use EOPNOTSUPP instead of ENODATA and use it for both getting and
setting "trusted.overlay." xattrs. This is a perfectly valid error code
for "we don't support this prefix", which is the case here.]
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We already calculate f_namelen in statfs as the maximum of the name lengths
provided by the filesystems taking part in the overlay.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fs/overlayfs/super.c is the biggest of the overlayfs source files and it
contains various utility functions as well as the rather complicated lookup
code. Split these parts out to separate files.
Before:
1446 fs/overlayfs/super.c
After:
919 fs/overlayfs/super.c
267 fs/overlayfs/namei.c
235 fs/overlayfs/util.c
51 fs/overlayfs/ovl_entry.h
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If encountering a non-directory, then stop looking at lower layers.
In this case the oe->opaque flag is not set anymore, which doesn't matter
since existence of lower file is now checked at remove/rename time.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Check if something exists on the lower layer(s) under the target or rename
to decide if directory needs to be marked "opaque".
Marking opaque is done before the rename, and on failure the marking was
undone. Also the opaque xattr was removed if the target didn't cover
anything.
This patch changes behavior so that removal of "opaque" is not done in
either of the above cases. This means that directory may have the opaque
flag even if it doesn't cover anything. However this shouldn't affect the
performance or semantics of the overalay, while simplifying the code.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
d_is_dir() is safe to call on a negative dentry. Use this fact to simplify
handling of the lower or merged directories.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently ovl_lookup() checks existence of lower file even if there's a
non-directory on upper (which is always opaque). This is done so that
remove can decide whether a whiteout is needed or not.
It would be better to defer this check to unlink, since most of the time
the gathered information about opaqueness will be unused.
This adds a helper ovl_lower_positive() that checks if there's anything on
the lower layer(s).
The following patches also introduce changes to how the "opaque" attribute
is updated on directories: this attribute is added when the directory is
creted or moved over a whiteout or object covering something on the lower
layer. However following changes will allow the attribute to remain on the
directory after being moved, even if the new location doesn't cover
anything. Because of this, we need to check lower layers even for opaque
directories, so that whiteout is only created when necessary.
This function will later be also used to decide about marking a directory
opaque, so deal with negative dentries as well. When dealing with
negative, it's enough to check for being a whiteout
If the dentry is positive but not upper then it also obviously needs
whiteout/opaque.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since commit 07a2daab49 ("ovl: Copy up underlying inode's ->i_mode to
overlay inode") sticky checking on overlay inode is performed by the vfs,
so checking against sticky on underlying inode is not needed.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is redundant, the vfs already performed this check (and was broken,
see commit 9409e22acd ("vfs: rename: check backing inode being equal")).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
No sense in opening special files on the underlying layers, they work just
as well if opened on the overlay.
Side effect is that it's no longer possible to connect one side of a pipe
opened on overlayfs with the other side opened on the underlying layer.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When copying up within the same fs, try to use vfs_clone_file_range().
This is very efficient when lower and upper are on the same fs
with file reflink support. If vfs_clone_file_range() fails for any
reason, copy up falls back to the regular data copy code.
Tested correct behavior when lower and upper are on:
1. same ext4 (copy)
2. same xfs + reflink patches + mkfs.xfs (copy)
3. same xfs + reflink patches + mkfs.xfs -m reflink=1 (reflink)
4. different xfs + reflink patches + mkfs.xfs -m reflink=1 (copy)
For comparison, on my laptop, xfstest overlay/001 (copy up of large
sparse files) takes less than 1 second in the xfs reflink setup vs.
25 seconds on the rest of the setups.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This reverts commit 03bea60409.
Commit 4d0c5ba2ff ("vfs: do get_write_access() on upper layer of
overlayfs") makes the writecount checks inside overlayfs superfluous, the
file is already copied up and write access acquired on the upper inode when
ovl_setattr is called with ATTR_SIZE.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With overlayfs, it is wrong to compare file_inode(inode)->i_sb
of regular files with those of non-regular files, because the
former reference the real (upper/lower) sb and the latter reference
the overlayfs sb.
Move the test for same super block after the sanity tests for
clone range of directory and non-regular file.
This change fixes xfstest generic/157, which returned EXDEV instead
of EISDIR/EINVAL in the following test cases over overlayfs:
echo "Try to reflink a dir"
_reflink_range $testdir1/dir1 0 $testdir1/file2 0 $blksz
echo "Try to reflink a device"
_reflink_range $testdir1/dev1 0 $testdir1/file2 0 $blksz
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Move sb_start_write()/sb_end_write() out of the vfs helper and up into the
ioctl handler.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
FICLONE/FICLONERANGE ioctls return -EXDEV if src and dest
files are not on the same mount point.
Practically, clone only requires that src and dest files
are on the same file system.
Move the check for same mount point to ioctl handler and keep
only the check for same super block in the vfs helper.
A following patch is going to use the vfs_clone_file_range()
helper in overlayfs to copy up between lower and upper
mount points on the same file system.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We've checked for file_out being opened for write. This ensures that we
already have mnt_want_write() on target.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This reverts commit 9409e22acd.
Since commit 51f7e52dc9 ("ovl: share inode for hard link") there's no
need to call d_real_inode() to check two overlay inodes for equality.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Highlights include:
Stable bugfixes:
- Fix a pnfs deadlock between read resends and layoutreturn
- Don't invalidate the layout stateid while a layout return is outstanding
- Don't schedule a layoutreturn if the layout stateid is marked as invalid
- On a pNFS error, do not send LAYOUTGET until the LAYOUTRETURN is complete
- SUNRPC: fix refcounting problems with auth_gss messages.
Features:
- Add client support for the NFSv4 umask attribute.
- NFSv4: Correct support for flock() stateids.
- Add a LAYOUTRETURN operation to CLOSE and DELEGRETURN when return-on-close
is specified
- Allow the pNFS/flexfiles layoutstat information to piggyback on LAYOUTRETURN
- Optimise away redundant GETATTR calls when doing state recovery and/or
when not required by cache revalidation rules or close-to-open cache
consistency.
- Attribute cache improvements
- RPC/RDMA support for SG_GAP devices
Bugfixes:
- NFS: Fix performance regressions in readdir
- pNFS/flexfiles: Fix a deadlock on LAYOUTGET
- NFSv4: Add missing nfs_put_lock_context()
- NFSv4.1: Fix regression in callback retry handling
- Fix false positive NFSv4.0 trunking detection.
- pNFS/flexfiles: Only send layoutstats updates for mirrors that were updated
- Various layout stateid related bugfixes
- RPC/RDMA bugfixes
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYUyemAAoJEGcL54qWCgDy96wP/Ry86cknfLUqLKJCbFVV4nV8
HdovCY8if8JQO0HUPDJ25ITvoRJNVRRwJMWnVq5XHRrPUHletDks6/UYfa63UDMv
umHvGST1cQPU1G+vBIQ3sdkVi1X1GeyBY4rU8aDWxLyKWwyeNptCK12i80ifyaGV
GZIIxuKVDOFS15M7NwMPRkrBacF8TyVK6S7275z6ZNmhFtvYwMAbvMxLabTwWAe8
4A03m4RDBTYhQIc2xLJbHfOTYoHi34l90wrn3C7Wv0I2zp8EJlzCY2tSbYKhfPg7
0HVKNdruRL+cHwLwJEcjFbxOg9MArgRxyup3dwAYQq7Ivsf9oR8/D61CDhanXAzy
cAWyrCyxaAoPWCOb8k4OFRh6jOF9LBGb5WTNpXRi1LoGrbvi6/WLlJccV60325wd
gmSAiwIE7aLG8pFk54J0Et86VaQ6qQNBUtJY/4m87uf1FSv3yzQvh7qDr7s+t8ZQ
kmSTZJzMWZLEEeyvEPZCfjygFu7n4PuTePJu31217styvat39TpY2p0HaaMhgC0V
/Y0ygGH7VlGp0oaVQ70CtBzGsCWTKU2DU8di7nvsCKg6iLv89QBILIJhVeP42tKd
juNCWVw4bpW1Zex7HXKecKfMXkDJ4qSDLFzGWj6Ue85f/rCOSKQH01jfvwBlvtBc
3E6fk85ExTw2+siHWiGy
=MapM
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable bugfixes:
- Fix a pnfs deadlock between read resends and layoutreturn
- Don't invalidate the layout stateid while a layout return is
outstanding
- Don't schedule a layoutreturn if the layout stateid is marked as
invalid
- On a pNFS error, do not send LAYOUTGET until the LAYOUTRETURN is
complete
- SUNRPC: fix refcounting problems with auth_gss messages.
Features:
- Add client support for the NFSv4 umask attribute.
- NFSv4: Correct support for flock() stateids.
- Add a LAYOUTRETURN operation to CLOSE and DELEGRETURN when
return-on-close is specified
- Allow the pNFS/flexfiles layoutstat information to piggyback on
LAYOUTRETURN
- Optimise away redundant GETATTR calls when doing state recovery
and/or when not required by cache revalidation rules or
close-to-open cache consistency.
- Attribute cache improvements
- RPC/RDMA support for SG_GAP devices
Bugfixes:
- NFS: Fix performance regressions in readdir
- pNFS/flexfiles: Fix a deadlock on LAYOUTGET
- NFSv4: Add missing nfs_put_lock_context()
- NFSv4.1: Fix regression in callback retry handling
- Fix false positive NFSv4.0 trunking detection.
- pNFS/flexfiles: Only send layoutstats updates for mirrors that were
updated
- Various layout stateid related bugfixes
- RPC/RDMA bugfixes"
* tag 'nfs-for-4.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (82 commits)
SUNRPC: fix refcounting problems with auth_gss messages.
nfs: add support for the umask attribute
pNFS/flexfiles: Ensure we have enough buffer for layoutreturn
pNFS/flexfiles: Remove a redundant parameter in ff_layout_encode_ioerr()
pNFS/flexfiles: Fix a deadlock on LAYOUTGET
pNFS: Layoutreturn must free the layout after the layout-private data
pNFS/flexfiles: Fix ff_layout_add_ds_error_locked()
NFSv4: Add missing nfs_put_lock_context()
pNFS: Release NFS_LAYOUT_RETURN when invalidating the layout stateid
NFSv4.1: Don't schedule lease recovery in nfs4_schedule_session_recovery()
NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE
NFS: Only look at the change attribute cache state in nfs_check_verifier
NFS: Fix incorrect size revalidation when holding a delegation
NFS: Fix incorrect mapping revalidation when holding a delegation
pNFS/flexfiles: Support sending layoutstats in layoutreturn
pNFS/flexfiles: Minor refactoring before adding iostats to layoutreturn
NFS: Fix up read of mirror stats
pNFS/flexfiles: Clean up layoutstats
pNFS/flexfiles: Refactor encoding of the layoutreturn payload
pNFS: Add a layoutreturn callback to performa layout-private setup
...
This includes the new virtio crypto device, and fixes all over the
place. In particular enabling endian-ness checks for sparse builds
found some bugs which this fixes. And it appears that everyone is in
agreement that disabling endian-ness sparse checks shouldn't be
necessary any longer.
So this enables them for everyone, and drops __CHECK_ENDIAN__
and __bitwise__ APIs.
IRQ handling in virtio has been refactored somewhat, the
larger switch to IRQ_SHARED will have to wait as
it proved too aggressive.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJYUxYEAAoJECgfDbjSjVRp5lgH/22HKRyb3+M+z3oH6R9rJmz5
T4y3XI4yDOTlh93VzxlrHjHNBnoWRvzV5hn6BKH6bTbSZ87TabNhfws11FKGvhER
G1ipl/DvwytvvWgZ5dFdcC4x/0wpWawt2jgpEpPP33VDVkGJFEEAGj6GX10ClX99
ggrNfzUCHOAFaIWzC29i7gYMnYHIJDUqK6ycDxZebzsE/c12SNRGASxei2D+6eYC
YkdVg0c/d7Wsk+ZO1ugiA6omO4UdvPAVvxUkvd4YphRikwEWH7gGuz558wiSo4VN
iEMZvyYXSEjx4B2Hg8+mH63zWROEpCmaToUix9+4AF7YhkaeX5fICNdkAPdtxc8=
=urXH
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"virtio, vhost: new device, fixes, speedups
This includes the new virtio crypto device, and fixes all over the
place. In particular enabling endian-ness checks for sparse builds
found some bugs which this fixes. And it appears that everyone is in
agreement that disabling endian-ness sparse checks shouldn't be
necessary any longer.
So this enables them for everyone, and drops the __CHECK_ENDIAN__ and
__bitwise__ APIs.
IRQ handling in virtio has been refactored somewhat, the larger switch
to IRQ_SHARED will have to wait as it proved too aggressive"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (34 commits)
Makefile: drop -D__CHECK_ENDIAN__ from cflags
fs/logfs: drop __CHECK_ENDIAN__
Documentation/sparse: drop __CHECK_ENDIAN__
linux: drop __bitwise__ everywhere
checkpatch: replace __bitwise__ with __bitwise
Documentation/sparse: drop __bitwise__
tools: enable endian checks for all sparse builds
linux/types.h: enable endian checks for all sparse builds
virtio_mmio: Set dev.release() to avoid warning
vhost: remove unused feature bit
virtio_ring: fix description of virtqueue_get_buf
vhost/scsi: Remove unused but set variable
tools/virtio: use {READ,WRITE}_ONCE() in uaccess.h
vringh: kill off ACCESS_ONCE()
tools/virtio: fix READ_ONCE()
crypto: add virtio-crypto driver
vhost: cache used event for better performance
vsock: lookup and setup guest_cid inside vhost_vsock_lock
virtio_pci: split vp_try_to_find_vqs into INTx and MSI-X variants
virtio_pci: merge vp_free_vectors into vp_del_vqs
...
Clients can set the umask attribute when creating files to cause the
server to apply it always except when inheriting permissions from the
parent directory. That way, the new files will end up with the same
permissions as files created locally.
See https://tools.ietf.org/html/draft-ietf-nfsv4-umask-02 for more
details.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This patch fixes a place where function gfs2_glock_iter_next can
reference an invalid error pointer.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
1. Axe some dead code: christophe.jaillet@wanadoo.fr
2. fix memory leak: colin.king@canonical.com (found by Coverity)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJYUsxVAAoJEM9EDqnrzg2+F2AP/2fEGe9K0hiFiOFqeCWlGmI0
cJ8EXdbhzHqd8V3S4HLls3UlJ/4FjE99w4L1CpB42THy7FJ5wdMk4zObDAYLkgEs
euhTSe5MJTaNfIxDwna4t6a62Sdsu1wU7neVqT7A+ku1g3DcDoOgV8cOFV+j2wNO
/AtvG46XPjL4JooP1H/4xFG4saF6qLgi7WgKK0emLM8ulfMyYyh1GktuyA7YSRJy
tJVZ6a13C7J2G2W6N9KAD3ZQ3oaLwbghrL2k2ZNawmHr/2epxFhrlB8guWsl8hz9
dsSbA3TgFummxxI8YJjzdZnlNUkE/yWiMckcTTT02SSxewZtnzyMYfnSS1M7LmP4
/r1d8H9J/kfXFQBDXHKJe/kgBlF+uiiahQFpW9dGrXPkqzLfLq949GOj6Dc/j5bI
cPtanoUGDWu3MMyEKnTI+ZFBjET+YDFlFzJPiUXDKK4LnPCiwvvx20+lY5k8YGli
6nR9Pnjky6SC1zW7Wz3ngB+j44aPqkMbfFJxg0tGD2x5+gDTVI05nrmg5YrMv5IS
ildQDL6JdPonkk164sGedg6OWsaqWNcK0I/jX85ogjuYbQc8wUM/F2uDZGfrSnpH
XvmHeRKeOQ+UEkP1Hq+dI9RJreharkQKuDjUgBKLnWmx0pvcdnDlSiMnGdY2YHHI
n/KvTe7iHMW7QhHBpfn/
=GHrD
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.10-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"Two small fixes sent in by other developers:
- axe some dead code (Christophe Jaillet)
- fix memory leak (Colin Ian King, found by Coverity)"
* tag 'for-linus-4.10-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: Axe some dead code
orangefs: fix memory leak of string 'new' on exit path
* File encryption for UBIFS using the fscrypt framework
* A fix to honor the dirty_writeback_interval sysctl
* Removal of dead code
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJYUrkeAAoJEEtJtSqsAOnWlOYP/jdf3jljSDZZUbLAxZDzpYWx
lS86+Koht8n/FcOu2JLZJalwCxXDkqe09cNg7anGAT3pPnbrJbOCpKNj5V5CAxfX
I6rNiKhNFwtcVfH9Snv0O5qIfxk+uNxApGkULLE+bqpNdSYsjGxeI+f/7TARa3UU
QOlP3UM2dVcqwCUpjJDojVQ3rFnKR91J8KNkBbcq1Yzg5DuiyZowMYBuXSzxVGJ5
wAP3VNPd38s9/PlFvr+sHaLOdCp5GOtT5WrkCidFR8wqzCWtKgoU+w5alz6jz8XN
uJ/5J9qw1SPotZs6vqRCTll+eGLheVhpkpFRnPosHL1DhS9OrH1T9OZUWgZrfUGM
eEYhEwvwYwZiywsNUa5Evh1I6jMLvgH3xP/61/BO6xRYOyXa7hkb9y7SEKUm402t
fUF2V91e8ndNGUp879hgOBuHhQebjW6IIY2j6KFzRQgnswDkKN5Vbjd57/kTFC6m
JvrUE/mNDj7zHAG07JShnVyDGkmV71kD3Ixrr99L1DqJaiOoTzc8Vqch6y5wcdwy
GRB4580z3YgdoLNi7VxZi79Qpz1DkEMNI8B30yhnQF2zwpuDwkj9lJS/DKRurA+W
OW/DPZjuSG3kLApErfwjzfcTpbOzX3App+fx3lwvsTXt/qqEDxOuo16NrSJE5PK5
gjviqfL4UMwY6FkVlJDB
=xBCi
-----END PGP SIGNATURE-----
Merge tag 'upstream-4.10-rc1' of git://git.infradead.org/linux-ubifs
Pull ubifs updates from Richard Weinberger:
- file encryption for UBIFS using the fscrypt framework
- a fix to honor the dirty_writeback_interval sysctl
- removal of dead code
* tag 'upstream-4.10-rc1' of git://git.infradead.org/linux-ubifs: (30 commits)
ubifs: Initialize fstr_real_len
ubifs: Use fscrypt ioctl() helpers
ubifs: Use FS_CFLG_OWN_PAGES
ubifs: Raise write version to 5
ubifs: Implement UBIFS_FLG_ENCRYPTION
ubifs: Implement UBIFS_FLG_DOUBLE_HASH
ubifs: Use a random number for cookies
ubifs: Add full hash lookup support
ubifs: Rename tnc_read_node_nm
ubifs: Add support for encrypted symlinks
ubifs: Implement encrypted filenames
ubifs: Make r5 hash binary string aware
ubifs: Relax checks in ubifs_validate_entry()
ubifs: Implement encrypt/decrypt for all IO
ubifs: Constify struct inode pointer in ubifs_crypt_is_encrypted()
ubifs: Introduce new data node field, compr_size
ubifs: Enforce crypto policy in mmap
ubifs: Massage assert in ubifs_xattr_set() wrt. fscrypto
ubifs: Preload crypto context in ->lookup()
ubifs: Enforce crypto policy in ->link and ->rename
...
When a server returns the optional flag SMB_SHARE_IS_IN_DFS in response
to a tree connect, cifs_build_path_to_root() will return a pathname
which includes the hostname. This causes problems with cifs_get_root()
which separates each component and does a lookup for each component of
the path which in this case will incorrectly include looking up the
hostname component as a path component.
We encountered a problem with dfs shares hosted by a Netapp. When
connecting to nodes pointed to by the DFS share. The tree connect for
these nodes return SMB_SHARE_IS_IN_DFS resulting failures in lookup
in cifs_get_root().
RH bz: 1373153
The patch was tested against a Netapp simulator and by a user using an
actual Netapp server.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reported-by: Pierguido Lambri <plambri@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
With commit 2b149f119 many things have been fixed/introduced.
However, the default behaviour for RawNTLMSSP authentication
seems to be wrong in case the domain is not passed on the command line.
The main points (see below) of the patch are:
- It alignes behaviour with Windows clients
- It fixes backward compatibility
- It fixes UPN
I compared this behavour with the one from a Windows 10 command line
client. When no domains are specified on the command line, I traced
the packets and observed that the client does send an empty
domain to the server.
In the linux kernel case, the empty domain is replaced by the
primary domain communicated by the SMB server.
This means that, if the credentials are valid against the local server
but that server is part of a domain, then the kernel module will
ask to authenticate against that domain and we will get LOGON failure.
I compared the packet trace from the smbclient when no domain is passed
and, in that case, a default domain from the client smb.conf is taken.
Apparently, connection succeeds anyway, because when the domain passed
is not valid (in my case WORKGROUP), then the local one is tried and
authentication succeeds. I tried with any kind of invalid domain and
the result was always a connection.
So, trying to interpret what to do and picking a valid domain if none
is passed, seems the wrong thing to do.
To this end, a new option "domainauto" has been added in case the
user wants a mechanism for guessing.
Without this patch, backward compatibility also is broken.
With kernel 3.10, the default auth mechanism was NTLM.
One of our testing servers accepted NTLM and, because no
domains are passed, authentication was local.
Moving to RawNTLMSSP forced us to change our command line
to add a fake domain to pass to prevent this mechanism to kick in.
For the same reasons, UPN is broken because the domain is specified
in the username.
The SMB server will work out the domain from the UPN and authenticate
against the right server.
Without the patch, though, given the domain is empty, it gets replaced
with another domain that could be the wrong one for the authentication.
Signed-off-by: Germano Percossi <germano.percossi@citrix.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Passing a gazillion arguments takes a lot of code:
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-253 (-253)
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Contained in this update:
- DAX PMD vaults via iomap infrastructure
- Direct-io support in iomap infrastructure
- removal of now-redundant XFS inode iolock, replaced with VFS i_rwsem
- synchronisation with fixes and changes in userspace libxfs code
- extent tree lookup helpers
- lots of little corruption detection improvements to verifiers
- optimised CRC calculations
- faster buffer cache lookups
- deprecation of barrier/nobarrier mount options - we always use
REQ_FUA/REQ_FLUSH where appropriate for data integrity now
- cleanups to speculative preallocation
- miscellaneous minor bug fixes and cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJYUgqdAAoJEK3oKUf0dfodQgsP/1dJ4qUc6cRk8kL+f10FoIek
oFzdViRHZj8cROGe2n2YTBJtPa9KjU5DNHnxaxWZBN4ZpItp/uN1sAQhgtNQ4/cN
C3JF6B/+/dIbNSbd7DwvSl0dMWknzmrB+Myfs2ZPpMA1S4GInk1MOJSj7AQdYAvJ
dS0dQWAuIB20cahwuGA4y7zUniYL1IcF/BH8hlmzpcUNUoJ9AkR1hTg5/aVfmga3
w2p1vZyT2E4xs/Ff4FYW5MzPGxLVQMZVNIAXAcJl+c61z46ndXqidSmVHGvc+Tlt
ouxftHy/7KqowZlCFss1pSXg9HlXHhjS+iLbZerfcjO2qldriZS+QqQyASmQzPAz
+PpnMfVOj+yjsXKyIHWuS1G35aV16pPWwdA0ECeU6yv9iZ7tSz5rvSrsPZPLFz4x
RVhcKbmXR3y8DugkmtznU5ozxPt5hbbstEV3leCzxJpZu5reRJThUW7nYkSd0CEJ
ZyT/GP6Aq/MM8O/hOgVutAH409dsrYok8m/lq1J7VbNUt8inylcsMWsBeX/0/AHY
aC7I2Vx8bnbfL+C8wYKYhuShOGSch93O5hDUXdH2K/Sm5cK4y2asWge6MfFsS6Lu
waVYwd5aYBlNbzkvUMm2I5EV4cCCR3YwWYwfBEP7kPYUDxN14huOz6lVXnQPDLQ1
qsV1aNfK9PPiw6Fcaop0
=HwDG
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"There is quite a varied bunch of stuff in this update, and some of it
you will have already merged through the ext4 tree which imported the
dax-4.10-iomap-pmd topic branch from the XFS tree.
There is also a new direct IO implementation that uses the iomap
infrastructure. It's much simpler, faster, and has lower IO latency
than the existing direct IO infrastructure.
Summary:
- DAX PMD faults via iomap infrastructure
- Direct-io support in iomap infrastructure
- removal of now-redundant XFS inode iolock, replaced with VFS
i_rwsem
- synchronisation with fixes and changes in userspace libxfs code
- extent tree lookup helpers
- lots of little corruption detection improvements to verifiers
- optimised CRC calculations
- faster buffer cache lookups
- deprecation of barrier/nobarrier mount options - we always use
REQ_FUA/REQ_FLUSH where appropriate for data integrity now
- cleanups to speculative preallocation
- miscellaneous minor bug fixes and cleanups"
* tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (63 commits)
xfs: nuke unused tracepoint definitions
xfs: use GPF_NOFS when allocating btree cursors
xfs: use xfs_vn_setattr_size to check on new size
xfs: deprecate barrier/nobarrier mount option
xfs: Always flush caches when integrity is required
xfs: ignore leaf attr ichdr.count in verifier during log replay
xfs: use rhashtable to track buffer cache
xfs: optimise CRC updates
xfs: make xfs btree stats less huge
xfs: don't cap maximum dedupe request length
xfs: don't allow di_size with high bit set
xfs: error out if trying to add attrs and anextents > 0
xfs: don't crash if reading a directory results in an unexpected hole
xfs: complain if we don't get nextents bmap records
xfs: check for bogus values in btree block headers
xfs: forbid AG btrees with level == 0
xfs: several xattr functions can be void
xfs: handle cow fork in xfs_bmap_trace_exlist
xfs: pass state not whichfork to trace_xfs_extlist
xfs: Move AGI buffer type setting to xfs_read_agi
...
Logfs was introduced to the kernel in 2009, and hasn't seen any non
drive-by changes since 2012, while having lots of unsolved issues
including the complete lack of error handling, with more and more
issues popping up without any fixes.
The logfs.org domain has been bouncing from a mail, and the maintainer
on the non-logfs.org domain hasn't repsonded to past queries either.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge more updates from Andrew Morton:
- a few misc things
- kexec updates
- DMA-mapping updates to better support networking DMA operations
- IPC updates
- various MM changes to improve DAX fault handling
- lots of radix-tree changes, mainly to the test suite. All leading up
to reimplementing the IDA/IDR code to be a wrapper layer over the
radix-tree. However the final trigger-pulling patch is held off for
4.11.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
radix tree test suite: delete unused rcupdate.c
radix tree test suite: add new tag check
radix-tree: ensure counts are initialised
radix tree test suite: cache recently freed objects
radix tree test suite: add some more functionality
idr: reduce the number of bits per level from 8 to 6
rxrpc: abstract away knowledge of IDR internals
tpm: use idr_find(), not idr_find_slowpath()
idr: add ida_is_empty
radix tree test suite: check multiorder iteration
radix-tree: fix replacement for multiorder entries
radix-tree: add radix_tree_split_preload()
radix-tree: add radix_tree_split
radix-tree: add radix_tree_join
radix-tree: delete radix_tree_range_tag_if_tagged()
radix-tree: delete radix_tree_locate_item()
radix-tree: improve multiorder iterators
btrfs: fix race in btrfs_free_dummy_fs_info()
radix-tree: improve dump output
radix-tree: make radix_tree_find_next_bit more useful
...
Pull block IO fixes from Jens Axboe:
"A few fixes that I collected as post-merge.
I was going to wait a bit with sending this out, but the O_DIRECT fix
should really go in sooner rather than later"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: Fix failed allocation path when mapping queues
blk-mq: Avoid memory reclaim when remapping queues
block_dev: don't update file access position for sync direct IO
nvme/pci: Log PCI_STATUS when the controller dies
block_dev: don't test bdev->bd_contains when it is not stable
Pull fs meta data unmap optimization from Jens Axboe:
"A series from Jan Kara, providing a more efficient way for unmapping
meta data from in the buffer cache than doing it block-by-block.
Provide a general helper that existing callers can use"
* 'for-4.10/fs-unmap' of git://git.kernel.dk/linux-block:
fs: Remove unmap_underlying_metadata
fs: Add helper to clean bdev aliases under a bh and use it
ext2: Use clean_bdev_aliases() instead of iteration
ext4: Use clean_bdev_aliases() instead of iteration
direct-io: Use clean_bdev_aliases() instead of handmade iteration
fs: Provide function to unmap metadata for a range of blocks
This fixes several interlinked problems with the iterators in the
presence of multiorder entries.
1. radix_tree_iter_next() would only advance by one slot, which would
result in the iterators returning the same entry more than once if
there were sibling entries.
2. radix_tree_next_slot() could return an internal pointer instead of
a user pointer if a tagged multiorder entry was immediately followed by
an entry of lower order.
3. radix_tree_next_slot() expanded to a lot more code than it used to
when multiorder support was compiled in. And I wasn't comfortable with
entry_to_node() being in a header file.
Fixing radix_tree_iter_next() for the presence of sibling entries
necessarily involves examining the contents of the radix tree, so we now
need to pass 'slot' to radix_tree_iter_next(), and we need to change the
calling convention so it is called *before* dropping the lock which
protects the tree. Also rename it to radix_tree_iter_resume(), as some
people thought it was necessary to call radix_tree_iter_next() each time
around the loop.
radix_tree_next_slot() becomes closer to how it looked before multiorder
support was introduced. It only checks to see if the next entry in the
chunk is a sibling entry or a pointer to a node; this should be rare
enough that handling this case out of line is not a performance impact
(and such impact is amortised by the fact that the entry we just
processed was a multiorder entry). Also, radix_tree_next_slot() used to
force a new chunk lookup for untagged entries, which is more expensive
than the out of line sibling entry skipping.
Link: http://lkml.kernel.org/r/1480369871-5271-55-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We drop the lock which protects the radix tree, so we must call
radix_tree_iter_next() in order to avoid a modification to the tree
invalidating the iterator state.
Link: http://lkml.kernel.org/r/1480369871-5271-54-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we never clear dirty tags in DAX mappings and thus address
ranges to flush accumulate. Now that we have locking of radix tree
entries, we have all the locking necessary to reliably clear the radix
tree dirty tag when flushing caches for corresponding address range.
Similarly to page_mkclean() we also have to write-protect pages to get a
page fault when the page is next written to so that we can mark the
entry dirty again.
Link: http://lkml.kernel.org/r/1479460644-25076-21-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently PTE gets updated in wp_pfn_shared() after dax_pfn_mkwrite()
has released corresponding radix tree entry lock. When we want to
writeprotect PTE on cache flush, we need PTE modification to happen
under radix tree entry lock to ensure consistent updates of PTE and
radix tree (standard faults use page lock to ensure this consistency).
So move update of PTE bit into dax_pfn_mkwrite().
Link: http://lkml.kernel.org/r/1479460644-25076-20-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, flushing of caches for DAX mappings was ignoring entry lock.
So far this was ok (modulo a bug that a difference in entry lock could
cause cache flushing to be mistakenly skipped) but in the following
patches we will write-protect PTEs on cache flushing and clear dirty
tags. For that we will need more exclusion. So do cache flushing under
an entry lock. This allows us to remove one lock-unlock pair of
mapping->tree_lock as a bonus.
Link: http://lkml.kernel.org/r/1479460644-25076-19-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move final handling of COW faults from generic code into DAX fault
handler. That way generic code doesn't have to be aware of
peculiarities of DAX locking so remove that knowledge and make locking
functions private to fs/dax.c.
Link: http://lkml.kernel.org/r/1479460644-25076-11-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every single user of vmf->virtual_address typed that entry to unsigned
long before doing anything with it so the type of virtual_address does
not really provide us any additional safety. Just use masked
vmf->address which already has the appropriate type.
Link: http://lkml.kernel.org/r/1479460644-25076-3-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we have two different structures for passing fault information
around - struct vm_fault and struct fault_env. DAX will need more
information in struct vm_fault to handle its faults so the content of
that structure would become event closer to fault_env. Furthermore it
would need to generate struct fault_env to be able to call some of the
generic functions. So at this point I don't think there's much use in
keeping these two structures separate. Just embed into struct vm_fault
all that is needed to use it for both purposes.
Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: unexport __get_user_pages_unlocked()".
This patch series continues the cleanup of get_user_pages*() functions
taking advantage of the fact we can now pass gup_flags as we please.
It firstly adds an additional 'locked' parameter to
get_user_pages_remote() to allow for its callers to utilise
VM_FAULT_RETRY functionality. This is necessary as the invocation of
__get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
this and no other existing higher level function would allow it to do
so.
Secondly existing callers of __get_user_pages_unlocked() are replaced
with the appropriate higher-level replacement -
get_user_pages_unlocked() if the current task and memory descriptor are
referenced, or get_user_pages_remote() if other task/memory descriptors
are referenced (having acquiring mmap_sem.)
This patch (of 2):
Add a int *locked parameter to get_user_pages_remote() to allow
VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().
Taking into account the previous adjustments to get_user_pages*()
functions allowing for the passing of gup_flags, we are now in a
position where __get_user_pages_unlocked() need only be exported for his
ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to
subsequently unexport __get_user_pages_unlocked() as well as allowing
for future flexibility in the use of get_user_pages_remote().
[sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 262c5e86fe ("printk/btrfs: handle more message headers")
triggers:
warning: `ratelimit' may be used uninitialized in this function
with gcc (4.1.2) and probably many other versions. The code actually is
correct but a bit twisted. Let's make it more straightforward and set
the default values at the beginning.
Link: http://lkml.kernel.org/r/20161213135246.GQ3506@pathway.suse.cz
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull namespace updates from Eric Biederman:
"After a lot of discussion and work we have finally reachanged a basic
understanding of what is necessary to make unprivileged mounts safe in
the presence of EVM and IMA xattrs which the last commit in this
series reflects. While technically it is a revert the comments it adds
are important for people not getting confused in the future. Clearing
up that confusion allows us to seriously work on unprivileged mounts
of fuse in the next development cycle.
The rest of the fixes in this set are in the intersection of user
namespaces, ptrace, and exec. I started with the first fix which
started a feedback cycle of finding additional issues during review
and fixing them. Culiminating in a fix for a bug that has been present
since at least Linux v1.0.
Potentially these fixes were candidates for being merged during the rc
cycle, and are certainly backport candidates but enough little things
turned up during review and testing that I decided they should be
handled as part of the normal development process just to be certain
there were not any great surprises when it came time to backport some
of these fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
Revert "evm: Translate user/group ids relative to s_user_ns when computing HMAC"
exec: Ensure mm->user_ns contains the execed files
ptrace: Don't allow accessing an undumpable mm
ptrace: Capture the ptracer's creds not PT_PTRACE_CAP
mm: Add a user_ns owner to mm_struct and fix ptrace permission checks
Pull audit updates from Paul Moore:
"After the small number of patches for v4.9, we've got a much bigger
pile for v4.10.
The bulk of these patches involve a rework of the audit backlog queue
to enable us to move the netlink multicasting out of the task/thread
that generates the audit record and into the kernel thread that emits
the record (just like we do for the audit unicast to auditd).
While we were playing with the backlog queue(s) we fixed a number of
other little problems with the code, and from all the testing so far
things look to be in much better shape now. Doing this also allowed us
to re-enable disabling IRQs for some netns operations ("netns: avoid
disabling irq for netns id").
The remaining patches fix some small problems that are well documented
in the commit descriptions, as well as adding session ID filtering
support"
* 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
audit: use proper refcount locking on audit_sock
netns: avoid disabling irq for netns id
audit: don't ever sleep on a command record/message
audit: handle a clean auditd shutdown with grace
audit: wake up kauditd_thread after auditd registers
audit: rework audit_log_start()
audit: rework the audit queue handling
audit: rename the queues and kauditd related functions
audit: queue netlink multicast sends just like we do for unicast sends
audit: fixup audit_init()
audit: move kaudit thread start from auditd registration to kaudit init (#2)
audit: add support for session ID user filter
audit: fix formatting of AUDIT_CONFIG_CHANGE events
audit: skip sessionid sentinel value when auto-incrementing
audit: tame initialization warning len_abuf in audit_log_execve_info
audit: less stack usage for /proc/*/loginuid
Pull security subsystem updates from James Morris:
"Generally pretty quiet for this release. Highlights:
Yama:
- allow ptrace access for original parent after re-parenting
TPM:
- add documentation
- many bugfixes & cleanups
- define a generic open() method for ascii & bios measurements
Integrity:
- Harden against malformed xattrs
SELinux:
- bugfixes & cleanups
Smack:
- Remove unnecessary smack_known_invalid label
- Do not apply star label in smack_setprocattr hook
- parse mnt opts after privileges check (fixes unpriv DoS vuln)"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (56 commits)
Yama: allow access for the current ptrace parent
tpm: adjust return value of tpm_read_log
tpm: vtpm_proxy: conditionally call tpm_chip_unregister
tpm: Fix handling of missing event log
tpm: Check the bios_dir entry for NULL before accessing it
tpm: return -ENODEV if np is not set
tpm: cleanup of printk error messages
tpm: replace of_find_node_by_name() with dev of_node property
tpm: redefine read_log() to handle ACPI/OF at runtime
tpm: fix the missing .owner in tpm_bios_measurements_ops
tpm: have event log use the tpm_chip
tpm: drop tpm1_chip_register(/unregister)
tpm: replace dynamically allocated bios_dir with a static array
tpm: replace symbolic permission with octal for securityfs files
char: tpm: fix kerneldoc tpm2_unseal_trusted name typo
tpm_tis: Allow tpm_tis to be bound using DT
tpm, tpm_vtpm_proxy: add kdoc comments for VTPM_PROXY_IOC_NEW_DEV
tpm: Only call pm_runtime_get_sync if device has a parent
tpm: define a generic open() method for ascii & bios measurements
Documentation: tpm: add the Physical TPM device tree binding documentation
...
r_safe_completion is currently, and has always been, signaled only if
on-disk ack was requested. It's there for fsync and syncfs, which wait
for in-flight writes to flush - all data write requests set ONDISK.
However, the pool perm check code introduced in 4.2 sends a write
request with only ACK set. An unfortunately timed syncfs can then hang
forever: r_safe_completion won't be signaled because only an unsafe
reply was requested.
We could patch ceph_osdc_sync() to skip !ONDISK write requests, but
that is somewhat incomplete and yet another special case. Instead,
rename this completion to r_done_completion and always signal it when
the OSD client is done with the request, whether unsafe, safe, or
error. This is a bit cleaner and helps with the cancellation code.
Reported-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pool permission check needs to write to the first object. But for
snapshot, head of the first object may have already been deleted.
Skip the check for snapshot inode to avoid creating orphan object.
Link: http://tracker.ceph.com/issues/18211
Signed-off-by: Yan, Zheng <zyan@redhat.com>
needed for both ext4 and xfs dax changes to use iomap for DAX. It
also includes the fscrypt branch which is needed for ubifs encryption
work as well as ext4 encryption and fscrypt cleanups.
Lots of cleanups and bug fixes, especially making sure ext4 is robust
against maliciously corrupted file systems --- especially maliciously
corrupted xattr blocks and a maliciously corrupted superblock. Also
fix ext4 support for 64k block sizes so it works well on ppcle. Fixed
mbcache so we don't miss some common xattr blocks that can be merged.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlhQQVEACgkQ8vlZVpUN
gaN9TQgAoCD+V4kJjMCFhiV8u6QR3hqD6bOZbggo5wJf4CHglWkmrbAmc3jANOgH
CKsXDRRjxuDjPXf1ukB1i4M7ArLYjkbbzKdsu7lismoJLS+w8uwUKSNdep+LYMjD
alxUcf5DCzLlUmdOdW4yE22L+CwRfqfs8IpBvKmJb7DrAKiwJVA340ys6daBGuu1
63xYx0QIyPzq0xjqLb6TVf88HUI4NiGVXmlm2wcrnYd5966hEZd/SztOZTVCVWOf
Z0Z0fGQ1WJzmaBB9+YV3aBi+BObOx4m2PUprIa531+iEW02E+ot5Xd4vVQFoV/r4
NX3XtoBrT1XlKagy2sJLMBoCavqrKw==
=j4KP
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"This merge request includes the dax-4.0-iomap-pmd branch which is
needed for both ext4 and xfs dax changes to use iomap for DAX. It also
includes the fscrypt branch which is needed for ubifs encryption work
as well as ext4 encryption and fscrypt cleanups.
Lots of cleanups and bug fixes, especially making sure ext4 is robust
against maliciously corrupted file systems --- especially maliciously
corrupted xattr blocks and a maliciously corrupted superblock. Also
fix ext4 support for 64k block sizes so it works well on ppcle. Fixed
mbcache so we don't miss some common xattr blocks that can be merged"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
dax: Fix sleep in atomic contex in grab_mapping_entry()
fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL
fscrypt: Delay bounce page pool allocation until needed
fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page()
fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page()
fscrypt: Never allocate fscrypt_ctx on in-place encryption
fscrypt: Use correct index in decrypt path.
fscrypt: move the policy flags and encryption mode definitions to uapi header
fscrypt: move non-public structures and constants to fscrypt_private.h
fscrypt: unexport fscrypt_initialize()
fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info()
fscrypto: move ioctl processing more fully into common code
fscrypto: remove unneeded Kconfig dependencies
MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches
ext4: do not perform data journaling when data is encrypted
ext4: return -ENOMEM instead of success
ext4: reject inodes with negative size
ext4: remove another test in ext4_alloc_file_blocks()
Documentation: fix description of ext4's block_validity mount option
ext4: fix checks for data=ordered and journal_async_commit options
...
This patch series contains several performance tuning patches regarding to the
IO submission flow, in addition to supporting new features such as a ZBC-base
drive and multiple devices.
It also includes some major bug fixes such as:
- checkpoint version control
- fdatasync-related roll-forward recovery routine
- memory boundary or null-pointer access in corner cases
- missing error cases
It has various minor clean-up patches as well.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJYTx44AAoJEEAUqH6CSFDSnAQP/jeYJq5Zd0bweEF5g00Ec1Qg
qNKQ57e9EHDRaDLBUmHHEaCEPRL0bw6SOUUWWqzGA07KcsIK+Yb/dGAyIcuV7WMl
PjntVbYm4yARDYBHGupdOCzFSkzr8gDalb+98jJnoGUonsftljhES9jedQ1NjAms
GFPHDNtirZM/r0bjKkYKjpqJ6FCxFxcGPfb/GtohDajIpohWfKZiemaXGTgtYR4d
iBVek16h+Hprz90ycZBY69uz0TdAwu/gb+htMVBrAdExHWvlFzgp35OIywiAB/YX
3QD/x4t2HqOBaNYiiOAY4ukVW/Yyqa/ZAzbm+m5B5CAcFYiWXMy+cMXUY9HJJ/K0
wdvi//Avtvgpp2PVZFn2pASx14vgMFylBzuNgKpP6MPdtWTEL33jT7VYs9Nuz45E
dgZ9IpiDt4DeTRuZ4mPO5iH7bVHPvAVV80bpXzirCCzDeNZ1EFFIQzXh/2UAmCxI
twPXGBIYul0aIl9JkWAyhCZSd3XDSqedpfPudknjhzM9Xb1H5X0QJco7f/UwsWXH
WxV6lHr1Q7UH96wJ7x/GAqj8ArOAASRV18+K51dqU+DWHnFPpBArJe39FVf8NGWs
Fz1ZmlWBQ0ZgzvLkGa80llhjalXIEy/JabMrpy6VrzQGxHdmW4cVxe4dJ3710WxX
VysJUcNMRKxMUTWOKsxp
=Boum
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This patch series contains several performance tuning patches
regarding to the IO submission flow, in addition to supporting new
features such as a ZBC-base drive and multiple devices.
It also includes some major bug fixes such as:
- checkpoint version control
- fdatasync-related roll-forward recovery routine
- memory boundary or null-pointer access in corner cases
- missing error cases
It has various minor clean-up patches as well"
* tag 'for-f2fs-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (66 commits)
f2fs: fix a missing size change in f2fs_setattr
f2fs: fix to access nullified flush_cmd_control pointer
f2fs: free meta pages if sanity check for ckpt is failed
f2fs: detect wrong layout
f2fs: call sync_fs when f2fs is idle
Revert "f2fs: use percpu_counter for # of dirty pages in inode"
f2fs: return AOP_WRITEPAGE_ACTIVATE for writepage
f2fs: do not activate auto_recovery for fallocated i_size
f2fs: fix to determine start_cp_addr by sbi->cur_cp_pack
f2fs: fix 32-bit build
f2fs: set ->owner for debugfs status file's file_operations
f2fs: fix incorrect free inode count in ->statfs
f2fs: drop duplicate header timer.h
f2fs: fix wrong AUTO_RECOVER condition
f2fs: do not recover i_size if it's valid
f2fs: fix fdatasync
f2fs: fix to account total free nid correctly
f2fs: fix an infinite loop when flush nodes in cp
f2fs: don't wait writeback for datas during checkpoint
f2fs: fix wrong written_valid_blocks counting
...
While fstr_real_len is only being used under if (encrypted),
gcc-6 still warns.
Fixes this false positive:
fs/ubifs/dir.c: In function 'ubifs_readdir':
fs/ubifs/dir.c:629:13: warning: 'fstr_real_len' may be used
uninitialized in this function [-Wmaybe-uninitialized]
fstr.len = fstr_real_len
Initialize fstr_real_len to make gcc happy.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Richard Weinberger <richard@nod.at>
This set fixes error reporting for dlm sockets, removes the unbound
property on the dlm callback workqueue to improve performance, and
includes a couple trivial changes.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYTwosAAoJEDgbc8f8gGmqqHUQAKj+/z+kIrMp5MJEhzriMLpP
wKIZa9bkmcm+BuLLf7EOwmaYx374HCq4oNNY7DJT0bE9rbFLwx9zgOvdoIjJFU3V
mSvhyH8FeyueNyZHZdXmA1JZCGbuCeS36cxseaeS14+ANE/cQFlOHW5ihvLAmmnR
fyV/38IjbDl33pTVf2YU5G232csicNMM8xR+1+ctrhd6CREdbY8Nf4TYVjNLHAsD
r3FsuzScv1+p1LuczEhFP/Nl0YcVpH3EzSgOY67WRSQlSMyrfdnVvJkgwSIZkhpp
XwW++ZBFq3B5Et1YgrFtTECrvMOb3hvoejtKTeTPq3tWoOvgweml1brtO8rVN85U
brdTn3blKE7oyh+0ITdENLKXsWB5+qe1afNN51qO+MZyXKCR6uct+SjSI+zelet8
jKqxP1bQCxbnvPfF/pWVGujDE4Cb6qoeCrFSoJ/VpC/JcKxxLB7p06yflY5Ztokr
yWnPiBSEz7M7+lRF/HKmJ2PZKwdZwyrrRWtCyRXPPD29kg4pG46oxjqU9iEp3R9F
hDCt/AiqQWWQuhU0RZ910h2ce1y9oSyQSAbVqfmqNYZMk6UeO+0X9+kxl5fSeIWT
bjO+LsZqz8QQG33XYADs+5dSRK9Lmh5roR6j7QKlVJUsB+RbBhkDSMArh+jSCQap
61L10OPKaN97m6TNXfVw
=4ZWQ
-----END PGP SIGNATURE-----
Merge tag 'dlm-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm fixes from David Teigland:
"This set fixes error reporting for dlm sockets, removes the unbound
property on the dlm callback workqueue to improve performance, and
includes a couple trivial changes"
* tag 'dlm-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: fix error return code in sctp_accept_from_sock()
dlm: don't specify WQ_UNBOUND for the ast callback workqueue
dlm: remove lock_sock to avoid scheduling while atomic
dlm: don't save callbacks after accept
dlm: audit and remove any unnecessary uses of module.h
dlm: make genl_ops const
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAlhOxdwACgkQNqiEXrVA
jGSLyxAAjfq81k86sv9FD+I8OQzF0rv9t4oKwmkoky/8hk2cOJzeujbo0Xm2KWf5
8nZvgQ47mJKj0qB4XZNuetRxnmlVlh9Fs162clKb9VhklFlupDfJMwGA93kmXwRu
bmGEpmAVRCtjc6T89GLor4InxGz8sj5aY1phPhKvOXbONtAEfY5Z/TLroCh9gVjt
J6jW14V8i24SDqgse65MJUtDAvNpVKtNwDlWbm7i91T6QN9IwrNoT6Yslk9xXCNH
srkbEvtkcB4kIDwub45QIcbl381ZqyJuKLD2aIkzW5vyvux9eaQZRxMGls/OZZqw
7Kxt4zIOM2A73+50hpCqlISQb5vTwl6HbwuzqYGLYuKYBXjxJEeUpPKQkHZikFn0
/rg6Ui40VY84hEHK9R/tpZX2hYkShbpci3s95PvyhGjUzo64vc+9EU7Tbs3xRJAP
5MwDT+t/7NL8VvpINJFByCpPDfk/qHOKItZ5dC7FK5VAZUHVv9BR6chlC/7kdOtw
8nZEoZIHYFPc08hHjLNQ34/8pf2uG7rjbrK8Fk4kI1dZHHC7ZAQm0uSKEYM3x3LX
EIClnW4ohDeVDrM+bzshn8BTnYcOCZHBh7AcjToAzzhtJH2DrW72Dl40F9BIWtHr
lyUmAXtFoJZlnLnlFbrcw/OWz4+uPJ2NRYa/0Yg1cnWtC7WGc90=
=99M0
-----END PGP SIGNATURE-----
Merge tag 'jfs-4.10' of git://github.com/kleikamp/linux-shaggy
Pull jfs update from David Kleikamp:
"The jfs piece of the current_time() series"
* tag 'jfs-4.10' of git://github.com/kleikamp/linux-shaggy:
fs: jfs: Replace CURRENT_TIME_SEC by current_time()
smbencrypt() points a scatterlist to the stack, which is breaks if
CONFIG_VMAP_STACK=y.
Fix it by switching to crypto_cipher_encrypt_one(). The new code
should be considerably faster as an added benefit.
This code is nearly identical to some code that Eric Biggers
suggested.
Cc: stable@vger.kernel.org # 4.9 only
Reported-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
For sync direct IO, generic_file_direct_write/generic_file_read_iter
will update file access position. Don't duplicate the update in
.direct_IO. This cause my raid array can't assemble.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
bdev->bd_contains is not stable before calling __blkdev_get().
When __blkdev_get() is called on a parition with ->bd_openers == 0
it sets
bdev->bd_contains = bdev;
which is not correct for a partition.
After a call to __blkdev_get() succeeds, ->bd_openers will be > 0
and then ->bd_contains is stable.
When FMODE_EXCL is used, blkdev_get() calls
bd_start_claiming() -> bd_prepare_to_claim() -> bd_may_claim()
This call happens before __blkdev_get() is called, so ->bd_contains
is not stable. So bd_may_claim() cannot safely use ->bd_contains.
It currently tries to use it, and this can lead to a BUG_ON().
This happens when a whole device is already open with a bd_holder (in
use by dm in my particular example) and two threads race to open a
partition of that device for the first time, one opening with O_EXCL and
one without.
The thread that doesn't use O_EXCL gets through blkdev_get() to
__blkdev_get(), gains the ->bd_mutex, and sets bdev->bd_contains = bdev;
Immediately thereafter the other thread, using FMODE_EXCL, calls
bd_start_claiming() from blkdev_get(). This should fail because the
whole device has a holder, but because bdev->bd_contains == bdev
bd_may_claim() incorrectly reports success.
This thread continues and blocks on bd_mutex.
The first thread then sets bdev->bd_contains correctly and drops the mutex.
The thread using FMODE_EXCL then continues and when it calls bd_may_claim()
again in:
BUG_ON(!bd_may_claim(bdev, whole, holder));
The BUG_ON fires.
Fix this by removing the dependency on ->bd_contains in
bd_may_claim(). As bd_may_claim() has direct access to the whole
device, it can simply test if the target bdev is the whole device.
Fixes: 6b4517a791 ("block: implement bd_claiming and claiming block")
Cc: stable@vger.kernel.org (v2.6.35+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJYT5HMAAoJELDendYovxMvhNQH/1g/3ahM4JKN8Z0SbjKBEdQm
yj2xOj6cE3l6wMSUblKjZD2DLLhpmcHT/E97Xro/lZQEfQJoMXXWWDFowMU/P1LA
mJxb7Fzq5Wr+6eGSAlIQB270MrpNi/luf+CWHMwVA3V7R3KRXwonOdGQSkISIzCd
tgIydEA3a9r2+HgeIBpZFZ4GcSrJQU75krMyl2tjD1C+jeYVd+zdoj2OnDsZQDZQ
hDWApMpNbpSBAn7JtSSdXWSTBsGH0lUECebeYPhPQ2sX2P6Y8+UCGwA7i6FFdbTa
agXfVSdRz8dCe3k19VcKDAw6nK9BTTMnEeEHmkmygIh6wuHPP44CzigTXIbJoXI=
=zjfm
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
"Xen features and fixes for 4.10
These are some fixes, a move of some arm related headers to share them
between arm and arm64 and a series introducing a helper to make code
more readable.
The most notable change is David stepping down as maintainer of the
Xen hypervisor interface. This results in me sending you the pull
requests for Xen related code from now on"
* tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (29 commits)
xen/balloon: Only mark a page as managed when it is released
xenbus: fix deadlock on writes to /proc/xen/xenbus
xen/scsifront: don't request a slot on the ring until request is ready
xen/x86: Increase xen_e820_map to E820_X_MAX possible entries
x86: Make E820_X_MAX unconditionally larger than E820MAX
xen/pci: Bubble up error and fix description.
xen: xenbus: set error code on failure
xen: set error code on failures
arm/xen: Use alloc_percpu rather than __alloc_percpu
arm/arm64: xen: Move shared architecture headers to include/xen/arm
xen/events: use xen_vcpu_id mapping for EVTCHNOP_status
xen/gntdev: Use VM_MIXEDMAP instead of VM_IO to avoid NUMA balancing
xen-scsifront: Add a missing call to kfree
MAINTAINERS: update XEN HYPERVISOR INTERFACE
xenfs: Use proc_create_mount_point() to create /proc/xen
xen-platform: use builtin_pci_driver
xen-netback: fix error handling output
xen: make use of xenbus_read_unsigned() in xenbus
xen: make use of xenbus_read_unsigned() in xen-pciback
xen: make use of xenbus_read_unsigned() in xen-fbfront
...
Here's the new driver core patches for 4.10-rc1.
Big thing here is the nice addition of "functional dependencies" to the
driver core. The idea has been talked about for a very long time, great
job to Rafael for stepping up and implementing it. It's been tested for
longer than the 4.9-rc1 date, we held off on merging it earlier in order
to feel more comfortable about it.
Other than that, it's just a handful of small other patches, some good
cleanups to the mess that is the firmware class code, and we have a test
driver for the deferred probe logic.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWFAvPQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ym3NgCgmhFeWEkp9SDt17YGGavmnzQUlBQAoJlUipJp
PHeQkq15ZWw3wWC9FEvM
=91M1
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here's the new driver core patches for 4.10-rc1.
Big thing here is the nice addition of "functional dependencies" to
the driver core. The idea has been talked about for a very long time,
great job to Rafael for stepping up and implementing it. It's been
tested for longer than the 4.9-rc1 date, we held off on merging it
earlier in order to feel more comfortable about it.
Other than that, it's just a handful of small other patches, some good
cleanups to the mess that is the firmware class code, and we have a
test driver for the deferred probe logic.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (30 commits)
firmware: Correct handling of fw_state_wait() return value
driver core: Silence device links sphinx warning
firmware: remove warning at documentation generation time
drivers: base: dma-mapping: Fix typo in dmam_alloc_non_coherent comments
driver core: test_async: fix up typo found by 0-day
firmware: move fw_state_is_done() into UHM section
firmware: do not use fw_lock for fw_state protection
firmware: drop bit ops in favor of simple state machine
firmware: refactor loading status
firmware: fix usermode helper fallback loading
driver core: firmware_class: convert to use class_groups
driver core: devcoredump: convert to use class_groups
driver core: class: add class_groups support
kernfs: Declare two local data structures static
driver-core: fix platform_no_drv_owner.cocci warnings
drivers/base/memory.c: Remove unused 'first_page' variable
driver core: add CLASS_ATTR_WO()
drivers: base: cacheinfo: support DT overrides for cache properties
drivers: base: cacheinfo: add pr_fmt logging
drivers: base: cacheinfo: fix boot error message when acpi is enabled
...
Problem statement: unprivileged user who has read-write access to more than
one btrfs subvolume may easily consume all kernel memory (eventually
triggering oom-killer).
Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):
[root@kteam1 ~]# cat prep.sh
DEV=/dev/sdb
mkfs.btrfs -f $DEV
mount $DEV /mnt
for i in `seq 1 16`
do
mkdir /mnt/$i
btrfs subvolume create /mnt/SV_$i
ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2`
mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
chmod a+rwx /mnt/$i
done
[root@kteam1 ~]# sh prep.sh
[maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done
[root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
kmalloc-128 10144 10144 128 32 1 : tunables 0 0 0 : slabdata 317 317 0
kmalloc-128 9992352 9992352 128 32 1 : tunables 0 0 0 : slabdata 312261 312261 0
kmalloc-128 24226752 24226752 128 32 1 : tunables 0 0 0 : slabdata 757086 757086 0
kmalloc-128 42754240 42754240 128 32 1 : tunables 0 0 0 : slabdata 1336070 1336070 0
The huge numbers above come from insane number of async_work-s allocated
and queued by btrfs_wq_run_delayed_node.
The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The
worker func (btrfs_async_run_delayed_root) processes at least
BTRFS_DELAYED_BATCH items (if they are present in the list). So, the machinery
works as expected while the list is almost empty. As soon as it is getting
bigger, worker func starts to process more than one item at a time, it takes
longer, and the chances to have async_works queued more than needed is getting
higher.
The problem above is worsened by another flaw of delayed-inode implementation:
if async_work was queued in a throttling branch (number of items >=
BTRFS_DELAYED_WRITEBACK), corresponding worker func won't quit until
the number of items < BTRFS_DELAYED_BACKGROUND / 2. So, it is possible that
the func occupies CPU infinitely (up to 30sec in my experiments): while the
func is trying to drain the list, the user activity may add more and more
items to the list.
The patch fixes both problems in straightforward way: refuse queuing too
many works in btrfs_wq_run_delayed_node and bail out of worker func if
at least BTRFS_DELAYED_WRITEBACK items are processed.
Changed in v2: remove support of thresh == NO_THRESHOLD.
Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v3.15+
Commit db717d8e26 ("fscrypto: move ioctl processing more fully into
common code") moved ioctl() related functions into fscrypt and offers
us now a set of helper functions.
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: David Gstir <david@sigma-star.at>
Pull block layer updates from Jens Axboe:
"This is the main block pull request this series. Contrary to previous
release, I've kept the core and driver changes in the same branch. We
always ended up having dependencies between the two for obvious
reasons, so makes more sense to keep them together. That said, I'll
probably try and keep more topical branches going forward, especially
for cycles that end up being as busy as this one.
The major parts of this pull request is:
- Improved support for O_DIRECT on block devices, with a small
private implementation instead of using the pig that is
fs/direct-io.c. From Christoph.
- Request completion tracking in a scalable fashion. This is utilized
by two components in this pull, the new hybrid polling and the
writeback queue throttling code.
- Improved support for polling with O_DIRECT, adding a hybrid mode
that combines pure polling with an initial sleep. From me.
- Support for automatic throttling of writeback queues on the block
side. This uses feedback from the device completion latencies to
scale the queue on the block side up or down. From me.
- Support from SMR drives in the block layer and for SD. From Hannes
and Shaun.
- Multi-connection support for nbd. From Josef.
- Cleanup of request and bio flags, so we have a clear split between
which are bio (or rq) private, and which ones are shared. From
Christoph.
- A set of patches from Bart, that improve how we handle queue
stopping and starting in blk-mq.
- Support for WRITE_ZEROES from Chaitanya.
- Lightnvm updates from Javier/Matias.
- Supoort for FC for the nvme-over-fabrics code. From James Smart.
- A bunch of fixes from a whole slew of people, too many to name
here"
* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
blk-stat: fix a few cases of missing batch flushing
blk-flush: run the queue when inserting blk-mq flush
elevator: make the rqhash helpers exported
blk-mq: abstract out blk_mq_dispatch_rq_list() helper
blk-mq: add blk_mq_start_stopped_hw_queue()
block: improve handling of the magic discard payload
blk-wbt: don't throttle discard or write zeroes
nbd: use dev_err_ratelimited in io path
nbd: reset the setup task for NBD_CLEAR_SOCK
nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
nvme-fabrics: Add target support for FC transport
nvme-fabrics: Add host support for FC transport
nvme-fabrics: Add FC transport LLDD api definitions
nvme-fabrics: Add FC transport FC-NVME definitions
nvme-fabrics: Add FC transport error codes to nvme.h
Add type 0x28 NVME type code to scsi fc headers
nvme-fabrics: patch target code in prep for FC transport support
nvme-fabrics: set sqe.command_id in core not transports
parser: add u64 number parser
nvme-rdma: align to generic ib_event logging helper
...
- Add additional checks for bad platform data
- Remove bounce buffer in console writer
- Protect read/unlink race with a mutex
- Correctly give up during dump locking failures
- Increase ftrace bandwidth by splitting ftrace buffers per CPU
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJYSJxYAAoJEIly9N/cbcAmYBsQAIAmHDgk3ootLQhyatZ9H2X0
Nyl24xA7UCPaz13ddF1tUaItI4mYBWfY4gde+3fIVXDitgmFxZZqb8YV68CvFgUt
Hb8tlTiM0F2z/muGBIgJ5TN5XiB4dO0WgvcKvnQdzyNGPVlAXvowHPkaM9X+iEA1
y4U2Le7iK9+9fvkH7RM4O3hMiTmpKeUITYTWo1Y8n9LaZo3w5+pqhS+TPu75uyD0
pLb53EOzZmg1nu9hcac5t4G5W1Lr4ji2EekDXemi/571HAzQnMXxJWc6ZVYLDNfP
W4D0UGcHAERDzrYwWcGn8HIThYlpbnVw9atSTTodJTiIubtsRt4haycUH1hqMS5o
4R2myhbAoM0A3zYBqrhwtQHg8apNes2hOR2WycAqgvylZZl1o6zaEs9zc7aafYuy
N/M0x5tlya3fOgkvkJsmERT5jtqDVMhtBZ2xa8NYfJCHgULaUmjEx25eTr1kF3nW
ERIX/3IayMvqHwYptP9dOzy2owLpXC8yZlM34AeM+ub93hHj1ELLfG7aN0bklD/+
wfmIX8HpOA2XGWflOk5fiHLHro6pwRU9zOIIHFJ4Tf60PMoN+rjRfej1fjz+KOhO
gxUYaCb+/4BlCqLqdFvF54qhQO2qmVuOAg/1BLu+hnZtXSyhVJxePthSs5shyoE8
owL8rVXDGapjF1xO6WCR
=UmFL
-----END PGP SIGNATURE-----
Merge tag 'pstore-v4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore updates from Kees Cook:
"Improvements and fixes to pstore subsystem:
- add additional checks for bad platform data
- remove bounce buffer in console writer
- protect read/unlink race with a mutex
- correctly give up during dump locking failures
- increase ftrace bandwidth by splitting ftrace buffers per CPU"
* tag 'pstore-v4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
ramoops: add pdata NULL check to ramoops_probe
pstore: Convert console write to use ->write_buf
pstore: Protect unlink with read_mutex
pstore: Use global ftrace filters for function trace filtering
ftrace: Provide API to use global filtering for ftrace ops
pstore: Clarify context field przs as dprzs
pstore: improve error report for failed setup
pstore: Merge per-CPU ftrace records into one
pstore: Add ftrace timestamp counter
ramoops: Split ftrace buffer space into per-CPU zones
pstore: Make ramoops_init_przs generic for other prz arrays
pstore: Allow prz to control need for locking
pstore: Warn on PSTORE_TYPE_PMSG using deprecated function
pstore: Make spinlock per zone instead of global
pstore: Actually give up during locking failure
Patches queued up by Filipe:
The most important change is still the fix for the extent tree
corruption that happens due to balance when qgroups are enabled (a
regression introduced in 4.7 by a fix for a regression from the last
qgroups rework). This has been hitting SLE and openSUSE users and QA
very badly, where transactions keep getting aborted when running
delayed references leaving the root filesystem in RO mode and nearly
unusable. There are fixes here that allow us to run xfstests again
with the integrity checker enabled, which has been impossible since 4.8
(apparently I'm the only one running xfstests with the integrity
checker enabled, which is useful to validate dirtied leafs, like
checking if there are keys out of order, etc). The rest are just some
trivial fixes, most of them tagged for stable, and two cleanups.
Signed-off-by: Chris Mason <clm@fb.com>
fsnotify_unmount_inodes() plays complex tricks to pin next inode in the
sb->s_inodes list when iterating over all inodes. Furthermore the code has a
bug that if the current inode is the last on i_sb_list that does not have e.g.
I_FREEING set, then we leave next_i pointing to inode which may get removed
from the i_sb_list once we drop s_inode_list_lock thus resulting in
use-after-free issues (usually manifesting as infinite looping in
fsnotify_unmount_inodes()).
Fix the problem by keeping current inode pinned somewhat longer. Then we can
make the code much simpler and standard.
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
It's another busy cycle for the docs tree, as the sphinx conversion
continues. Highlights include:
- Further work on PDF output, which remains a bit of a pain but should be
more solid now.
- Five more DocBook template files converted to Sphinx. Only 27 to go...
Lots of plain-text files have also been converted and integrated.
- Images in binary formats have been replaced with more source-friendly
versions.
- Various bits of organizational work, including the renaming of various
files discussed at the kernel summit.
- New documentation for the device_link mechanism.
...and, of course, lots of typo fixes and small updates.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYTbl7AAoJEI3ONVYwIuV63NIP/REwzThnGWFJMRSuq8Ieq2r9
sFSQsaGTGlhyKiDoEooo+SO/Za3uTonjK+e7WZg8mhdiEdamta5aociU/71C1Yy/
T9ur0FhcGblrvZ1NidSDvCLwuECZOMMei7mgLZ9a+KCpc4ANqqTVZSUm1blKcqhF
XelhVXxBa0ar35l/pVzyCxkdNXRWXv+MJZE8hp5XAdTdr11DS7UY9zrZdH31axtf
BZlbYJrvB8WPydU6myTjRpirA17Hu7uU64MsL3bNIEiRQ+nVghEzQC8uxeUCvfVx
r0H5AgGGQeir+e8GEv2T20SPZ+dumXs+y/HehKNb3jS3gV0mo+pKPeUhwLIxr+Zh
QY64gf+jYf5ISHwAJRnU0Ima72ehObzSbx9Dko10nhq2OvbR5f83gjz9t9jKYFU7
RDowICA8lwqyRbHRoVfyoW8CpVhWFpMFu3yNeJMckeTish3m7ANqzaWslbsqIP5G
zxgFMIrVVSbeae+sUeygtEJAnWI09aZ4tuaUXYtGWwu6ikC/3aV6DryP4bthG2LF
A19uV4nMrLuuh8g2wiTHHjMfjYRwvSn+f9yaolwJhwyNDXQzRPy+ZJ3W/6olOkXC
bAxTmVRCW5GA/fmSrfXmW1KbnxlWfP2C62hzZQ09UHxzTHdR97oFLDQdZhKo1uwf
pmSJR0hVeRUmA4uw6+Su
=A0EV
-----END PGP SIGNATURE-----
Merge tag 'docs-4.10' of git://git.lwn.net/linux
Pull documentation update from Jonathan Corbet:
"These are the documentation changes for 4.10.
It's another busy cycle for the docs tree, as the sphinx conversion
continues. Highlights include:
- Further work on PDF output, which remains a bit of a pain but
should be more solid now.
- Five more DocBook template files converted to Sphinx. Only 27 to
go... Lots of plain-text files have also been converted and
integrated.
- Images in binary formats have been replaced with more
source-friendly versions.
- Various bits of organizational work, including the renaming of
various files discussed at the kernel summit.
- New documentation for the device_link mechanism.
... and, of course, lots of typo fixes and small updates"
* tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits)
dma-buf: Extract dma-buf.rst
Update Documentation/00-INDEX
docs: 00-INDEX: document directories/files with no docs
docs: 00-INDEX: remove non-existing entries
docs: 00-INDEX: add missing entries for documentation files/dirs
docs: 00-INDEX: consolidate process/ and admin-guide/ description
scripts: add a script to check if Documentation/00-INDEX is sane
Docs: change sh -> awk in REPORTING-BUGS
Documentation/core-api/device_link: Add initial documentation
core-api: remove an unexpected unident
ppc/idle: Add documentation for powersave=off
Doc: Correct typo, "Introdution" => "Introduction"
Documentation/atomic_ops.txt: convert to ReST markup
Documentation/local_ops.txt: convert to ReST markup
Documentation/assoc_array.txt: convert to ReST markup
docs-rst: parse-headers.pl: cleanup the documentation
docs-rst: fix media cleandocs target
docs-rst: media/Makefile: reorganize the rules
docs-rst: media: build SVG from graphviz files
docs-rst: replace bayer.png by a SVG image
...
Merge updates from Andrew Morton:
- various misc bits
- most of MM (quite a lot of MM material is awaiting the merge of
linux-next dependencies)
- kasan
- printk updates
- procfs updates
- MAINTAINERS
- /lib updates
- checkpatch updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (123 commits)
init: reduce rootwait polling interval time to 5ms
binfmt_elf: use vmalloc() for allocation of vma_filesz
checkpatch: don't emit unified-diff error for rename-only patches
checkpatch: don't check c99 types like uint8_t under tools
checkpatch: avoid multiple line dereferences
checkpatch: don't check .pl files, improve absolute path commit log test
scripts/checkpatch.pl: fix spelling
checkpatch: don't try to get maintained status when --no-tree is given
lib/ida: document locking requirements a bit better
lib/rbtree.c: fix typo in comment of ____rb_erase_color
lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM
MAINTAINERS: add drm and drm/i915 irc channels
MAINTAINERS: add "C:" for URI for chat where developers hang out
MAINTAINERS: add drm and drm/i915 bug filing info
MAINTAINERS: add "B:" for URI where to file bugs
get_maintainer: look for arbitrary letter prefixes in sections
printk: add Kconfig option to set default console loglevel
printk/sound: handle more message headers
printk/btrfs: handle more message headers
printk/kdb: handle more message headers
...
Pull timer updates from Thomas Gleixner:
"The time/timekeeping/timer folks deliver with this update:
- Fix a reintroduced signed/unsigned issue and cleanup the whole
signed/unsigned mess in the timekeeping core so this wont happen
accidentaly again.
- Add a new trace clock based on boot time
- Prevent injection of random sleep times when PM tracing abuses the
RTC for storage
- Make posix timers configurable for real tiny systems
- Add tracepoints for the alarm timer subsystem so timer based
suspend wakeups can be instrumented
- The usual pile of fixes and updates to core and drivers"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
timekeeping: Use mul_u64_u32_shr() instead of open coding it
timekeeping: Get rid of pointless typecasts
timekeeping: Make the conversion call chain consistently unsigned
timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion
alarmtimer: Add tracepoints for alarm timers
trace: Update documentation for mono, mono_raw and boot clock
trace: Add an option for boot clock as trace clock
timekeeping: Add a fast and NMI safe boot clock
timekeeping/clocksource_cyc2ns: Document intended range limitation
timekeeping: Ignore the bogus sleep time if pm_trace is enabled
selftests/timers: Fix spelling mistake "Asyncrhonous" -> "Asynchronous"
clocksource/drivers/bcm2835_timer: Unmap region obtained by of_iomap
clocksource/drivers/arm_arch_timer: Map frame with of_io_request_and_map()
arm64: dts: rockchip: Arch counter doesn't tick in system suspend
clocksource/drivers/arm_arch_timer: Don't assume clock runs in suspend
posix-timers: Make them configurable
posix_cpu_timers: Move the add_device_randomness() call to a proper place
timer: Move sys_alarm from timer.c to itimer.c
ptp_clock: Allow for it to be optional
Kconfig: Regenerate *.c_shipped files after previous changes
...
Pull smp hotplug updates from Thomas Gleixner:
"This is the final round of converting the notifier mess to the state
machine. The removal of the notifiers and the related infrastructure
will happen around rc1, as there are conversions outstanding in other
trees.
The whole exercise removed about 2000 lines of code in total and in
course of the conversion several dozen bugs got fixed. The new
mechanism allows to test almost every hotplug step standalone, so
usage sites can exercise all transitions extensively.
There is more room for improvement, like integrating all the
pointlessly different architecture mechanisms of synchronizing,
setting cpus online etc into the core code"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
tracing/rb: Init the CPU mask on allocation
soc/fsl/qbman: Convert to hotplug state machine
soc/fsl/qbman: Convert to hotplug state machine
zram: Convert to hotplug state machine
KVM/PPC/Book3S HV: Convert to hotplug state machine
arm64/cpuinfo: Convert to hotplug state machine
arm64/cpuinfo: Make hotplug notifier symmetric
mm/compaction: Convert to hotplug state machine
iommu/vt-d: Convert to hotplug state machine
mm/zswap: Convert pool to hotplug state machine
mm/zswap: Convert dst-mem to hotplug state machine
mm/zsmalloc: Convert to hotplug state machine
mm/vmstat: Convert to hotplug state machine
mm/vmstat: Avoid on each online CPU loops
mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
tracing/rb: Convert to hotplug state machine
oprofile/nmi timer: Convert to hotplug state machine
net/iucv: Use explicit clean up labels in iucv_init()
x86/pci/amd-bus: Convert to hotplug state machine
x86/oprofile/nmi: Convert to hotplug state machine
...
We have observed page allocations failures of order 4 during core dump
while trying to allocate vma_filesz. This results in a useless core
file of size 0. To improve reliability use vmalloc().
Note that the vmalloc() allocation is bounded by sysctl_max_map_count,
which is 65,530 by default. So with a 4k page size, and 8 bytes per
seg, this is a max of 128 pages or an order 7 allocation. Other parts
of the core dump path, such as fill_files_note() are already using
vmalloc() for presumably similar reasons.
Link: http://lkml.kernel.org/r/1479745791-17611-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 4bcc595ccd ("printk: reinstate KERN_CONT for printing
continuation lines") allows to define more message headers for a single
message. The motivation is that continuous lines might get mixed.
Therefore it make sense to define the right log level for every piece of
a cont line.
The current btrfs_printk() macros do not support continuous lines at the
moment. But better be prepared for a custom messages and avoid
potential "lvl" buffer overflow.
This patch iterates over the entire message header. It is interested
only into the message level like the original code.
This patch also introduces PRINTK_MAX_SINGLE_HEADER_LEN. Three bytes
are enough for the message level header at the moment. But it used to
be three, see the commit 04d2c8c83d ("printk: convert the format for
KERN_<LEVEL> to a 2 byte pattern").
Also I fixed the default ratelimit level. It looked very strange when it
was different from the default log level.
[pmladek@suse.com: Fix a check of the valid message level]
Link: http://lkml.kernel.org/r/20161111183236.GD2145@dhcp128.suse.cz
Link: http://lkml.kernel.org/r/1478695291-12169-4-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Joe Perches <joe@perches.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Runtime nlink calculation works but meh. I don't know how to do it at
compile time, but I know how to do it at init time.
Shift "2+" part into init time as a bonus.
Link: http://lkml.kernel.org/r/20161122195549.GB29812@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Comparison for "<" works equally well as comparison for "<=" but one
SUB/LEA is saved (no, it is not optimised away, at least here).
Link: http://lkml.kernel.org/r/20161122195143.GA29812@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
format_decode and vsnprintf occasionally show up in perf top, so I went
looking for places that might not need the full printf power. With the
help of kprobes, I gathered some statistics on which format strings we
mostly pass to vsnprintf. On a trivial desktop workload, I hit "%x" 25%
of the time, so something apparently reads /proc/pid/status (which does
5*16 printf("%x") calls) a lot.
With this patch, reading /proc/pid/status is 30% faster according to
this microbenchmark:
char buf[4096];
int i, fd;
for (i = 0; i < 10000; ++i) {
fd = open("/proc/self/status", O_RDONLY);
read(fd, buf, sizeof(buf));
close(fd);
}
Link: http://lkml.kernel.org/r/1474410485-1305-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Andrei Vagin <avagin@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some comments were obsoleted since commit 05c0ae21c0 ("try a saner
locking for pde_opener...").
Some new comments added.
Some confusing comments replaced with equally confusing ones.
Link: http://lkml.kernel.org/r/20161029160231.GD1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kzalloc is too much, half of the fields will be reinitialized anyway.
If proc file doesn't have ->release hook (some still do not), clearing
is unnecessary because it will be freed immediately.
Link: http://lkml.kernel.org/r/20161029155747.GC1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
list_del_init() is too much, structure will be freed in three lines
anyway.
Link: http://lkml.kernel.org/r/20161029155313.GA1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linux doesn't support 4GB+ filenames in /proc, so unsigned long is too
much.
MOV r64, r/m64 is larger than MOV r32, r/m32.
Link: http://lkml.kernel.org/r/20161029161123.GG1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"unsigned int" is better on x86_64 because it most of the time it
autoexpands to 64-bit value while "int" requires MOVSX instruction.
Link: http://lkml.kernel.org/r/20161029160810.GF1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to being able to examine if a process has been correctly
confined with seccomp, the state of no_new_privs is equally interesting,
so this adds it to /proc/$pid/status.
Link: http://lkml.kernel.org/r/20161103214041.GA58566@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jann Horn <jann@thejh.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rodrigo Freire <rfreire@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Robert Ho <robert.hu@intel.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Richard W.M. Jones" <rjones@redhat.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The other pagetable walks in task_mmu.c have a cond_resched() after
walking their ptes: add a cond_resched() in gather_pte_stats() too, for
reading /proc/<id>/numa_maps. Only pagemap_pmd_range() has a
cond_resched() in its (unusually expensive) pmd_trans_huge case: more
should probably be added, but leave them unchanged for now.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052157400.13021@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Support handing __radix_tree_replace() a callback that gets invoked for
all leaf nodes that change or get freed as a result of the slot
replacement, to assist users tracking nodes with node->private_list.
This prepares for putting page cache shadow entries into the radix tree
root again and drastically simplifying the shadow tracking.
Link: http://lkml.kernel.org/r/20161117193134.GD23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The bug in khugepaged fixed earlier in this series shows that radix tree
slot replacement is fragile; and it will become more so when not only
NULL<->!NULL transitions need to be caught but transitions from and to
exceptional entries as well. We need checks.
Re-implement radix_tree_replace_slot() on top of the sanity-checked
__radix_tree_replace(). This requires existing callers to also pass the
radix tree root, but it'll warn us when somebody replaces slots with
contents that need proper accounting (transitions between NULL entries,
real entries, exceptional entries) and where a replacement through the
slot pointer would corrupt the radix tree node counts.
Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The way the page cache is sneaking shadow entries of evicted pages into
the radix tree past the node entry accounting and tracking them manually
in the upper bits of node->count is fraught with problems.
These shadow entries are marked in the tree as exceptional entries,
which are a native concept to the radix tree. Maintain an explicit
counter of exceptional entries in the radix tree node. Subsequent
patches will switch shadow entry tracking over to that counter.
DAX and shmem are the other users of exceptional entries. Since slot
replacements that change the entry type from regular to exceptional must
now be accounted, introduce a __radix_tree_replace() function that does
replacement and accounting, and switch DAX and shmem over.
The increase in radix tree node size is temporary. A followup patch
switches the shadow tracking to this new scheme and we'll no longer need
the upper bits in node->count and shrink that back to one byte.
Link: http://lkml.kernel.org/r/20161117192945.GA23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CURRENT_TIME is not y2038 safe.
Use y2038 safe ktime_get_real_seconds() here for timestamps. struct
heartbeat_block's hb_seq and deletetion time are already 64 bits wide
and accommodate times beyond y2038.
Also use y2038 safe ktime_get_real_ts64() for on disk inode timestamps.
These are also wide enough to accommodate time64_t.
Link: http://lkml.kernel.org/r/1475365298-29236-1-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct timespec is not y2038 safe. Use time64_t which is y2038 safe to
represent orphan scan times. time64_t is sufficient here as only the
seconds delta times are relevant.
Also use appropriate time functions that return time in time64_t format.
Time functions now return monotonic time instead of real time as only
delta scan times are relevant and these values are not persistent across
reboots.
The format string for the debug print is still using long as this is
only the time elapsed since the last scan and long is sufficient to
represent this value.
Link: http://lkml.kernel.org/r/1475365138-20567-1-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ocfs2_lock_refcount_tree, if ocfs2_read_refcount_block() returns an
error, we do ocfs2_refcount_tree_put twice (once in
ocfs2_unlock_refcount_tree and once outside it), thereby reducing the
refcount of the refcount tree twice, but we dont delete the tree in this
case. This will make refcnt of the tree = 0 and the
ocfs2_refcount_tree_put will eventually call ocfs2_mark_lockres_freeing,
setting OCFS2_LOCK_FREEING for the refcount_tree->rf_lockres.
The error returned by ocfs2_read_refcount_block is propagated all the
way back and for next iteration of write, ocfs2_lock_refcount_tree gets
the same tree back from ocfs2_get_refcount_tree because we havent
deleted the tree. Now we have the same tree, but OCFS2_LOCK_FREEING is
set for rf_lockres and eventually, when _ocfs2_lock_refcount_tree is
called in this iteration, BUG_ON( __ocfs2_cluster_lock:1395 ERROR:
Cluster lock called on freeing lockres T00000000000000000386019775b08d!
flags 0x81) is triggerred.
Call stack:
(loop16,11155,0):ocfs2_lock_refcount_tree:482 ERROR: status = -5
(loop16,11155,0):ocfs2_refcount_cow_hunk:3497 ERROR: status = -5
(loop16,11155,0):ocfs2_refcount_cow:3560 ERROR: status = -5
(loop16,11155,0):ocfs2_prepare_inode_for_refcount:2111 ERROR: status = -5
(loop16,11155,0):ocfs2_prepare_inode_for_write:2190 ERROR: status = -5
(loop16,11155,0):ocfs2_file_write_iter:2331 ERROR: status = -5
(loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: bug expression:
lockres->l_flags & OCFS2_LOCK_FREEING
(loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: Cluster lock called on
freeing lockres T00000000000000000386019775b08d! flags 0x81
kernel BUG at fs/ocfs2/dlmglue.c:1395!
invalid opcode: 0000 [#1] SMP CPU 0
Modules linked in: tun ocfs2 jbd2 xen_blkback xen_netback xen_gntdev .. sd_mod crc_t10dif ext3 jbd mbcache
RIP: __ocfs2_cluster_lock+0x31c/0x740 [ocfs2]
RSP: e02b:ffff88017c0138a0 EFLAGS: 00010086
Process loop16 (pid: 11155, threadinfo ffff88017c010000, task ffff8801b5374300)
Call Trace:
ocfs2_refcount_lock+0xae/0x130 [ocfs2]
__ocfs2_lock_refcount_tree+0x29/0xe0 [ocfs2]
ocfs2_lock_refcount_tree+0xdd/0x320 [ocfs2]
ocfs2_refcount_cow_hunk+0x1cb/0x440 [ocfs2]
ocfs2_refcount_cow+0xa9/0x1d0 [ocfs2]
ocfs2_prepare_inode_for_refcount+0x115/0x200 [ocfs2]
ocfs2_prepare_inode_for_write+0x33b/0x470 [ocfs2]
ocfs2_file_write_iter+0x220/0x8c0 [ocfs2]
aio_write_iter+0x2e/0x30
Fix this by avoiding the second call to ocfs2_refcount_tree_put()
Link: http://lkml.kernel.org/r/1473984404-32011-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Eric Ren <zren@suse.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'page' parameter in ocfs2_write_end_nolock() is never used.
Link: http://lkml.kernel.org/r/582FD91A.5000902@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When 'dispatch_assert' is set, 'response' must be DLM_MASTER_RESP_YES,
and 'res' won't be null, so execution can't reach these two branch.
Link: http://lkml.kernel.org/r/58174C91.3040004@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable `set_maybe' is redundant when the mle has been found in the
map. So it is ok to set the node_idx into mle's maybe_map directly.
Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4A3D490DD@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The value of 'stage' must be between 1 and 2, so the switch can't reach
the default case.
Link: http://lkml.kernel.org/r/57FB5EB2.7050002@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 642261ac99: "dax: add struct iomap based DAX PMD support" has
introduced unmapping of page tables if huge page needs to be split in
grab_mapping_entry(). However the unmapping happens after
radix_tree_preload() call which disables preemption and thus
unmap_mapping_range() tries to acquire i_mmap_lock in atomic context
which is a bug. Fix the problem by moving unmapping before
radix_tree_preload() call.
Fixes: 642261ac99
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add a flags parameter to send_cap_msg, so we can request expedited
service from the MDS when we know we'll be waiting on the result.
Set that flag in the case of try_flush_caps. The callers of that
function generally wait synchronously on the result, so it's beneficial
to ask the server to expedite it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
The userland ceph has MClientCaps at struct version 10. This brings the
kernel up the same version.
For now, all of the the new stuff is set to default values including
the flags field, which will be conditionally set in a later patch.
Note that we don't need to set the change_attr and btime to anything
since we aren't currently setting the feature flag. The MDS should
ignore those values.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
When we get to this many arguments, it's hard to work with positional
parameters. send_cap_msg is already at 25 arguments, with more needed.
Define a new args structure and pass a pointer to it to send_cap_msg.
Eventually it might make sense to embed one of these inside
ceph_cap_snap instead of tracking individual fields.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Just for clarity. This part is inside the header, so it makes sense to
group it with the rest of the stuff in the header.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Dirty snapshot data needs to be flushed unconditionally. If they
were created before truncation, writeback should use old truncate
size/seq.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
When iov_iter type is ITER_PIPE, copy_page_to_iter() increases
the page's reference and add the page to a pipe_buffer. It also
set the pipe_buffer's ops to page_cache_pipe_buf_ops. The comfirm
callback in page_cache_pipe_buf_ops expects the page is from page
cache and uptodate, otherwise it return error.
For ceph_sync_read() case, pages are not from page cache. So we
can't call copy_page_to_iter() when iov_iter type is ITER_PIPE.
The fix is using iov_iter_get_pages_alloc() to allocate pages
for the pipe. (the code is similar to default_file_splice_read)
Signed-off-by: Yan, Zheng <zyan@redhat.com>
For readahead/fadvise cases, caller of ceph_readpages does not
hold buffer capability. Pages can be added to page cache while
there is no buffer capability. This can cause data integrity
issue.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
try_get_cap_refs can be used as a condition in a wait_event* calls.
This is all fine until it has to call __ceph_do_pending_vmtruncate,
which in turn acquires the i_truncate_mutex. This leads to a situation
in which a task's state is !TASK_RUNNING and at the same time it's
trying to acquire a sleeping primitive. In essence a nested sleeping
primitives are being used. This causes the following warning:
WARNING: CPU: 22 PID: 11064 at kernel/sched/core.c:7631 __might_sleep+0x9f/0xb0()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109447d>] prepare_to_wait_event+0x5d/0x110
ipmi_msghandler tcp_scalable ib_qib dca ib_mad ib_core ib_addr ipv6
CPU: 22 PID: 11064 Comm: fs_checker.pl Tainted: G O 4.4.20-clouder2 #6
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1a 10/16/2015
0000000000000000 ffff8838b416fa88 ffffffff812f4409 ffff8838b416fad0
ffffffff81a034f2 ffff8838b416fac0 ffffffff81052b46 ffffffff81a0432c
0000000000000061 0000000000000000 0000000000000000 ffff88167bda54a0
Call Trace:
[<ffffffff812f4409>] dump_stack+0x67/0x9e
[<ffffffff81052b46>] warn_slowpath_common+0x86/0xc0
[<ffffffff81052bcc>] warn_slowpath_fmt+0x4c/0x50
[<ffffffff8109447d>] ? prepare_to_wait_event+0x5d/0x110
[<ffffffff8109447d>] ? prepare_to_wait_event+0x5d/0x110
[<ffffffff8107767f>] __might_sleep+0x9f/0xb0
[<ffffffff81612d30>] mutex_lock+0x20/0x40
[<ffffffffa04eea14>] __ceph_do_pending_vmtruncate+0x44/0x1a0 [ceph]
[<ffffffffa04fa692>] try_get_cap_refs+0xa2/0x320 [ceph]
[<ffffffffa04fd6f5>] ceph_get_caps+0x255/0x2b0 [ceph]
[<ffffffff81094370>] ? wait_woken+0xb0/0xb0
[<ffffffffa04f2c11>] ceph_write_iter+0x2b1/0xde0 [ceph]
[<ffffffff81613f22>] ? schedule_timeout+0x202/0x260
[<ffffffff8117f01a>] ? kmem_cache_free+0x1ea/0x200
[<ffffffff811b46ce>] ? iput+0x9e/0x230
[<ffffffff81077632>] ? __might_sleep+0x52/0xb0
[<ffffffff81156147>] ? __might_fault+0x37/0x40
[<ffffffff8119e123>] ? cp_new_stat+0x153/0x170
[<ffffffff81198cfa>] __vfs_write+0xaa/0xe0
[<ffffffff81199369>] vfs_write+0xa9/0x190
[<ffffffff811b6d01>] ? set_close_on_exec+0x31/0x70
[<ffffffff8119a056>] SyS_write+0x46/0xa0
This happens since wait_event_interruptible can interfere with the
mutex locking code, since they both fiddle with the task state.
Fix the issue by using the newly-added nested blocking infrastructure
in 61ada528de ("sched/wait: Provide infrastructure to deal with
nested blocking")
Link: https://lwn.net/Articles/628628/
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
The length of the reply is protocol-dependent - for cephx it's
ceph_x_authorize_reply. Nothing sensible can be passed from the
messenger layer anyway.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Starting with version 5 the following properties change:
- UBIFS_FLG_DOUBLE_HASH is mandatory
- UBIFS_FLG_ENCRYPTION is optional but depdens on UBIFS_FLG_DOUBLE_HASH
- Filesystems with unknown super block flags will be rejected, this
allows us in future to add new features without raising the UBIFS
write version.
Signed-off-by: Richard Weinberger <richard@nod.at>
This feature flag indicates that all directory entry nodes have a 32bit
cookie set and therefore UBIFS is allowed to perform lookups by hash.
Signed-off-by: Richard Weinberger <richard@nod.at>
UBIFS stores a 32bit hash of every file, for traditional lookups by name
this scheme is fine since UBIFS can first try to find the file by the
hash of the filename and upon collisions it can walk through all entries
with the same hash and do a string compare.
When filesnames are encrypted fscrypto will ask the filesystem for a
unique cookie, based on this cookie the filesystem has to be able to
locate the target file again. With 32bit hashes this is impossible
because the chance for collisions is very high. Do deal with that we
store a 32bit cookie directly in the UBIFS directory entry node such
that we get a 64bit cookie (32bit from filename hash and the dent
cookie). For a lookup by hash UBIFS finds the entry by the first 32bit
and then compares the dent cookie. If it does not match, it has to do a
linear search of the whole directory and compares all dent cookies until
the correct entry is found.
Signed-off-by: Richard Weinberger <richard@nod.at>
As of now all filenames known by UBIFS are strings with a NUL
terminator. With encrypted filenames a filename can be any binary
string and the r5 function cannot search for the NUL terminator.
UBIFS always knows how long a filename is, therefore we can change
the hash function to iterate over the filename length to work
correctly with binary strings.
Signed-off-by: Richard Weinberger <richard@nod.at>
When data of a data node is compressed and encrypted
we need to store the size of the compressed data because
before encryption we may have to add padding bytes.
For the new field we consume the last two padding bytes
in struct ubifs_data_node. Two bytes are fine because
the data length is at most 4096.
Signed-off-by: Richard Weinberger <richard@nod.at>
When we're creating a new inode in UBIFS the inode is not
yet exposed and fscrypto calls ubifs_xattr_set() without
holding the inode mutex. This is okay but ubifs_xattr_set()
has to know about this.
Signed-off-by: Richard Weinberger <richard@nod.at>
When a file is moved or linked into another directory
its current crypto policy has to be compatible with the
target policy.
Signed-off-by: Richard Weinberger <richard@nod.at>
We need ->open() for files to load the crypto key.
If the no key is present and the file is encrypted,
refuse to open.
Signed-off-by: Richard Weinberger <richard@nod.at>
We need the ->open() hook to load the crypto context
which is needed for all crypto operations within that
directory.
Signed-off-by: Richard Weinberger <richard@nod.at>
fscrypto will need this function too. Also get struct ubifs_info
from the provided inode. Not all callers will have a reference to
struct ubifs_info.
Signed-off-by: Richard Weinberger <richard@nod.at>
'ubifs_fast_find_freeable()' can not return an error pointer, so this test
can be removed.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Richard Weinberger <richard@nod.at>
Right now wbuf timer has hardcoded timeouts and there is no place for
manual adjustments. Some projects / cases many need that though. Few
file systems allow doing that by respecting dirty_writeback_interval
that can be set using sysctl (dirty_writeback_centisecs).
Lowering dirty_writeback_interval could be some way of dealing with user
space apps lacking proper fsyncs. This is definitely *not* a perfect
solution but we don't have ideal (user space) world. There were already
advanced discussions on this matter, mostly when ext4 was introduced and
it wasn't behaving as ext3. Anyway, the final decision was to add some
hacks to the ext4, as trying to fix whole user space or adding new API
was pointless.
We can't (and shouldn't?) just follow ext4. We can't e.g. sync on close
as this would cause too many commits and flash wearing. On the other
hand we still should allow some trade-off between -o sync and default
wbuf timeout. Respecting dirty_writeback_interval should allow some sane
cutomizations if used warily.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Reviewed-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Values of these fields are set during init and never modified. They are
used (read) in a single function only. There isn't really any reason to
keep them in a struct. It only makes struct just a bit bigger without
any visible gain.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Reviewed-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
This patch fix a missing size change in f2fs_setattr
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The "perf_counter_reset" case has already been handled above.
Moreover "ORANGEFS_PARAM_REQUEST_OP_READAHEAD_COUNT_SIZE" is not a really
consistent.
It is likely that this (dead) code is a cut and paste left over.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
allocates string 'new' is not free'd on the exit path when
cdm_element_count <= 0. Fix this by kfree'ing it.
Fixes CoverityScan CID#1375923 "Resource Leak"
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This is exposing an existing deadlock between fsync and AIO. Until we
have the deadlock fixed, I'm pulling this one out.
This reverts commit a23eaa875f.
Signed-off-by: Chris Mason <clm@fb.com>
... to better explain its purpose after introducing in-place encryption
without bounce buffer.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since fscrypt users can now indicated if fscrypt_encrypt_page() should
use a bounce page, we can delay the bounce page pool initialization util
it is really needed. That is until fscrypt_operations has no
FS_CFLG_OWN_PAGES flag set.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Rename the FS_CFLG_INPLACE_ENCRYPTION flag to FS_CFLG_OWN_PAGES which,
when set, indicates that the fs uses pages under its own control as
opposed to writeback pages which require locking and a bounce buffer for
encryption.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In case of in-place encryption fscrypt_ctx was allocated but never
released. Since we don't need it for in-place encryption, we skip
allocating it.
Fixes: 1c7dcf69ee ("fscrypt: Add in-place encryption mode")
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Actually use the fs-provided index instead of always using page->index
which is only set for page-cache pages.
Fixes: 9c4bb8a3a9 ("fscrypt: Let fs select encryption index/tweak")
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The fscrypt_initalize() function isn't used outside fs/crypto, so
there's no point making it be an exported symbol.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
To avoid namespace collisions, rename get_crypt_info() to
fscrypt_get_crypt_info(). The function is only used inside the
fs/crypto directory, so declare it in the new header file,
fscrypt_private.h.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Multiple bugs were recently fixed in the "set encryption policy" ioctl.
To make it clear that fscrypt_process_policy() and fscrypt_get_policy()
implement ioctls and therefore their implementations must take standard
security and correctness precautions, rename them to
fscrypt_ioctl_set_policy() and fscrypt_ioctl_get_policy(). Make the
latter take in a struct file * to make it consistent with the former.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
SHA256 and ENCRYPTED_KEYS are not needed. CTR shouldn't be needed
either, but I left it for now because it was intentionally added by
commit 71dea01ea2 ("ext4 crypto: require CONFIG_CRYPTO_CTR if ext4
encryption is enabled"). So it sounds like there may be a dependency
problem elsewhere, which I have not been able to identify specifically,
that must be solved before CTR can be removed.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently data journalling is incompatible with encryption: enabling both
at the same time has never been supported by design, and would result in
unpredictable behavior. However, users are not precluded from turning on
both features simultaneously. This change programmatically replaces data
journaling for encrypted regular files with ordered data journaling mode.
Background:
Journaling encrypted data has not been supported because it operates on
buffer heads of the page in the page cache. Namely, when the commit
happens, which could be up to five seconds after caching, the commit
thread uses the buffer heads attached to the page to copy the contents of
the page to the journal. With encryption, it would have been required to
keep the bounce buffer with ciphertext for up to the aforementioned five
seconds, since the page cache can only hold plaintext and could not be
used for journaling. Alternatively, it would be required to setup the
journal to initiate a callback at the commit time to perform deferred
encryption - in this case, not only would the data have to be written
twice, but it would also have to be encrypted twice. This level of
complexity was not justified for a mode that in practice is very rarely
used because of the overhead from the data journalling.
Solution:
If data=journaled has been set as a mount option for a filesystem, or if
journaling is enabled on a regular file, do not perform journaling if the
file is also encrypted, instead fall back to the data=ordered mode for the
file.
Rationale:
The intent is to allow seamless and proper filesystem operation when
journaling and encryption have both been enabled, and have these two
conflicting features gracefully resolved by the filesystem.
Fixes: 4461471107
Signed-off-by: Sergey Karamov <skaramov@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Connect the new VFS clone_range, copy_range, and dedupe_range features
to the existing reflink capability of ocfs2. Compared to the existing
ocfs2 reflink ioctl We have to do things a little differently to support
the VFS semantics (we can clone subranges of a file but we don't clone
xattrs), but the VFS ioctls are more broadly supported.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Convert inline data files to extents files before reflinking,
and fix i_blocks so that stat(2) output is correct.
v3: Make zero-length dedupe consistent with btrfs behavior.
v4: Use VFS double-inode lock routines and remove MAX_DEDUPE_LEN.
When ocfs2 shares blocks from one file to another, it's necessary to
charge that many blocks to the quota because ocfs2 tallies block charges
according to the number of blocks mapped, not the number of physical
blocks used.
Without this patch, reflinking X blocks and then CoWing all of them
causes quota usage to *decrease* by X as seen in generic/305.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
generic/188 triggered a dmesg stack trace because the dio completion
was casting a buffer head to an on-disk inode, which is whacky.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Always unlock the inode when completing dio writes, even if an error
has occurrred. The caller already checks the inode and unlocks it
if needed, so we might as well reduce contention.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
ocfs2_dio_end_io_write eats whatever errors may happen,
which means that write errors do not propagate to userspace.
Fix that.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we're adding the refcount flag to an extent, we have to budget
enough space to handle a full extent btree split in addition to
whatever modifications have to be made to the refcount btree. We
don't currently do this, with the result that generic/186 crashes
when we need an extent split but not a refcount split because meta_ac
never gets allocated.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The swapfile mechanism calls bmap once to find all the swap file
mappings, which means that we cannot properly support CoW remapping.
Therefore, error out if the swap code tries to call bmap on a
refcounted file.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Replace the open-coded inode refcount flag test with a helper function
to reduce the potential for bugs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
... and don't zero anything on short copy; just unlock
and return 0 if that has happened on non-uptodate page.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If we had a short copy into an uptodate page, there's no reason
whatsoever to zero anything; OTOH, if that page had _not_ been
uptodate, we must have been trying to overwrite it completely
and got a short copy. In that case, overwriting the end with
zeroes, marking uptodate and sending to server is just plain
wrong. Just unlock, keep it non-uptodate and return 0.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a) the page is uptodate - ->write_begin() would either fail (in which
case we don't reach ->write_end()), or unstuff the inode, or find the
page already uptodate, or do a successful call of stuffed_readpage(),
which would've made it uptodate
b) zeroing the tail in pagecache is wrong. kill -9 at the right time
while writing unmodified file contents to the same file should _not_
leave us in a situation when read() from the file will be reporting
it full of zeroes. Especially since that effect will be transient -
at some later point the page will be evicted and then we'll be back
to the real file contents.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
don't zero on short copies; if the page was uptodate it's just plain
wrong, and if it wasn't we'll be better off just returning 0 and
buggering off.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We should set the error code if kzalloc() fails.
Fixes: 67cf5b09a4 ("ext4: add the basic function for inline data support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Don't load an inode with a negative size; this causes integer overflow
problems in the VFS.
[ Added EXT4_ERROR_INODE() to mark file system as corrupted. -TYT]
Fixes: a48380f769 (ext4: rename i_dir_acl to i_size_high)
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Clients can set the umask attribute when creating files to cause the
server to apply it always except when inheriting permissions from the
parent directory. That way, the new files will end up with the same
permissions as files created locally.
See https://tools.ietf.org/html/draft-ietf-nfsv4-umask-02 for more details.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
What matters when deciding if we should make a page uptodate is
not how much we _wanted_ to copy, but how much we actually have
copied. As it is, on architectures that do not zero tail on
short copy we can leave uninitialized data in page marked uptodate.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The flexfiles client can piggyback both layout errors and layoutstats
as part of the layoutreturn. Both these payloads can get large, with
20 layout error entries taking up about 1.2K, and 4 layoutstats entries
taking up another 1K.
This patch allows a maximum payload of 4k by allocating a full page.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Hoist both the XFS reflink inode state and preparation code and the XFS
file blocks compare functions into the VFS so that ocfs2 can take
advantage of it for reflink and dedupe.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
A clone is a perfectly fine implementation of a file copy, so most
file systems just implement the copy that way. Instead of duplicating
this logic move it to the VFS. Currently btrfs and XFS implement copies
the same way as clones and there is no behavior change for them, cifs
only implements clones and grow support for copy_file_range with this
patch. NFS implements both, so this will allow copy_file_range to work
on servers that only implement CLONE and be lot more efficient on servers
that implements CLONE and COPY.
Signed-off-by: Christoph Hellwig <hch@lst.de>
kernel crashes. Marked for stable - it goes back to 4.6, but started
popping up only in 4.8.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJYSo2yAAoJEEp/3jgCEfOLa1UH/j/nhhmy6bkvNTNQT9PuWmlu
1cGG6JkJq/USEaSO+VXGSAjBCjCngNTYYXBo0IBCnkf11tuwagvz/9LSbvy9P+vu
1IKwcJBFpgcEMEZWsYjVui9uFiDcLYiTPt4pux4tQ4vyj6HEFgioTg/430ApUEOS
ywO1pjRz8RH0FlKhhcTRGOwVcwUzI/aRw7aLeflSwz3mDnh6ajp/8pjvxWf7AN+V
Ih9LygjYNb4IdUcgN2G05z2qKLPfNAoBA+kRdEkOzecX2J0Db8Bu1bfZBxgOK+ui
kpdVlFPkpULbwjlLLpvmOgy7FKgmLfdxuEuQol8hCu0KQ+buP/kZnbjg6QBeCtk=
=1nK/
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.9-rc9' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for an issue with ->d_revalidate() in ceph, causing frequent
kernel crashes.
Marked for stable - it goes back to 4.6, but started popping up only
in 4.8"
* tag 'ceph-for-4.9-rc9' of git://github.com/ceph/ceph-client:
ceph: don't set req->r_locked_dir in ceph_d_revalidate
If .readlink == NULL implies generic_readlink().
Generated by:
to_del="\.readlink.*=.*generic_readlink"
for i in `git grep -l $to_del`; do sed -i "/$to_del"/d $i; done
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If i_op->readlink is NULL, but i_op->get_link is set then vfs_readlink()
defaults to calling generic_readlink().
The IOP_DEFAULT_READLINK flag indicates that the above conditions are met
and the default action can be taken.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Also check d_is_symlink() in callers instead of inode->i_op->readlink
because following patches will allow NULL ->readlink for symlinks.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The /proc/self and /proc/self-thread symlinks have separate but identical
functionality for reading and following. This cleanup utilizes
generic_readlink to remove the duplication.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Here again we are copying form one buffer to another, while jumping through
hoops to make kernel memory look like userspace memory.
For no good reason, since vfs_get_link() provides exactly what is needed.
As a bonus, now the security hook for readlink is also called on the
underlying inode.
Note: this can be called from link-following context. But this is okay:
- not in RCU mode
- commit e54ad7f1ee ("proc: prevent stacking filesystems on top")
- ecryptfs is *reading* the underlying symlink not following it, so the
right security hook is being called
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
btrfs_transaction_abort() has a WARN() to help us nail down whatever
problem lead to the abort. But most of the time, we're aborting for EIO,
and the warning just adds noise.
Signed-off-by: Chris Mason <clm@fb.com>
New inode operations were forgotten to be added to bad_inode. Most of the
time the op is checked for NULL before being called but marking the inode
bad and the check can race (very unlikely).
However in case of ->get_link() only DCACHE_SYMLINK_TYPE is checked before
calling the op, so there's no race and will definitely oops when trying to
follow links on such a beast.
Also remove comments about extinct ops.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
This is all unused code, so remove it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Use NOFS for allocating btree cursors, since they can be called
under the ilock.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Commit 6552321831 ("xfs: remove i_iolock and use i_rwsem in the
VFS inode instead") introduced a regression that truncate(2) doesn't
check on new size, so it succeeds even if the new size exceeds the
current resource limit. Because xfs_setattr_size() was used instead
of xfs_vn_setattr_size(), and the latter calls xfs_vn_change_ok()
first to do sanity check on permission and new size.
This is found by truncate03 test from ltp, and the following is a
simplified reproducer:
#!/bin/bash
dev=/dev/sda5
mnt=/mnt/xfs
mkfs -t xfs -f $dev
mount $dev $mnt
# set max file size to 16k
ulimit -f 16
truncate -s $((16 * 1024 + 1)) /mnt/xfs/testfile
[ $? -eq 0 ] && echo "FAIL: truncate exceeded max file size"
ulimit -f unlimited
umount $mnt
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We always perform integrity operations now, so these mount options
don't do anything. Deprecate them and mark them for removal in
in a year.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
There is no reason anymore for not issuing device integrity
operations when teh filesystem requires ordering or data integrity
guarantees. We should always issue cache flushes and FUA writes
where necessary and let the underlying storage optimise them as
necessary for correct integrity operation.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When we create a new attribute, we first create a shortform
attribute, and try to fit the new attribute into it.
If that fails, we copy the (empty) attribute into a leaf attribute,
and do the copy again. Thus there can be a transient state where
we have an empty leaf attribute.
If we encounter this during log replay, the verifier will fail.
So add a test to ignore this part of the leaf attr verification
during log replay.
Thanks as usual to dchinner for spotting the problem.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We encountered a deadlock where the SEQUENCE that accompanied the
LAYOUTGET triggered a session drain, while ff_layout_alloc_lseg
triggered a GETDEVICEINFO. The GETDEVICEINFO hung waiting for the
session drain, while the LAYOUTGET held the slot waiting for
alloc_lseg to finish.
Avoid this by moving the call to nfs4_find_get_deviceid out of
ff_layout_alloc_lseg and into nfs4_ff_layout_prepare_ds.
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
[dros@primarydata.com: pNFS/flexfiles: fix races in ff_layout_mirror_valid]
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This function sets req->r_locked_dir which is supposed to indicate to
ceph_fill_trace that the parent's i_rwsem is locked for write.
Unfortunately, there is no guarantee that the dir will be locked when
d_revalidate is called, so we really don't want ceph_fill_trace to do
any dcache manipulation from this context. Clear req->r_locked_dir since
it's clearly not safe to do that.
What we really want to know with d_revalidate is whether the dentry
still points to the same inode. ceph_fill_trace installs a pointer to
the inode in req->r_target_inode, so we can just compare that to
d_inode(dentry) to see if it's the same one after the lookup.
Also, since we aren't generally interested in the parent here, we can
switch to using a GETATTR to hint that to the MDS, which also means that
we only need to reserve one cap.
Finally, just remove the d_unhashed check. That's really outside the
purview of a filesystem's d_revalidate. If the thing became unhashed
while we're checking it, then that's up to the VFS to handle anyway.
Fixes: 200fd27c8f ("ceph: use lookup request to revalidate dentry")
Link: http://tracker.ceph.com/issues/18041
Reported-by: Donatas Abraitis <donatas.abraitis@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
f2fs_sync_file() remount_ro
- f2fs_readonly
- destroy_flush_cmd_control
- f2fs_issue_flush
- no fcc pointer!
So, this patch doesn't free fcc in this case, but just stop its kernel thread
which sends flush commands.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
put_compat_statfs64() does NOT return -1 and setting errno to EOVERFLOW
when some variables(like: f_bsize) overflowed in the returned struct.
The reason is that the ubuf->f_blocks is __u64 type, it couldn't be
4bits as the judgement in put_comat_statfs64(). Here correct the
__u32 variables(in struct compat_statfs64) for comparison.
reproducer:
step1. mount hugetlbfs with two different pagesize on ppc64 arch.
$ hugeadm --pool-pages-max 16M:0
$ hugeadm --create-mount
$ mount | grep -i hugetlbfs
none on /var/lib/hugetlbfs/pagesize-16MB type hugetlbfs (rw,relatime,seclabel,pagesize=16777216)
none on /var/lib/hugetlbfs/pagesize-16GB type hugetlbfs (rw,relatime,seclabel,pagesize=17179869184)
step2. compile & run this C program.
$ cat statfs64_test.c
#define _LARGEFILE64_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/statfs.h>
int main()
{
struct statfs64 sb;
int err;
err = syscall(SYS_statfs64, "/var/lib/hugetlbfs/pagesize-16GB", sizeof(sb), &sb);
if (err)
return -1;
printf("sizeof f_bsize = %d, f_bsize=%ld\n", sizeof(sb.f_bsize), sb.f_bsize);
return 0;
}
$ gcc -m32 statfs64_test.c
$ ./a.out
sizeof f_bsize = 4, f_bsize=0
Signed-off-by: Li Wang <liwang@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Previous mkfs.f2fs allows small partition inappropriately, so f2fs should detect
that as well.
Refer this in f2fs-tools.
mkfs.f2fs: detect small partition by overprovision ratio and # of segments
Reported-and-Tested-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The layout-private data may depend on the layout and/or the inode
still existing when it does post-processing and frees its data, so we
need to free them after calling lrp->ld_private.ops->free().
This fixes a mirror list corruption issue in the flexfiles driver.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When we're merging an old entry into our new entry, we want to ensure that
we add the list entry in the correct place.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Otherwise the lock context won't be freed when we're done with it.
From: NeilBrown <neilb@suse.com>
Fixes: 5bd3f817 ("NFSv4: change nfs4_select_rw_stateid to take a lock_context inplace of lock_owner")
Signed-off-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Don't load an inode with a negative size; this causes integer overflow
problems in the VFS.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
On filesystems with a lot of metadata and in metadata intensive workloads
xfs_buf_find() is showing up at the top of the CPU cycles trace. Most of
the CPU time is spent on CPU cache misses while traversing the rbtree.
As the buffer cache does not need any kind of ordering, but fast lookups
a hashtable is the natural data structure to use. The rhashtable
infrastructure provides a self-scaling hashtable implementation and
allows lookups to proceed while the table is going through a resize
operation.
This reduces the CPU-time spent for the lookups to 1/3 even for small
filesystems with a relatively small number of cached buffers, with
possibly much larger gains on higher loaded filesystems.
[dchinner: reduce minimum hash size to an acceptable size for large
filesystems with many AGs with no active use.]
[dchinner: remove stale rbtree asserts.]
[dchinner: use xfs_buf_map for compare function argument.]
[dchinner: make functions static.]
[dchinner: remove redundant comments.]
Signed-off-by: Lucas Stach <dev@lynxeye.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Basically, the pjdfstests set the ownership of a file to 06555, and then
chowns it (as root) to a new uid/gid. Prior to commit a09f99edde ("fuse:
fix killing s[ug]id in setattr"), fuse would send down a setattr with both
the uid/gid change and a new mode. Now, it just sends down the uid/gid
change.
Technically this is NOTABUG, since POSIX doesn't _require_ that we clear
these bits for a privileged process, but Linux (wisely) has done that and I
think we don't want to change that behavior here.
This is caused by the use of should_remove_suid(), which will always return
0 when the process has CAP_FSETID.
In fact we really don't need to be calling should_remove_suid() at all,
since we've already been indicated that we should remove the suid, we just
don't want to use a (very) stale mode for that.
This patch should fix the above as well as simplify the logic.
Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: a09f99edde ("fuse: fix killing s[ug]id in setattr")
Cc: <stable@vger.kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Now we only use the root parameter to print the root objectid in
a tracepoint. We can use the root parameter from the transaction
handle for that. It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_write_and_wait_marked_extents and btrfs_sync_log both call
btrfs_wait_marked_extents, which provides a core loop and then handles
errors differently based on whether it's it's a log root or not.
This means that btrfs_write_and_wait_marked_extents needs to take a root
because btrfs_wait_marked_extents requires one, even though it's only
used to determine whether the root is a log root. The log root code
won't ever call into the transaction commit code using a log root, so we
can factor out the core loop and provide the error handling appropriate
to each waiter in new routines. This allows us to eventually remove
the root argument from btrfs_commit_transaction, and as a result,
btrfs_end_transaction.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the exception of the one case where btrfs_wait_cache_io is called
without a block group, it's called with the same arguments. The root
argument is only used in the special case, so let's factor out the core
and simplify the call in the normal case to require a trans, block group,
and path.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent-tree tracepoints all operate on the extent root, regardless of
which root is passed in. Let's just use the extent root objectid instead.
If it turns out that nobody is depending on the format of this tracepoint,
we can drop the root printing entirely.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This results in btrfs_assert_delayed_root_empty and
btrfs_destroy_delayed_inode taking an fs_info instead of a root.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable. This makes the code considerably
more readable.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We track the node sizes per-root, but they never vary from the values
in the superblock. This patch messes with the 80-column style a bit,
but subsequent patches to factor out root->fs_info into a convenience
variable fix it up again.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The io_ctl->root member was only being used to access root->fs_info.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root is never used. We substitute extent_root in for the
reada_find_extent call, since it's only ever used to obtain the node
size. This call site will be changed to use fs_info in a later patch.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root member is never used except for obtaining an fs_info pointer.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even though a separate root is passed in, we're still operating on the
extent root. Let's use that for the trace point.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_init_new_device only uses the root passed in via the ioctl to
start the transaction. Nothing else that happens is related to whatever
root the user used to initiate the ioctl. We can drop the root requirement
and just use fs_info->dev_root instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many functions that are always called with the same root
argument. Rather than passing the same root every time, we can
pass an fs_info pointer instead and have the function get the root
pointer itself.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are 11 functions that accept a root parameter and immediately
overwrite it. We can pass those an fs_info pointer instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Ensure we release the NFS_LAYOUT_RETURN lock when we invalidate the
layout stateid, so that processes and RPC tasks that are waiting on
the layout return can continue.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
All callers are followed by the same boilerplate - "if it has returned
0, update nd->path/inode/seq - we are not following a symlink here".
Pull it into the function itself, renaming it into step_into().
Rename WALK_GET to WALK_FOLLOW, while we are at it - more descriptive
name.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... turning the condition for put_link() in walk_component() into
"WALK_MORE not passed and depth is non-zero". Again, makes for
simpler arguments.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The function path_is_under() doesn't modify the paths pointed by its
arguments but only browse them. Constifying this pointers make a cleaner
interface to be used by (future) code which may only have access to
const struct path pointers (e.g. LSM hooks).
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With the current code it is possible to lock a mutex twice when
a subsequent reconnects are triggered. On the 1st reconnect we
reconnect sessions and tcons and then persistent file handles.
If the 2nd reconnect happens during the reconnecting of persistent
file handles then the following sequence of calls is observed:
cifs_reopen_file -> SMB2_open -> small_smb2_init -> smb2_reconnect
-> cifs_reopen_persistent_file_handles -> cifs_reopen_file (again!).
So, we are trying to acquire the same cfile->fh_mutex twice which
is wrong. Fix this by moving reconnecting of persistent handles to
the delayed work (smb2_reconnect_server) and submitting this work
every time we reconnect tcon in SMB2 commands handling codepath.
This can also lead to corruption of a temporary file list in
cifs_reopen_persistent_file_handles() because we can recursively
call this function twice.
Cc: Stable <stable@vger.kernel.org> # v4.9+
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
We can not unlock/lock cifs_tcp_ses_lock while walking through ses
and tcon lists because it can corrupt list iterator pointers and
a tcon structure can be released if we don't hold an extra reference.
Fix it by moving a reconnect process to a separate delayed work
and acquiring a reference to every tcon that needs to be reconnected.
Also do not send an echo request on newly established connections.
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
This reverts commit 1beba1b3a9.
The perpcu_counter doesn't provide atomicity in single core and consume more
DRAM. That incurs fs_mark test failure due to ENOMEM.
Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
copy_from_iter_full(), copy_from_iter_full_nocache() and
csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
et.al., advancing iterator only in case of successful full copy
and returning whether it had been successful or not.
Convert some obvious users. *NOTE* - do not blindly assume that
something is a good candidate for those unless you are sure that
not advancing iov_iter in failure case is the right thing in
this case. Anything that does short read/short write kind of
stuff (or is in a loop, etc.) is unlikely to be a good one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If maxBuf is not 0 but less than a size of SMB2 lock structure
we can end up with a memory corruption.
Cc: Stable <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Nick Piggin reported that the CRC overhead in an fsync heavy
workload was higher than expected on a Power8 machine. Part of this
was to do with the fact that the power8 CRC implementation is not
efficient for CRC lengths of less than 512 bytes, and so the way we
split the CRCs over the CRC field means a lot of the CRCs are
reduced to being less than than optimal size.
To optimise this, change the CRC update mechanism to zero the CRC
field first, and then compute the CRC in one pass over the buffer
and write the result back into the buffer. We can do this safely
because anything writing a CRC has exclusive access to the buffer
the CRC is being calculated over.
We leave the CRC verify code the same - it still splits the CRC
calculation - because we do not want read-only operations modifying
the underlying buffer. This is because read-only operations may not
have an exclusive access to the buffer guaranteed, and so temporary
modifications could leak out to to other processes accessing the
buffer concurrently.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Embedding a switch statement in every btree stats inc/add adds a lot
of code overhead to the core btree infrastructure paths. Stats are
supposed to be small and lightweight, but the btree stats have
become big and bloated as we've added more btrees. It needs fixing
because the reflink code will just add more overhead again.
Convert the v2 btree stats to arrays instead of independent
variables, and instead use the type to index the specific btree
array via an enum. This allows us to use array based indexing
to update the stats, rather than having to derefence variables
specific to the btree type.
If we then wrap the xfsstats structure in a union and place uint32_t
array beside it, and calculate the correct btree stats array base
array index when creating a btree cursor, we can easily access
entries in the stats structure without having to switch names based
on the btree type.
We then replace with the switch statement with a simple set of stats
wrapper macros, resulting in a significant simplification of the
btree stats code, and:
text data bss dec hex filename
48905 144 8 49057 bfa1 fs/xfs/libxfs/xfs_btree.o.old
36793 144 8 36945 9051 fs/xfs/libxfs/xfs_btree.o
it reduces the core btree infrastructure code size by close to 25%!
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
After various discussions on linux-fsdevel, it has been decided that it
is not necessary to cap the length of a dedupe request, and that
correctly-written userspace client programs will be able to absorb the
change. Therefore, remove the length clamping behavior.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The on-disk field di_size is used to set i_size, which is a signed
integer of loff_t. If the high bit of di_size is set, we'll end up with
a negative i_size, which will cause all sorts of problems. Since the
VFS won't let us create a file with such length, we should catch them
here in the verifier too.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We shouldn't assert if somehow we end up trying to add an attr fork to
an inode that apparently already has attr extents because this is an
indication of on-disk corruption. Instead, return an error code to
userspace.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In xfs_dir3_data_read, we can encounter the situation where err == 0 and
*bpp == NULL if the given bno offset happens to be a hole; this leads to
a crash if we try to set the buffer type after the _da_read_buf call.
Holes can happen due to corrupt or malicious entries in the bmbt data,
so be a little more careful when we're handling buffers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When reading into memory all extents of a btree-format inode fork,
complain if the number of extents we find is not the same as the number
of extents reported in the inode core. This is needed to stop an IO
action from accessing the garbage areas of the in-core fork.
[dchinner: removed redundant assert]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When we're reading a btree block, make sure that what we retrieved
matches the owner and level; and has a plausible number of records.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
There is no such thing as a zero-level AG btree since even a single-node
zero-records btree has one level. Btree cursor constructors read
cur_nlevels straight from disk and then access things like
cur_bufs[cur_nlevels - 1] which is /really/ bad if cur_nlevels is zero!
Therefore, strengthen the verifiers to prevent this possibility.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
There are a handful of xattr functions which now return
nothing but zero. They can be made void, chased through calling
functions, and error handling etc can be removed.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
By inspection, xfs_bmap_trace_exlist isn't handling cow forks,
and will trace the data fork instead.
Fix this by setting state appropriately if whichfork
== XFS_COW_FORK.
()___()
< @ @ >
| |
{o_o}
(|)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When xfs_bmap_trace_exlist called trace_xfs_extlist,
it sent in the "whichfork" var instead of the bmap "state"
as expected (even though state was already set up for this
purpose).
As a result, the xfs_bmap_class in tracing code used
"whichfork" not state in xfs_iext_state_to_fork(), and got
the wrong ifork pointer. It all goes downhill from
there, including an ASSERT when ifp_bytes is empty
by the time it reaches xfs_iext_get_ext():
XFS: Assertion failed: idx < ifp->if_bytes / sizeof(xfs_bmbt_rec_t)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We've missed properly setting the buffer type for
an AGI transaction in 3 spots now, so just move it
into xfs_read_agi() and set it if we are in a transaction
to avoid the problem in the future.
This is similar to how it is done in i.e. the dir3
and attr3 read functions.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>