Add support for btree staging cursors for the free space btrees. This
is needed both for online repair and also to convert xfs_repair to use
btree bulk loading.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add a new btree function that enables us to bulk load a btree cursor.
This will be used by the upcoming online repair patches to generate new
btrees. This avoids the programmatic inefficiency of calling
xfs_btree_insert in a loop (which generates a lot of log traffic) in
favor of stamping out new btree blocks with ordered buffers, and then
committing both the new root and scheduling the removal of the old btree
blocks in a single transaction commit.
The design of this new generic code is based off the btree rebuilding
code in xfs_repair's phase 5 code, with the explicit goal of enabling us
to share that code between scrub and repair. It has the additional
feature of being able to control btree block loading factors.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Create an in-core fake root for inode-rooted btree types so that callers
can generate a whole new btree using the upcoming btree bulk load
function without making the new tree accessible from the rest of the
filesystem. It is up to the individual btree type to provide a function
to create a staged cursor (presumably with the appropriate callouts to
update the fakeroot) and then commit the staged root back into the
filesystem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Create an in-core fake root for AG-rooted btree types so that callers
can generate a whole new btree using the upcoming btree bulk load
function without making the new tree accessible from the rest of the
filesystem. It is up to the individual btree type to provide a function
to create a staged cursor (presumably with the appropriate callouts to
update the fakeroot) and then commit the staged root back into the
filesystem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add a xbitmap_hweight helper function so that we can get rid of the
open-coded loop.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Shorten the name of xfs_bitmap to xbitmap since the scrub bitmap has
nothing to do with the libxfs bitmap.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Remove the xfs_bitmap_destroy call from the end of xrep_reap_extents
because this sort of violates our rule that the function initializing a
structure should destroy it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Double 'three' exists in the comments of iomap_dio_rw.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Historically we only set the capacity to zero for devices that support
partitions (independ of actually having partitions created). Doing that
is rather inconsistent, but changing it broke legacy udisks polling for
legacy ide-cdrom devices. Use the crude a crude check for devices that
either are non-removable or partitionable to get the sane behavior for
most device while not breaking userspace for this particular setup.
Fixes: a1548b6744 ("block: move rescan_partitions to fs/block_dev.c")
Reported-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No one checks the return value of debugfs_create_file_size, as it's not
needed, so make the return value void, so that no one tries to do so in
the future.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20200309163640.237984-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If we call fiemap on a truncated file with none blocks allocated,
it makes sense we get nothing from this call. No output means
no blocks have been counted, but the call succeeded. It's a valid
response.
Simple example reproducer:
xfs_io -f 'truncate 2M' -c 'fiemap -v' /cifssch/testfile
xfs_io: ioctl(FS_IOC_FIEMAP) ["/cifssch/testfile"]: Invalid argument
Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
The num_remote_opens counter keeps track of the number of open files which must be
maintained by the server at any point. This is a per-tree-connect counter, and the value
of this counter gets displayed in the /proc/fs/cifs/Stats output as a following...
Open files: 0 total (local), 1 open on server
^^^^^^^^^^^^^^^^
As a thumb-rule, we want to increment this counter for each open/create that we
successfully execute on the server. Similarly, we should decrement the counter when
we successfully execute a close.
In this case, an increment was being missed in case of smb2_query_dir_first,
in case of successful open. As a result, we would underflow the counter and we
could even see the counter go to negative after sufficient smb2_query_dir_first calls.
I tested the stats counter for a bunch of filesystem operations with the fix.
And it looks like the counter looks correct to me.
I also check if we missed the increments and decrements elsewhere. It does not
seem so. Few other cases where an open is done and we don't increment the counter are
the compound calls where the corresponding close is also sent in the request.
Signed-off-by: Shyam Prasad N <nspmangalore@gmail.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Smatch complains that "rc" could be uninitialized.
fs/cifs/inode.c:2206 cifs_getattr() error: uninitialized symbol 'rc'.
Changing it to "return 0;" improves readability as well.
Fixes: cc1baf98c8f6 ("cifs: do not ignore the SYNC flags in getattr")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
When I lifted the code in xfs_alloc_ag_vextent_lastblock out of a loop,
I forgot to convert all the accesses to len to be pointer dereferences.
Coverity-id: 1457918
Fixes: 5113f8ec37 ("xfs: clean up weird while loop in xfs_alloc_ag_vextent_near")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The current codebase makes use of the zero-length array language extension
to the C90 standard, but the preferred mechanism to declare variable-length
types such as these ones is a flexible array member[1][2], introduced in
C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning in
case the flexible array does not occur last in the structure, which will
help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by this
change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Fixes: cbe7fba8ed ("ovl: make sure that real fid is 32bit aligned in memory")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The situation is the same as for __d_obtain_alias() (which is what that
thing is parallel to) - if we find a preexisting alias, we want to grab it,
drop the inode and return the alias we'd found.
The only thing d_instantiate_anon() does compared to that is spurious
security_d_instiate() that has already been done to that dentry with exact
same arguments.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Overlayfs works sub-optimally with upper fs that has no xattr/d_type/
RENAME_WHITEOUT support. We should basically deprecate support for those
filesystems, but so far, we only issue a warning and don't fail the mount
for the sake of backward compat. Some features are already being disabled
with no xattr support.
For newly supported remote upper fs, we do not need to worry about backward
compatibility, so we can fail the mount if upper fs is a sub-optimal
filesystem.
This reduces the in-tree remote filesystems supported as upper to just
FUSE, for which the remote upper fs support was added.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
As with other required upper fs features, we only warn if support is
missing to avoid breaking existing sub-optimal setups.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
No reason to prevent upper layer being a remote filesystem. Do the
revalidation in that case, just as we already do for lower layers.
This lets virtiofs be used as upper layer, which appears to be a real use
case.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Allow completely skipping ->revalidate() on a per-dentry basis, in case the
underlying layers used for a dentry do not themselves have ->revalidate().
E.g. negative overlay dentry has no underlying layers, hence revalidate is
unnecessary. Or if lower layer is remote but overlay dentry is pure-upper,
then can skip revalidate.
The following places need to update whether the dentry needs revalidate or
not:
- fill-super (root dentry)
- lookup
- create
- fh_to_dentry
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Following patch will allow remote as upper layer, but not overlay stacked
on upper layer. Separate the two concepts.
This patch is doesn't change behavior.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use a common loop for plain and weak revalidation. This will aid doing
revalidation on upper layer.
This patch doesn't change behavior.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This issue came up with NFSv4 as the lower layer, which generates
"system.nfs4_acl" xattrs (even for plain old unix permissions). Prior to
this patch this prevented copy-up from succeeding.
The overlayfs permission model mandates that permissions are checked
locally for the task and remotely for the mounter(*). NFS4 ACLs are not
supported by the Linux kernel currently, hence they cannot be enforced
locally. Which means it is indifferent whether this attribute is copied or
not.
Generalize this to any xattr that is not used in access checking (i.e. it's
not a POSIX ACL and not in the "security." namespace).
Incidentally, best effort copying of xattrs seems to also be the behavior
of "cp -a", which is what overlayfs tries to mimic.
(*) Documentation/filesystems/overlayfs.txt#Permission model
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Move i_ino initialization to ovl_inode_init() to avoid the dance of setting
i_ino in ovl_fill_inode() sometimes on the first call and sometimes on the
seconds call.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Allocates and initializes the root dentry and inode.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_inode_update() is no longer called from create object code path.
Fixes: 01b39dcc95 ("ovl: use inode_insert5() to hash a newly...")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 6dde1e42f4 ("ovl: make i_ino consistent with st_ino in more
cases"), relaxed the condition nfs_export=on in order to set the value of
i_ino to xino map of real ino.
Specifically, it also relaxed the pre-condition that index=on for
consistent i_ino. This opened the corner case of lower hardlink in
ovl_get_inode(), which calls ovl_fill_inode() with ino=0 and then
ovl_init_inode() is called to set i_ino to lower real ino without the xino
mapping.
Pass the correct values of ino;fsid in this case to ovl_fill_inode(), so it
can initialize i_ino correctly.
Fixes: 6dde1e42f4 ("ovl: make i_ino consistent with st_ino in more ...")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fix a debug-only build error in ext2/xattr.c:
When building without extra debugging, (and with another patch that uses
no_printk() instead of <empty> for the ext2-xattr debug-print macros,
this build error happens:
../fs/ext2/xattr.c: In function ‘ext2_xattr_cache_insert’:
../fs/ext2/xattr.c:869:18: error: ‘ext2_xattr_cache’ undeclared (first use in
this function); did you mean ‘ext2_xattr_list’?
atomic_read(&ext2_xattr_cache->c_entry_count));
Fix the problem by removing cached entry count from the debug message
since otherwise we'd have to export the mbcache structure just for that.
Fixes: be0726d33c ("ext2: convert to mbcache2")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jan Kara <jack@suse.cz>
User extended attributes are useful as metadata storage for kernfs
consumers like cgroups. Especially in the case of cgroups, it is useful
to have a central metadata store that multiple processes/services can
use to coordinate actions.
A concrete example is for userspace out of memory killers. We want to
let delegated cgroup subtree owners (running as non-root) to be able to
say "please avoid killing this cgroup". This is especially important for
desktop linux as delegated subtrees owners are less likely to run as
root.
This patch introduces a new flag, KERNFS_ROOT_SUPPORT_USER_XATTR, that
lets kernfs consumers enable user xattr support. An initial limit of 128
entries or 128KB -- whichever is hit first -- is placed per cgroup
because xattrs come from kernel memory and we don't want to let
unprivileged users accidentally eat up too much kernel memory.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
This helps set up size accounting in the next commit. Without this out
param, it's difficult to find out the removed xattr size without taking
a lock for longer and walking the xattr linked list twice.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
xattr values have a 64k maximum size. This can result in an order 4
kmalloc request which can be difficult to fulfill. Since xattrs do not
need physically contiguous memory, we can switch to kvmalloc and not
have to worry about higher order allocations failing.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
It's meant to be write-only.
Fixes: 89c905becc ("nfsd: allow forced expiration of NFSv4 clients")
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
yuehaibing@huawei.com reports the following build errors arise when
CONFIG_NFSD_V4_2_INTER_SSC is set and the NFS client is not built
into the kernel:
fs/nfsd/nfs4proc.o: In function `nfsd4_do_copy':
nfs4proc.c:(.text+0x23b7): undefined reference to `nfs42_ssc_close'
fs/nfsd/nfs4proc.o: In function `nfsd4_copy':
nfs4proc.c:(.text+0x5d2a): undefined reference to `nfs_sb_deactive'
fs/nfsd/nfs4proc.o: In function `nfsd4_do_async_copy':
nfs4proc.c:(.text+0x61d5): undefined reference to `nfs42_ssc_open'
nfs4proc.c:(.text+0x6389): undefined reference to `nfs_sb_deactive'
The new inter-server copy code invokes client functions. Until the
NFS server has infrastructure to load the appropriate NFS client
modules to handle inter-server copy requests, let's constrain the
way this feature is built.
Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: YueHaibing <yuehaibing@huawei.com> # build-tested
If the rpc.mountd daemon goes down, then that should not cause all
exports to start failing with ESTALE errors. Let's explicitly
distinguish between the cache upcall cases that need to time out,
and those that do not.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add tracing to allow us to figure out where any stale filehandle issues
may be originating from.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
In NFSv4, the lock stateids are tied to the lockowner, and the open stateid,
so that the action of closing the file also results in either an automatic
loss of the locks, or an error of the form NFS4ERR_LOCKS_HELD.
In practice this means we must not add new locks to the open stateid
after the close process has been invoked. In fact doing so, can result
in the following panic:
kernel BUG at lib/list_debug.c:51!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 2 PID: 1085 Comm: nfsd Not tainted 5.6.0-rc3+ #2
Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.14410784.B64.1908150010 08/15/2019
RIP: 0010:__list_del_entry_valid.cold+0x31/0x55
Code: 1a 3d 9b e8 74 10 c2 ff 0f 0b 48 c7 c7 f0 1a 3d 9b e8 66 10 c2 ff 0f 0b 48 89 f2 48 89 fe 48 c7 c7 b0 1a 3d 9b e8 52 10 c2 ff <0f> 0b 48 89 fe 4c 89 c2 48 c7 c7 78 1a 3d 9b e8 3e 10 c2 ff 0f 0b
RSP: 0018:ffffb296c1d47d90 EFLAGS: 00010246
RAX: 0000000000000054 RBX: ffff8ba032456ec8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff8ba039e99cc8 RDI: ffff8ba039e99cc8
RBP: ffff8ba032456e60 R08: 0000000000000781 R09: 0000000000000003
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ba009a4abe0
R13: ffff8ba032456e8c R14: 0000000000000000 R15: ffff8ba00adb01d8
FS: 0000000000000000(0000) GS:ffff8ba039e80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb213f0b008 CR3: 00000001347de006 CR4: 00000000003606e0
Call Trace:
release_lock_stateid+0x2b/0x80 [nfsd]
nfsd4_free_stateid+0x1e9/0x210 [nfsd]
nfsd4_proc_compound+0x414/0x700 [nfsd]
? nfs4svc_decode_compoundargs+0x407/0x4c0 [nfsd]
nfsd_dispatch+0xc1/0x200 [nfsd]
svc_process_common+0x476/0x6f0 [sunrpc]
? svc_sock_secure_port+0x12/0x30 [sunrpc]
? svc_recv+0x313/0x9c0 [sunrpc]
? nfsd_svc+0x2d0/0x2d0 [nfsd]
svc_process+0xd4/0x110 [sunrpc]
nfsd+0xe3/0x140 [nfsd]
kthread+0xf9/0x130
? nfsd_destroy+0x50/0x50 [nfsd]
? kthread_park+0x90/0x90
ret_from_fork+0x1f/0x40
The fix is to ensure that lock creation tests for whether or not the
open stateid is unhashed, and to fail if that is the case.
Fixes: 659aefb68e ("nfsd: Ensure we don't recognise lock stateids after freeing them")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svcrdma expects that the payload falls precisely into the xdr_buf
page vector. This does not seem to be the case for
nfsd4_encode_readv().
This code is called only when fops->splice_read is missing or when
RQ_SPLICE_OK is clear, so it's not a noticeable problem in many
common cases.
Add new transport method: ->xpo_read_payload so that when a READ
payload does not fit exactly in rq_res's page vector, the XDR
encoder can inform the RPC transport exactly where that payload is,
without the payload's XDR pad.
That way, when a Write chunk is present, the transport knows what
byte range in the Reply message is supposed to be matched with the
chunk.
Note that the Linux NFS server implementation of NFS/RDMA can
currently handle only one Write chunk per RPC-over-RDMA message.
This simplifies the implementation of this fix.
Fixes: b042098063 ("nfsd4: allow exotic read compounds")
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
list_for_each_entry_rcu() has built-in RCU and lock checking.
Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
list_for_each_entry_rcu() has built-in RCU and lock checking.
Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Currently, nfsd4_encode_exchange_id() encodes the utsname nodename
string in the server_scope field. In a multi-host container
environemnt, if an nfsd container is restarted on a different host than
it was originally running on, clients will see a server_scope mismatch
and will not attempt to reclaim opens.
Instead, set the server_scope while we're in a process context during
service startup, so we get the utsname nodename of the current process
and store that in nfsd_net.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
[bfields: fix up major_id too]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Link: https://lore.kernel.org/r/20200309202715.GA9428@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Link: https://lore.kernel.org/r/20200309180441.GA2992@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In function nfs_permission:
1. the rcu_read_lock and rcu_read_unlock around nfs_do_access
is unnecessary because the rcu critical data structure is already
protected in subsidiary function nfs_access_get_cached_rcu. No other
data structure needs rcu_read_lock in nfs_do_access.
2. call nfs_do_access once is enough, because:
2-1. when mask has MAY_NOT_BLOCK bit
The second call to nfs_do_access will not happen.
2-2. when mask has no MAY_NOT_BLOCK bit
The second call to nfs_do_access will happen if res == -ECHILD, which
means the first nfs_do_access goes out after statement if (!may_block).
The second call to nfs_do_access will go through this procedure once
again except continue the work after if (!may_block).
But above work can be performed by only one call to nfs_do_access
without mangling the mask flag.
Tested in x86_64
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When we receive a CB_RECALL_ANY that asks us to return flexfiles
layouts, we iterate through all the layouts and look at whether or
not there are active open file descriptors that might need them
for I/O. If there are no such descriptors, we return the layouts.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Convert nfs_delegation_reap_unclaimed() to use nfs_client_for_each_server()
for efficiency.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Convert it to use the nfs_client_for_each_server() helper, and
make it more efficient by skipping delegations for inodes we
know are in the process of being freed. Also improve the efficiency
of the cursor by skipping delegations that are being freed.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add a helper nfs_client_for_each_server() to iterate through all the
filesystems that are attached to a struct nfs_client, and apply
a function to all the active ones.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Now that we can rely on just the rcu_read_lock(), remove the
clp->cl_lock and clean up.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Make sure to test the stateid for validity so that we catch instances
where the server may have been reusing stateids in
nfs_layout_find_inode_by_stateid().
Fixes: 7b410d9ce4 ("pNFS: Delay getting the layout header in CB_LAYOUTRECALL handlers")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ensure that if the DS is returning too many DELAY and GRACE errors, we
also report that to the MDS through the layouterror mechanism.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Currently, we have no real limit on the access cache size (we set it
to ULONG_MAX). That can lead to credentials getting pinned for a
very long time on lots of files if you have a system with a lot of
memory.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In both async rename and rename, we take a reference to the
cred in the call arguments.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Avoid unnecessary references to the cred when we have already referenced
it through the open context or the open owner.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In read/write/commit, we should be able to assume that the cred is
pinned by the open context.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If we're creating a nfs_open_context() for a specific file pointer,
we must use the cred assigned to that file.
Fixes: a52458b48a ("NFS/NFSD/SUNRPC: replace generic creds with 'struct cred'.")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We can't allow delegreturn to hold up nfs4_evict_inode() forever,
since that can cause the memory shrinkers to block. This patch
therefore ensures that we eventually time out, and complete the
reclaim of the inode.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the cred assigned to the layout that we're updating differs from
the one used to retrieve the new layout segment, then we need to
update the layout plh_lc_cred field.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the cred assigned to the delegation that we're updating differs
from the one we're updating too, then we need to update that field
too.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When we're running as a 64-bit architecture and are not running in
32-bit compatibility mode, it is better to use the 64-bit readdir
cookies that supplied by the server. Doing so improves the accuracy
of telldir()/seekdir(), particularly when the directory is changing,
for instance, when doing 'rm -rf'.
We still fall back to using the 32-bit offsets on 32-bit architectures
and when in compatibility mode.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
prevent inodes from vanishing, but ihold() does not guarantee inode
persistence. Replace the inode pointer with a per boot, machine wide,
unique inode identifier. The second commit fixes the breakage of the hash
mechanism whihc causes a 100% performance regression.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5uQJsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYp4EAC5fr/AyRaIn/AEIZUmoyK6ELUaknfH
Z788avxDB/t5GkzC9A2dMpybYi78tzLSAEfB8jYgwbrqExapVtiqvjGZ1RIi3HoN
f/DWLnOb2s+yYQ3BQlHu4RdKONEzCqBwKFpElGRv3JzCY8Qeh5cQBzdqzvOEFmYw
P7DJVtJRZ2dud7AzJ+xk6KuNIKCT2F7Djmtop6nq1EVw0J/2oYOVgQu76APBj7cj
32srLmpP4xcQiJmWLC5ksXKiZrMPnyNfwXhHFufNvJ2Re6+Wf8mmglqG/5DmA+Ns
Sq3L7D7yXwtWQZ8Po1qnWhPDZVXQbWzHyTn4YAMJAK7yoO7mut8jgECt+A8vf4L+
hsc41c6THfdCQQ9gmxLL+c08nZGlmvIC4/1RsihNZ3kd2o4k6Ah9xFp8lBFcpjWd
7tuhakNqJvUOvB34t2AYqzMFbZ/FJG+QNGyIW0bTUn4YIgRPxI/zsdMxqGVBZ4oN
0iuy1kPLGbGAnLU9thkiVMmAyaPesuiB6f+mmzobEUgGI35GrCJi6a4YaTG1sqFn
Gl8oPzcU2n+DWbVBfJrVFHJye7oi78kCw6wpNLBCJQp8NP8doAH0Sgspglg52E/p
G4GGLz0vGauHBC5wQ3WYiGLImWbzC1dwKdcNE7dhuTgXbhz8ChVlOSU9Fu4+pGpq
6URL6DVTLwDZPg==
=e2iB
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex fix from Thomas Gleixner:
"Fix for yet another subtle futex issue.
The futex code used ihold() to prevent inodes from vanishing, but
ihold() does not guarantee inode persistence. Replace the inode
pointer with a per boot, machine wide, unique inode identifier.
The second commit fixes the breakage of the hash mechanism which
causes a 100% performance regression"
* tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Unbreak futex hashing
futex: Fix inode life-time issue
If the xfs_buf_map array allocation in xfs_dabuf_map fails for whatever
reason, we bail out with error code zero. This will confuse callers, so
make sure that we return ENOMEM. Allocation failure should never happen
with the small size of the array, but code defensively anyway.
Fixes: 45feef8f50 ("xfs: refactor xfs_dabuf_map")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Enable io-wq hashing stuff for dependent works simply by re-enqueueing
such requests.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's a preparation patch removing io_wq_enqueue_hashed(), which
now should be done by io_wq_hash_work() + io_wq_enqueue().
Also, set hash value for dependant works, and do it as late as possible,
because req->file can be unavailable before. This hash will be ignored
by io-wq.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This little tweak restores the behaviour that was before the recent
io_worker_handle_work() optimisation patches. It makes the function do
cond_resched() and flush_signals() only if there is an actual work to
execute.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Processing links, io_submit_sqe() prepares requests, drops sqes, and
passes them with sqe=NULL to io_queue_sqe(). There IOSQE_DRAIN and/or
IOSQE_ASYNC requests will go through the same prep, which doesn't expect
sqe=NULL and fail with NULL pointer deference.
Always do full prepare including io_alloc_async_ctx() for linked
requests, and then it can skip the second preparation.
Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can use the variable allocated_clusters rather than map_from_clusters
to control reserved block/cluster accounting in ext4_ext_map_blocks.
This eliminates a variable and associated code and improves readability
a little.
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200311205125.25061-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that the eofblocks code has been removed, we don't need to assign
0 to err before calling ext4_ext_insert_extent() since it will assign
a return value to ret anyway. The variable free_on_err can be
eliminated and replaced by a reference to allocated_clusters which
clearly conveys the idea that newly allocated blocks should be freed
when recovering from an extent insertion failure. The error handling
code itself should be restructured so that it errors out immediately on
an insertion failure in the case where no new blocks have been allocated
(bigalloc) rather than proceeding further into the mapping code. The
initializer for fb_flags can also be rearranged for improved
readability. Finally, insert a missing space in nearby code.
No known bugs are addressed by this patch - it's simply a cleanup.
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200311205033.25013-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200309180813.GA3347@embeddedor
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200309154838.GA31559@embeddedor
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch moves ext4_fiemap to use iomap framework.
For xattr a new 'ext4_iomap_xattr_ops' is added.
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/b9f45c885814fcdd0631747ff0fe08886270828c.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
For indirect block mapping if the i_block > max supported block in inode
then ext4_ind_map_blocks() returns a -EIO error. But in case of fiemap
this could be a valid query to ->iomap_begin call.
So check if the offset >= s_bitmap_maxbytes in ext4_iomap_begin_report(),
then simply skip calling ext4_map_blocks().
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/87fa0ddc5967fa707656212a3b66a7233425325c.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_iomap_begin is already implemented which provides ext4_map_blocks,
so just move the API from generic_block_bmap to iomap_bmap for iomap
conversion.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8bbd53bd719d5ccfecafcce93f2bf1d7955a44af.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch avoids the memory alloc & free path when depth is 0,
since anyway there is no extra caching done in that case.
So on checking depth 0, simply return early.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/93da0d0f073c73358e85bb9849d8a5378d1da539.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
IOMAP_F_MERGED needs to be set in case of non-extent based mapping.
This is needed in later patches for conversion of ext4_fiemap to use iomap.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/a4764c91c08c16d4d4a4b36defb2a08625b0e9b3.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-03-13
The following pull-request contains BPF updates for your *net-next* tree.
We've added 86 non-merge commits during the last 12 day(s) which contain
a total of 107 files changed, 5771 insertions(+), 1700 deletions(-).
The main changes are:
1) Add modify_return attach type which allows to attach to a function via
BPF trampoline and is run after the fentry and before the fexit programs
and can pass a return code to the original caller, from KP Singh.
2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher
objects to be visible in /proc/kallsyms so they can be annotated in
stack traces, from Jiri Olsa.
3) Extend BPF sockmap to allow for UDP next to existing TCP support in order
in order to enable this for BPF based socket dispatch, from Lorenz Bauer.
4) Introduce a new bpftool 'prog profile' command which attaches to existing
BPF programs via fentry and fexit hooks and reads out hardware counters
during that period, from Song Liu. Example usage:
bpftool prog profile id 337 duration 3 cycles instructions llc_misses
4228 run_cnt
3403698 cycles (84.08%)
3525294 instructions # 1.04 insn per cycle (84.05%)
13 llc_misses # 3.69 LLC misses per million isns (83.50%)
5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition
of a new bpf_link abstraction to keep in particular BPF tracing programs
attached even when the applicaion owning them exits, from Andrii Nakryiko.
6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering
and which returns the PID as seen by the init namespace, from Carlos Neira.
7) Refactor of RISC-V JIT code to move out common pieces and addition of a
new RV32G BPF JIT compiler, from Luke Nelson.
8) Add gso_size context member to __sk_buff in order to be able to know whether
a given skb is GSO or not, from Willem de Bruijn.
9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output
implementation but can be called from tracepoint programs, from Eelco Chaudron.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now the tail ends of follow_dotdot{,_rcu}() are pretty
much the open-coded analogues of step_into(). The differences:
* the lack of proper LOOKUP_NO_XDEV handling in non-RCU case
(arguably a bug)
* the lack of ->d_manage() handling (again, arguably a bug)
Adjust the calling conventions so that on the next step with could
just switch those functions to returning step_into().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Behaviour change: LOOKUP_BENEATH lookup of .. in absolute root
yields an error even if it's not the process' root. That's
possible only if you'd managed to escape chroot jail by way of
procfs symlinks, but IMO the resulting behaviour is not worse -
more consistent and easier to describe:
".." in root is "stay where you are", uness LOOKUP_BENEATH
has been given, in which case it's "fail with EXDEV".
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of returning 0, return new dentry; instead of returning
-ENOENT, return NULL. Adjust the callers accordingly.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Don't mess with got_write there - it is guaranteed to be false on
entry and it will be set true if and only if we decide to go for
truncation and manage to get write access for that.
Don't carry acc_mode through the entire thing - it's only used
in that part. And don't bother with gotos in there - compiler is
quite capable of optimizing that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's easier to drop it right after lookup_open() and regain if
needed (i.e. if we will need to truncate). On the non-FMODE_OPENED
path we do that anyway. In case of FMODE_CREATED we won't be
needing it. And it's easier to prove correctness that way,
especially since the initial failure to get write access is not
always fatal; proving that we'll never end up truncating in that
case is rather convoluted.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
there we'll be able to merge it with its counterparts in other
cases, and there's no reason to do it before the parent has
been unlocked
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
->atomic_open() might have used a different alias than the one we'd
passed to it; in "not opened" case we take care of that, in "opened"
one we don't. Currently we don't care downstream of "opened" case
which alias to return; however, that will change shortly when we
get to unifying may_open() calls.
It's not hard to get right in all cases, anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
common guts of follow_down() and follow_managed() taken to a new
helper - traverse_mounts(). The remnants of follow_managed()
are folded into its sole remaining caller (handle_mounts()).
Calling conventions of handle_mounts() slightly sanitized -
instead of the weird "1 for success, -E... for failure" that used
to be imposed by the calling conventions of walk_component() et.al.
we can use the normal "0 for success, -E... for failure".
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We use nd->stack to store two things: pinning down the symlinks
we are resolving and resuming the name traversal when a nested
symlink is finished.
Currently, nd->depth is used to keep track of both. It's 0 when
we call link_path_walk() for the first time (for the pathname
itself) and 1 on all subsequent calls (for trailing symlinks,
if any). That's fine, as far as pinning symlinks goes - when
handling a trailing symlink, the string we are interpreting
is the body of symlink pinned down in nd->stack[0]. It's
rather inconvenient with respect to handling nested symlinks,
though - when we run out of a string we are currently interpreting,
we need to decide whether it's a nested symlink (in which case
we need to pick the string saved back when we started to interpret
that nested symlink and resume its traversal) or not (in which
case we are done with link_path_walk()).
Current solution is a bit of a kludge - in handling of trailing symlink
(in lookup_last() and open_last_lookups() we clear nd->stack[0].name.
That allows link_path_walk() to use the following rules when
running out of a string to interpret:
* if nd->depth is zero, we are at the end of pathname itself.
* if nd->depth is positive, check the saved string; for
nested symlink it will be non-NULL, for trailing symlink - NULL.
It works, but it's rather non-obvious. Note that we have two sets:
the set of symlinks currently being traversed and the set of postponed
pathname tails. The former is stored in nd->stack[0..nd->depth-1].link
and it's valid throught the pathname resolution; the latter is valid only
during an individual call of link_path_walk() and it occupies
nd->stack[0..nd->depth-1].name for the first call of link_path_walk() and
nd->stack[1..nd->depth-1].name for subsequent ones. The kludge is basically
a way to recognize the second set becoming empty.
The things get simpler if we keep track of the second set's size
explicitly and always store it in nd->stack[0..depth-1].name.
We access the second set only inside link_path_walk(), so its
size can live in a local variable; that way the check becomes
trivial without the need of that kludge.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
old flags & WALK_FOLLOW <=> new !(flags & WALK_TRAILING)
That's what that flag had really been used for.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
->last_type values are set in 3 places: path_init() (sets to LAST_ROOT),
link_path_walk (LAST_NORM/DOT/DOTDOT) and pick_link (LAST_BIND).
The are checked in walk_component(), lookup_last() and do_last().
They also get copied to the caller by filename_parentat(). In the last
3 cases the value is what we had at the return from link_path_walk().
In case of walk_component() it's either directly downstream from
assignment in link_path_walk() or, when called by lookup_last(), the
value we have at the return from link_path_walk().
The value at the entry into link_path_walk() can survive to return only
if the pathname contains nothing but slashes. Note that pick_link()
never returns such - pure jumps are handled directly. So for the calls
of link_path_walk() for trailing symlinks it does not matter what value
had been there at the entry; the value at the return won't depend upon it.
There are 3 call chains that might have pick_link() storing LAST_BIND:
1) pick_link() from step_into() from walk_component() from
link_path_walk(). In that case we will either be parsing the next
component immediately after return into link_path_walk(), which will
overwrite the ->last_type before anyone has a chance to look at it,
or we'll fail, in which case nobody will be looking at ->last_type at all.
2) pick_link() from step_into() from walk_component() from lookup_last().
The value is never looked at due to the above; it won't affect the value
seen at return from any link_path_walk().
3) pick_link() from step_into() from do_last(). Ditto.
In other words, assignemnt in pick_link() is pointless, and so is
LAST_BIND itself; nothing ever looks at that value. Kill it off.
And make link_path_walk() _always_ assign ->last_type - in the only
case when the value at the entry might survive to the return that value
is always LAST_ROOT, inherited from path_init(). Move that assignment
from path_init() into the beginning of link_path_walk(), to consolidate
the things.
Historical note: LAST_BIND used to be used for the kludge with trailing
pure jump symlinks (extra iteration through the top-level loop).
No point keeping it anymore...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move the call of get_link() into walk_component(). Change the
calling conventions for walk_component() to returning the link
body to follow (if any).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
After a pure jump ("/" or procfs-style symlink) we don't need to
hold the link anymore. link_path_walk() dropped it if such case
had been detected, lookup_last/do_last() (i.e. old trailing_symlink())
left it on the stack - it ended up calling terminate_walk() shortly
anyway, which would've purged the entire stack.
Do it in get_link() itself instead. Simpler logics that way...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fold trailing_symlink() into lookup_last() and do_last(), change
the calling conventions of those two. Rules change:
success, we are done => NULL instead of 0
error => ERR_PTR(-E...) instead of -E...
got a symlink to follow => return the path to be followed instead of 1
The loops calling those (in path_lookupat() and path_openat()) adjusted.
A subtle change of control flow here: originally a pure-jump trailing
symlink ("/" or procfs one) would've passed through the upper level
loop once more, with "" for path to traverse. That would've brought
us back to the lookup_last/do_last entry and we would've hit LAST_BIND
case (LAST_BIND left from get_link() called by trailing_symlink())
and pretty much skip to the point right after where we'd left the
sucker back when we picked that trailing symlink.
Now we don't bother with that extra pass through the upper level
loop - if get_link() says "I've just done a pure jump, nothing
else to do", we just treat that as non-symlink case.
Boilerplate added on that step will go away shortly - it'll migrate
into walk_component() and then to step_into(), collapsing into the
change of calling conventions for those.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move restoring LOOKUP_PARENT and zeroing nd->stack.name[0] past
the call of get_link() (nothing _currently_ uses them in there).
That allows to moved the call of may_follow_link() into get_link()
as well, since now the presence of LOOKUP_PARENT distinguishes
the callers from each other (link_path_walk() has it, trailing_symlink()
doesn't).
Preparations for folding trailing_symlink() into callers (lookup_last()
and do_last()) and changing the calling conventions of those. Next
stage after that will have get_link() call migrate into walk_component(),
then - into step_into(). It's tricky enough to warrant doing that
in stages, unfortunately...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New LOOKUP flag, telling path_lookupat() to act as path_mountpointat().
IOW, traverse mounts at the final point and skip revalidation of the
location where it ends up.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The following is true:
* calls of handle_mounts() and step_into() are always
paired in sequences like
err = handle_mounts(nd, dentry, &path, &inode, &seq);
if (unlikely(err < 0))
return err;
err = step_into(nd, &path, flags, inode, seq);
* in all such sequences path is uninitialized before and
unused after this pair of calls
* in all such sequences inode and seq are unused afterwards.
So the call of handle_mounts() can be shifted inside step_into(),
turning 'path' into a local variable in the combined function.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tells step_into() not to follow symlinks, regardless of LOOKUP_FOLLOW.
Allows to switch handle_lookup_down() to of step_into(), getting
all follow_managed() and step_into() calls paired.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We need to dismiss a symlink when we are done traversing it;
currently that's done when we call step_into() for its last
component. For the cases when we do not call step_into()
for that component (i.e. when it's . or ..) we do the same
symlink dismissal after the call of handle_dots().
What we need to guarantee is that the symlink won't be dismissed
while we are still using nd->last.name - it's pointing into the
body of said symlink. step_into() is sufficiently late - by
the time it's called we'd already obtained the dentry, so the
name we'd been looking up is no longer needed. However, it
turns out to be cleaner to have that ("we are done with that
component now, can dismiss the link") done explicitly - in the
callers of step_into().
In handle_dots() case we won't be using the component string
at all, so for . and .. the corresponding point is actually
_before_ the call of handle_dots(), not after it.
Fix a minor irregularity in do_last(), while we are at it -
if trailing symlink ended with . or .. we forgot to dismiss
it. Not a problem, since nameidata is about to be done with
(neither . nor .. can be a trailing symlink, so this is the
last iteration through the loop) and terminate_walk() will
clean the stack anyway, but let's keep it more regular.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Current calling conventions: -E... on error, 0 on cache miss,
result of handle_mounts(nd, dentry, path, inode, seqp) on
success. Turn that into returning ERR_PTR(-E...), NULL and dentry
resp.; deal with handle_mounts() in the callers. The thing
is, they already do that in cache miss handling case, so we
just need to supply dentry to them and unify the mount traversal
in those cases. Fewer arguments that way, and we get closer
to merging handle_mounts() and step_into().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1) in case of __follow_mount_rcu() failure, lookup_fast() proceeds
to call unlazy_child() and, should it succeed, handle_mounts().
Note that we have status > 0 (or we wouldn't be calling
__follow_mount_rcu() at all), so all stuff conditional upon
non-positive status won't be even touched.
Consolidate just that sequence after the call of __follow_mount_rcu().
2) calling d_is_negative() and keeping its result is pointless -
we either don't get past checking ->d_seq (and don't use the results of
d_is_negative() at all), or we are guaranteed that ->d_inode and
type bits of ->d_flags had been consistent at the time of d_is_negative()
call. IOW, we could only get to the use of its result if it's
equal to !inode. The same ->d_seq check guarantees that after that point
this CPU won't observe ->d_flags values older than ->d_inode update.
So 'negative' variable is completely pointless these days.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix the handling of signals in client rxrpc calls made by the afs
filesystem. Ignore signals completely, leaving call abandonment or
connection loss to be detected by timeouts inside AF_RXRPC.
Allowing a filesystem call to be interrupted after the entire request has
been transmitted and an abort sent means that the server may or may not
have done the action - and we don't know. It may even be worse than that
for older servers.
Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells <dhowells@redhat.com>
When an AFS service handler function aborts a call, AF_RXRPC marks the call
as complete - which means that it's not going to get any more packets from
the receiver. This is a problem because reception of the final ACK is what
triggers afs_deliver_to_call() to drop the final ref on the afs_call
object.
Instead, aborted AFS service calls may then just sit around waiting for
ever or until they're displaced by a new call on the same connection
channel or a connection-level abort.
Fix this by calling afs_set_call_complete() to finalise the afs_call struct
representing the call.
However, we then need to drop the ref that stops the call from being
deallocated. We can do this in afs_set_call_complete(), as the work queue
is holding a separate ref of its own, but then we shouldn't do it in
afs_process_async_call() and afs_delete_async_call().
call->drop_ref is set to indicate that a ref needs dropping for a call and
this is dealt with when we transition a call to AFS_CALL_COMPLETE.
But then we also need to get rid of the ref that pins an asynchronous
client call. We can do this by the same mechanism, setting call->drop_ref
for an async client call too.
We can also get rid of call->incoming since nothing ever sets it and only
one thing ever checks it (futilely).
A trace of the rxrpc_call and afs_call struct ref counting looks like:
<idle>-0 [001] ..s5 164.764892: rxrpc_call: c=00000002 SEE u=3 sp=rxrpc_new_incoming_call+0x473/0xb34 a=00000000442095b5
<idle>-0 [001] .Ns5 164.766001: rxrpc_call: c=00000002 QUE u=4 sp=rxrpc_propose_ACK+0xbe/0x551 a=00000000442095b5
<idle>-0 [001] .Ns4 164.766005: rxrpc_call: c=00000002 PUT u=3 sp=rxrpc_new_incoming_call+0xa3f/0xb34 a=00000000442095b5
<idle>-0 [001] .Ns7 164.766433: afs_call: c=00000002 WAKE u=2 o=11 sp=rxrpc_notify_socket+0x196/0x33c
kworker/1:2-1810 [001] ...1 164.768409: rxrpc_call: c=00000002 SEE u=3 sp=rxrpc_process_call+0x25/0x7ae a=00000000442095b5
kworker/1:2-1810 [001] ...1 164.769439: rxrpc_tx_packet: c=00000002 e9f1a7a8:95786a88:00000008:09c5 00000001 00000000 02 22 ACK CallAck
kworker/1:2-1810 [001] ...1 164.769459: rxrpc_call: c=00000002 PUT u=2 sp=rxrpc_process_call+0x74f/0x7ae a=00000000442095b5
kworker/1:2-1810 [001] ...1 164.770794: afs_call: c=00000002 QUEUE u=3 o=12 sp=afs_deliver_to_call+0x449/0x72c
kworker/1:2-1810 [001] ...1 164.770829: afs_call: c=00000002 PUT u=2 o=12 sp=afs_process_async_call+0xdb/0x11e
kworker/1:2-1810 [001] ...2 164.771084: rxrpc_abort: c=00000002 95786a88:00000008 s=0 a=1 e=1 K-1
kworker/1:2-1810 [001] ...1 164.771461: rxrpc_tx_packet: c=00000002 e9f1a7a8:95786a88:00000008:09c5 00000002 00000000 04 00 ABORT CallAbort
kworker/1:2-1810 [001] ...1 164.771466: afs_call: c=00000002 PUT u=1 o=12 sp=SRXAFSCB_ProbeUuid+0xc1/0x106
The abort generated in SRXAFSCB_ProbeUuid(), labelled "K-1", indicates that
the local filesystem/cache manager didn't recognise the UUID as its own.
Fixes: 2067b2b3f4 ("afs: Fix the CB.ProbeUuid service handler to reply correctly")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix a couple of tracelines to indicate the usage count after the atomic op,
not the usage count before it to be consistent with other afs and rxrpc
trace lines.
Change the wording of the afs_call_trace_work trace ID label from "WORK" to
"QUEUE" to reflect the fact that it's queueing work, not doing work.
Fixes: 341f741f04 ("afs: Refcount the afs_call struct")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the interruptibility of kernel-initiated client calls so that they're
either only interruptible when they're waiting for a call slot to come
available or they're not interruptible at all. Either way, they're not
interruptible during transmission.
This should help prevent StoreData calls from being interrupted when
writeback is in progress. It doesn't, however, handle interruption during
the receive phase.
Userspace-initiated calls are still interruptable. After the signal has
been handled, sendmsg() will return the amount of data copied out of the
buffer and userspace can perform another sendmsg() call to continue
transmission.
Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells <dhowells@redhat.com>
Fixes:
- Ensure the fs_context has the correct fs_type when mounting and submounting
- Fix leaking of ctx->nfs_server.hostname
- Add minor version to fscache key to prevent collisions
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl5r+QkACgkQ18tUv7Cl
QOtq4Q/+Oo707rb3N7DrPikUARB8D7FMTs/m/+xSPNm2DSllImIXJUdckqaoZkwc
DwBMLw+ZDvHtcNytytJQOWJNp9LGjHpZ20g0TLr2p2/JRrQyGgpc0FxTJONwA5Pp
zU6MSgqqfMZ5nLgxpMKsqoPNzO45sS8SKi2I6yZIupLZlsZOzF8L1wL/zc6gJvv8
71UGrSId9mEMKCrE8EQRx7etct5VPuP+pXfDGz4oaI1tdEmfmx3FoJlzZA1/Pf90
YSHdGZb7mR3LFkFRDlnh6NFHWU+yE+b5iWCt32ifO8pCN/CyIUvBxQblx4VLA47H
6S5nrYA96zRcQwhh9B/8yWLiqqxXo2hNl574uBJL/iDqSKSmkEBxZmCbE3aFEGa8
ClWlF6T5z4dlcAlKWXQkn3EXBHzL5+Opev5dArMhqNkr55g4z9Opsa6sc0ZWdywf
h/rSM8bHn9SNYkCGFHQ1MjAn6eNU0vVQ/s9DhM2xdtyfyTQOOHx5yA/KF6aGG5oQ
3mlVEJCEfsBKyWWjHhq3e/7ezgLlKRRlauxdLgjmKy+PmtlY6mGii0eF601e9OSL
RvK7I5/9spbcYmkyyQs0BxitrDZObyhxk31hgNrUMlN/JJrhPhKiyll8/Z7Z4sq6
QP6T/Vfn2FORA3ZAMtMH6V/ZOeiXUZjOkBRVWArIrPwOBJn/EQY=
=3DDn
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.6-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client bugfixes from Anna Schumaker:
"These are mostly fscontext fixes, but there is also one that fixes
collisions seen in fscache:
- Ensure the fs_context has the correct fs_type when mounting and
submounting
- Fix leaking of ctx->nfs_server.hostname
- Add minor version to fscache key to prevent collisions"
* tag 'nfs-for-5.6-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
nfs: add minor version to nfs_server_key for fscache
NFS: Fix leak of ctx->nfs_server.hostname
NFS: Don't hard-code the fs_type when submounting
NFS: Ensure the fs_context has the correct fs_type before mounting
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXmpHOAAKCRDh3BK/laaZ
PP0XAQCN52kSOBiSvr8xiQrO5YOONo4yfPDi6qIk/ltvA1yr6wEA3NWAepAL07AS
n51hMi02+JNXuMVnxOm0z2us5/PYJw0=
=MJC1
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fix from Miklos Szeredi:
"Fix an Oops introduced in v5.4"
* tag 'fuse-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix stack use after return
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXmufyAAKCRDh3BK/laaZ
POXNAQDmkgiy41nUQZ3LxtGKstsgVuzFhqBq+erinBPcF1r9mQEA/xJp4uc2Q8NO
JKZZHyWFLtAN8gGNYTCli4vrm1LoKQc=
=JV3K
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Fix three bugs introduced in this cycle"
* tag 'ovl-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix lockdep warning for async write
ovl: fix some xino configurations
ovl: fix lock in ovl_llseek()
During a rename whiteout, if btrfs_whiteout_for_rename() returns an error
we can end up returning from btrfs_rename() with the log context object
still in the root's log context list - this happens if 'sync_log' was
set to true before we called btrfs_whiteout_for_rename() and it is
dangerous because we end up with a corrupt linked list (root->log_ctxs)
as the log context object was allocated on the stack.
After btrfs_rename() returns, any task that is running btrfs_sync_log()
concurrently can end up crashing because that linked list is traversed by
btrfs_sync_log() (through btrfs_remove_all_log_ctxs()). That results in
the same issue that commit e6c617102c ("Btrfs: fix log context list
corruption after rename exchange operation") fixed.
Fixes: d4682ba03e ("Btrfs: sync log after logging new name")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5rxtkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv/xEACifgfgyE3a2ZM7w2VTe41IpMxOouEnUWOJ
oVKRp+9gkynE8pUGlE1igTa7T2nQIZM+Qd0KWqknkP2iFiQaNXSqqr8U6qIz9lzV
I6SAcj0Pa2FRzlRly5UXLKiadIHbt2OfP6PIk6sXTcMCFUXb75/WzNFVOnNnBuee
j8F5JUw45xyLXvQnfxpYSt8LeZyGYLoOwJEZX3j+hFHl1GCqSrAY8EB5tkXFbCZi
L9JdJYOBEvnwFF4qxWl++2bmEOywnKeFea84JqbGr9BaVrDAOjAWMairZAU82xiI
EWdQRKkSyDzrl+TACz/ri4J87fzE8FhBpHLufSY3HCxizaayNawxItDg5CCW1ghn
i+bEaKq6djZn1CpSU0w0CTfA1g0D1DnErBS82znC8ciV1ZflAed8oADh3/+X64j8
HzPT1DRoDGnzp4pBwTiZcG7Jb605Mh8i1TY1p35riaUbIR4y84BVNroEUHtO5Cmh
U09efdYifsU9XM+u0OXK+SvrHqtDb6EVSx5x37qiV1SVxZ3JSsr9/uTjnBOrjH5W
nUjqCzQfJZYSNmvRT6aSGDzk5wON95nnv7hYE9HWER/Cw7/VwKdJmBwehIAZUaXG
NxJ7I/mVndGKV8ghoN119XVl7t2i56Ctj2pwu/UJH7lZB/Yfu9qZ5oKpku/Kbriy
pYqSdy8J/Q==
=0jJw
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.6-2020-03-13' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Just a single fix here, improving the RCU callback ordering from last
week. After a bit more perusing by Paul, he poked a hole in the
original"
* tag 'io_uring-5.6-2020-03-13' of git://git.kernel.dk/linux-block:
io_uring: ensure RCU callback ordering with rcu_barrier()
afs_put_addrlist() casts kfree() to rcu_callback_t. Apart from being wrong
in theory, this might also blow up when people start enforcing function
types via compiler instrumentation, and it means the rcu_head has to be
first in struct afs_addr_list.
Use kfree_rcu() instead, it's simpler and more correct.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the code for verifying the iclog state on a clean unmount into a
helper, and instead of checking the iclog state just rely on the shutdown
check as they are equivalent. Also remove the ifdef DEBUG as the
compiler is smart enough to eliminate the dead code for non-debug builds.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When the log is shut down all iclogs are in the XLOG_STATE_IOERROR state,
which means that xlog_state_want_sync and xlog_state_release_iclog are
no-ops. Remove the whole section of code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove the ignored return value from xfs_log_unmount_write, and also
remove a rather pointless assert on the return value from xfs_log_force.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
A shutdown log is a slow failure path. Add an unlikely annotation to
it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This is much less widely used than the bc_private union was, so this
is done as a single patch. The named union xfs_btree_cur_private
goes away and is embedded into the struct xfs_btree_cur_ag as an
anonymous union, and the code is modified via this script:
$ sed -i 's/priv\.\([abt|refc]\)/\1/g' fs/xfs/*[ch] fs/xfs/*/*[ch]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
we need to name the btree cursor private structures to be able
to pull them out of the deeply nested structure definition they are
in now.
Based on code extracted from a patchset by Darrick Wong.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Rename the union and it's internal structures to the new name and
remove the temporary defines that facilitated the change.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
BPRV is not longer appropriate because bc_private is going away.
Script:
$ sed -i 's/BTCUR_BPRV/BTCUR_BMBT/g' fs/xfs/*[ch] fs/xfs/*/*[ch]
With manual cleanup to the definitions in fs/xfs/libxfs/xfs_btree.h
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: change "BC_BT" to "BTCUR_BMBT", fix subject line typo]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
bc_private.b -> bc_ino conversion via script:
$ sed -i 's/bc_private\.b/bc_ino/g' fs/xfs/*[ch] fs/xfs/*/*[ch]
And then revert the change to the bc_ino #define in
fs/xfs/libxfs/xfs_btree.h manually.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: tweak the subject line slightly]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
bc_private.a -> bc_ag conversion via script:
`sed -i 's/bc_private\.a/bc_ag/g' fs/xfs/*[ch] fs/xfs/*/*[ch]`
And then revert the change to the bc_ag #define in
fs/xfs/libxfs/xfs_btree.h manually.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Just the defines of the new names - the conversion will be in
scripted commits after this.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: change "bc_bt" to "bc_ino"]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Commit 263dde869b ("xfs: cleanup xfs_dir2_block_getdents") introduced
a getdents regression, when it converted the pointer arithmetics to
offset calculations: offset is updated in the loop already for the next
iteration, but the updated offset value is used incorrectly in two
places, where we should have used the not-yet-updated value.
This caused for example "git clean -ffdx" failures to cleanup certain
directory structures when running in a container.
Fix the regression by making sure we use proper offset in the loop body.
Thanks to Christoph Hellwig for suggestion how to best fix the code.
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 263dde869b ("xfs: cleanup xfs_dir2_block_getdents")
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Lockdep reports "WARNING: lock held when returning to user space!" due to
async write holding freeze lock over the write. Apparently aio.c already
deals with this by lying to lockdep about the state of the lock.
Do the same here. No need to check for S_IFREG() here since these file ops
are regular-only.
Reported-by: syzbot+9331a354f4f624a52a55@syzkaller.appspotmail.com
Fixes: 2406a307ac ("ovl: implement async IO routines")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fix up two bugs in the coversion to xino_mode:
1. xino=off does not always end up in disabled mode
2. xino=auto on 32bit arch should end up in disabled mode
Take a proactive approach to disabling xino on 32bit kernel:
1. Disable XINO_AUTO config during build time
2. Disable xino with a warning on mount time
As a by product, xino=on on 32bit arch also ends up in disabled mode.
We never intended to enable xino on 32bit arch and this will make the
rest of the logic simpler.
Fixes: 0f831ec85e ("ovl: simplify ovl_same_sb() helper")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull vfs fixes from Al Viro:
"A couple of fixes for old crap in ->atomic_open() instances"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
cifs_atomic_open(): fix double-put on late allocation failure
gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache
several iterations of ->atomic_open() calling conventions ago, we
used to need fput() if ->atomic_open() failed at some point after
successful finish_open(). Now (since 2016) it's not needed -
struct file carries enough state to make fput() work regardless
of the point in struct file lifecycle and discarding it on
failure exits in open() got unified. Unfortunately, I'd missed
the fact that we had an instance of ->atomic_open() (cifs one)
that used to need that fput(), as well as the stale comment in
finish_open() demanding such late failure handling. Trivially
fixed...
Fixes: fe9ec8291f "do_last(): take fput() on error after opening to out:"
Cc: stable@kernel.org # v4.7+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
with the way fs/namei.c:do_last() had been done, ->atomic_open()
instances needed to recognize the case when existing file got
found with O_EXCL|O_CREAT, either by falling back to finish_no_open()
or failing themselves. gfs2 one didn't.
Fixes: 6d4ade986f (GFS2: Add atomic_open support)
Cc: stable@kernel.org # v3.11
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All callers are equivalent to
path->dentry = dentry;
path->mnt = nd->path.mnt;
err = handle_mounts(path, ...)
Pass dentry as an explicit argument, fill *path in handle_mounts()
itself.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and shift filling struct path to just before the call of
handle_mounts(). All callers of handle_mounts() are
immediately preceded by path->mnt = nd->path.mnt now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ovl_inode_lock() is interruptible. When inode_lock() in ovl_llseek()
was replaced with ovl_inode_lock(), we did not add a check for error.
Fix this by making ovl_inode_lock() uninterruptible and change the
existing call sites to use an _interruptible variant.
Reported-by: syzbot+66a9752fa927f745385e@syzkaller.appspotmail.com
Fixes: b1f9d3858f ("ovl: use ovl_inode_lock in ovl_llseek()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since snprintf() returns the would-be-output size instead of the
actual output size, the succeeding calls may go beyond the given
buffer limit. Fix it by replacing with scnprintf().
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In xchk_xattr_listent, we attempt to validate the extended attribute
hash structures by performing a attr lookup by (hashed) name. If the
lookup returns ENODATA, that means that the hash information is corrupt.
The _process_error functions don't catch this, so we have to add that
explicitly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In xchk_dir_actor, we attempt to validate the directory hash structures
by performing a directory entry lookup by (hashed) name. If the lookup
returns ENOENT, that means that the hash information is corrupt. The
_process_error functions don't catch this, so we have to add that
explicitly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Check the owner field of dir3 block headers. If it's corrupt, release
the buffer and return EFSCORRUPTED. All callers handle this properly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Check the owner field of dir3 data block headers. If it's corrupt,
release the buffer and return EFSCORRUPTED. All callers handle this
properly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Check the owner field of dir3 free block headers and reject the metadata
if there's something wrong with it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If we decide that a directory free block is corrupt, we must take care
not to leak a buffer pointer to the caller. After xfs_trans_brelse
returns, the buffer can be freed or reused, which means that we have to
set *bpp back to NULL.
Callers are supposed to notice the nonzero return value and not use the
buffer pointer, but we should code more defensively, even if all current
callers handle this situation correctly.
Fixes: de14c5f541 ("xfs: verify free block header fields")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xfs_verifier_error is supposed to be called on a corrupt metadata buffer
from within a buffer verifier function, whereas xfs_buf_mark_corrupt
is the function to be called when a piece of code has read a buffer and
catches something that a read verifier cannot. The first function sets
b_error anticipating that the low level buffer handling code will see
the nonzero b_error and clear XBF_DONE on the buffer, whereas the second
function does not.
Since xfs_dir3_free_header_check examines fields in the dir free block
header that require more context than can be provided to read verifiers,
we must call xfs_buf_mark_corrupt when it finds a problem.
Switching the calls has a secondary effect that we no longer corrupt the
buffer state by setting b_error and leaving XBF_DONE set. When /that/
happens, we'll trip over various state assertions (most commonly the
b_error check in xfs_buf_reverify) on a subsequent attempt to read the
buffer.
Fixes: bc1a09b8e3 ("xfs: refactor verifier callers to print address of failing check")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a xfs_failaddr_t parameter to this function so that callers can
potentially pass in (and therefore report) the exact point in the code
where we decided that a metadata buffer was corrupt. This enables us to
wire it up to checking functions that have to run outside of verifiers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a helper function to get rid of buffers that we have decided are
corrupt after the verifiers have run. This function is intended to
handle metadata checks that can't happen in the verifiers, such as
inter-block relationship checking. Note that we now mark the buffer
stale so that it will not end up on any LRU and will be purged on
release.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In e7ee96dfb8, we converted all ITER_ABORT users to use ECANCELED
instead, but we forgot to teach xfs_rmap_has_other_keys not to return
that magic value to callers. Fix it now by using ECANCELED both to
abort the iteration and to signal that we found another reverse mapping.
This enables us to drop the separate boolean flag.
Fixes: e7ee96dfb8 ("xfs: remove all *_ITER_ABORT values")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Log the corrupt buffer before we release the buffer.
Fixes: a5155b870d ("xfs: always log corruption errors")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Deduplicate cancellation parts, as many of them looks the same, as do
e.g.
- io_wqe_cancel_cb_work() and io_wqe_cancel_work()
- io_wq_worker_cancel() and io_work_cancel()
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix a bug where if userspace is writing to encrypted files while the
FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (introduced in v5.4) is running,
dirty inodes could be evicted, causing writes could be lost or the
filesystem to hang due to a use-after-free. This was encountered during
real-world use, not just theoretical.
Tested with the existing fscrypt xfstests, and with a new xfstest I
wrote to reproduce this bug. This fix does expose an existing bug with
'-o lazytime' that Ted is working on fixing, but this fix is more
critical and needed anyway regardless of the lazytime fix.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXmk8HxQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK4YiAQC1RZyH4/mZ890Or6s8SzCgJTVmiLk9
ZTO/56XmLte6LAD+IBAExqDkkybmAF0rQ4kY1oL75f/e/nEs+50TXra9NQc=
=s2KD
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt fix from Eric Biggers:
"Fix a bug where if userspace is writing to encrypted files while the
FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (introduced in v5.4) is running,
dirty inodes could be evicted, causing writes could be lost or the
filesystem to hang due to a use-after-free. This was encountered
during real-world use, not just theoretical.
Tested with the existing fscrypt xfstests, and with a new xfstest I
wrote to reproduce this bug. This fix does expose an existing bug with
'-o lazytime' that Ted is working on fixing, but this fix is more
critical and needed anyway regardless of the lazytime fix"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: don't evict dirty inodes after removing key
Ensure we keep the truncated value, if we did truncate it. If not, we
might read/write more than the registered buffer size.
Also for retry, ensure that we return the truncated mapped value for
the vectorized versions of the read/write commands.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just dereference bp->b_addr directly and make the code a little
simpler and more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Just dereference bp->b_addr directly and make the code a little
simpler and more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Just dereference bp->b_addr directly and make the code a little
simpler and more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is just a single user left, so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
struct xfs_agfl is a header in front of the AGFL entries that exists
for CRC enabled file systems. For not CRC enabled file systems the AGFL
is simply a list of agbno. Make the CRC case similar to that by just
using the list behind the new header. This indirectly solves a problem
with modern gcc versions that warn about taking addresses of packed
structures (and we have to pack the AGFL given that gcc rounds up
structure sizes). Also replace the helper macro to get from a buffer
with an inline function in xfs_alloc.h to make the code easier to
read.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set
during do_exit(). That can confuse things. In particular, if BSD
process accounting is enabled, then do_exit() writes data to an
accounting file. If that file has FS_SYNC_FL set, then this write
occurs synchronously and can misbehave if PF_MEMALLOC is set.
For example, if the accounting file is located on an XFS filesystem,
then a WARN_ON_ONCE() in iomap_do_writepage() is triggered and the data
doesn't get written when it should. Or if the accounting file is
located on an ext4 filesystem without a journal, then a WARN_ON_ONCE()
in ext4_write_inode() is triggered and the inode doesn't get written.
Fix this in xfsaild() by using the helper functions to save and restore
PF_MEMALLOC.
This can be reproduced as follows in the kvm-xfstests test appliance
modified to add the 'acct' Debian package, and with kvm-xfstests's
recommended kconfig modified to add CONFIG_BSD_PROCESS_ACCT=y:
mkfs.xfs -f /dev/vdb
mount /vdb
touch /vdb/file
chattr +S /vdb/file
accton /vdb/file
mkfs.xfs -f /dev/vdc
mount /vdc
umount /vdc
It causes:
WARNING: CPU: 1 PID: 336 at fs/iomap/buffered-io.c:1534
CPU: 1 PID: 336 Comm: xfsaild/vdc Not tainted 5.6.0-rc5 #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191223_100556-anatol 04/01/2014
RIP: 0010:iomap_do_writepage+0x16b/0x1f0 fs/iomap/buffered-io.c:1534
[...]
Call Trace:
write_cache_pages+0x189/0x4d0 mm/page-writeback.c:2238
iomap_writepages+0x1c/0x33 fs/iomap/buffered-io.c:1642
xfs_vm_writepages+0x65/0x90 fs/xfs/xfs_aops.c:578
do_writepages+0x41/0xe0 mm/page-writeback.c:2344
__filemap_fdatawrite_range+0xd2/0x120 mm/filemap.c:421
file_write_and_wait_range+0x71/0xc0 mm/filemap.c:760
xfs_file_fsync+0x7a/0x2b0 fs/xfs/xfs_file.c:114
generic_write_sync include/linux/fs.h:2867 [inline]
xfs_file_buffered_aio_write+0x379/0x3b0 fs/xfs/xfs_file.c:691
call_write_iter include/linux/fs.h:1901 [inline]
new_sync_write+0x130/0x1d0 fs/read_write.c:483
__kernel_write+0x54/0xe0 fs/read_write.c:515
do_acct_process+0x122/0x170 kernel/acct.c:522
slow_acct_process kernel/acct.c:581 [inline]
acct_process+0x1d4/0x27c kernel/acct.c:607
do_exit+0x83d/0xbc0 kernel/exit.c:791
kthread+0xf1/0x140 kernel/kthread.c:257
ret_from_fork+0x27/0x50 arch/x86/entry/entry_64.S:352
This bug was originally reported by syzbot at
https://lore.kernel.org/r/0000000000000e7156059f751d7b@google.com.
Reported-by: syzbot+1f9dc49e8de2582d90c2@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If regular inode has no compressed cluster, allow using 'chattr -c'
to remove its compress flag, recovering it to a non-compressed file.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Compressed cluster can be generated during dirty data writeback,
if there is dirty pages on compressed inode, it needs to disable
converting compressed inode to non-compressed one.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
stat_inc_compr_inode() needs to check FI_COMPRESSED_FILE flag, so
in f2fs_disable_compressed_file(), we should call stat_dec_compr_inode()
before clearing FI_COMPRESSED_FILE flag.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need
to do io completion events polling again, they can rely on io_sq_thread to do
polling work, which can reduce cpu usage and uring_lock contention.
I modify fio io_uring engine codes a bit to evaluate the performance:
static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
continue;
}
- if (!o->sqpoll_thread) {
+ if (o->sqpoll_thread && o->hipri) {
r = io_uring_enter(ld, 0, actual_min,
IORING_ENTER_GETEVENTS);
if (r < 0) {
and use "fio -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread
-rw=read -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k
-size=10G -numjobs=1 -time_based -runtime=120"
original codes
--------------------------------------------------------------------
iodepth | 4 | 8 | 16 | 32 | 64
bw | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
fio cpu usage | 100% | 100% | 100% | 100% | 100%
--------------------------------------------------------------------
with patch
--------------------------------------------------------------------
iodepth | 4 | 8 | 16 | 32 | 64
bw | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
fio cpu usage | 63.8% | 74.4%% | 81.1% | 83.7% | 82.4%
--------------------------------------------------------------------
bw improve | 5.5% | 13.2% | 12.3% | 9.8% | 11.5%
--------------------------------------------------------------------
From above test results, we can see that bw has above 5.5%~13%
improvement, and fio process's cpu usage also drops much. Note this
won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL
are both enabled, in this case, io_sq_thread always has 100% cpu usage.
I think this patch will be friendly to applications which will often use
io_uring_wait_cqe() or similar from liburing.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This fixes the incorrect failure when enabling project quota on casefold-enabled
file.
Cc: Daniel Rosenberg <drosen@google.com>
Cc: kernel-team@android.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_vm_page_mkwrite(), if inode is compress one, and current mmapped
page locates in compressed cluster, we have to call f2fs_get_dnode_of_data()
to get its physical block address before f2fs_wait_on_block_writeback().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Lack of maintenance on comments may mislead developers, fix them.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_inode.xattr_ver field was gone after commit d260081ccf
("f2fs: change recovery policy of xattr node block"), remove i_sem
lock coverage in f2fs_setxattr()
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This change solves below hangtask issue:
INFO: task kworker/u16:1:58 blocked for more than 122 seconds.
Not tainted 5.6.0-rc2-00590-g9983bdae4974e #11
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:1 D 0 58 2 0x00000000
Workqueue: writeback wb_workfn (flush-179:0)
Backtrace:
(__schedule) from [<c0913234>] (schedule+0x78/0xf4)
(schedule) from [<c017ec74>] (rwsem_down_write_slowpath+0x24c/0x4c0)
(rwsem_down_write_slowpath) from [<c0915f2c>] (down_write+0x6c/0x70)
(down_write) from [<c0435b80>] (f2fs_write_single_data_page+0x608/0x7ac)
(f2fs_write_single_data_page) from [<c0435fd8>] (f2fs_write_cache_pages+0x2b4/0x7c4)
(f2fs_write_cache_pages) from [<c043682c>] (f2fs_write_data_pages+0x344/0x35c)
(f2fs_write_data_pages) from [<c0267ee8>] (do_writepages+0x3c/0xd4)
(do_writepages) from [<c0310cbc>] (__writeback_single_inode+0x44/0x454)
(__writeback_single_inode) from [<c03112d0>] (writeback_sb_inodes+0x204/0x4b0)
(writeback_sb_inodes) from [<c03115cc>] (__writeback_inodes_wb+0x50/0xe4)
(__writeback_inodes_wb) from [<c03118f4>] (wb_writeback+0x294/0x338)
(wb_writeback) from [<c0312dac>] (wb_workfn+0x35c/0x54c)
(wb_workfn) from [<c014f2b8>] (process_one_work+0x214/0x544)
(process_one_work) from [<c014f634>] (worker_thread+0x4c/0x574)
(worker_thread) from [<c01564fc>] (kthread+0x144/0x170)
(kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c)
Reported-and-tested-by: Ondřej Jirman <megi@xff.cz>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
inode.i_blocks counts based on 512byte sector, we need to convert
to 4kb sized block count before comparing to i_compr_blocks.
In addition, add to print message when sanity check on inode
compression configs failed.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If CONFIG_NET is not set, gcc warns:
fs/io_uring.c:3110:12: warning: io_setup_async_msg defined but not used [-Wunused-function]
static int io_setup_async_msg(struct io_kiocb *req,
^~~~~~~~~~~~~~~~~~
There are many funcions wraped by CONFIG_NET, move them
together to simplify code, also fix this warning.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Minor tweaks.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Not easy to tell if we're going over the size of bits we can shove
in req->flags, so add an end-of-bits marker and a BUILD_BUG_ON()
check for it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers
is to trigger IO on them. The usual case of shrinking a buffer pool
would be to just not replenish the buffers when IO completes, and
instead just free it. But it may be nice to have a way to manually
remove a number of buffers from a given group, and
IORING_OP_REMOVE_BUFFERS provides that functionality.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for the vectored read. This is limited to supporting
just 1 segment in the iov, and is provided just for convenience for
applications that use IORING_OP_READV already.
The iov helpers will be used for IORING_OP_RECVMSG as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a server process has tons of pending socket connections, generally
it uses epoll to wait for activity. When the socket is ready for reading
(or writing), the task can select a buffer and issue a recv/send on the
given fd.
Now that we have fast (non-async thread) support, a task can have tons
of pending reads or writes pending. But that means they need buffers to
back that data, and if the number of connections is high enough, having
them preallocated for all possible connections is unfeasible.
With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
use for any request. The request then sets IOSQE_BUFFER_SELECT in the
sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
a free buffer from the specified group is selected. If none are
available, the request is terminated with -ENOBUFS. If successful, the
CQE on completion will contain the buffer ID chosen in the cqe->flags
member, encoded as:
(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
Once a buffer has been consumed by a request, it is no longer available
and must be registered again with IORING_OP_PROVIDE_BUFFERS.
Requests need to support this feature. For now, IORING_OP_READ and
IORING_OP_RECV support it. This is checked on SQE submission, a CQE with
res == -EOPNOTSUPP will be posted if attempted on unsupported requests.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
support passing in an addr/len that is associated with a buffer ID and
buffer group ID. The group ID is used to index and lookup the buffers,
while the buffer ID can be used to notify the application which buffer
in the group was used. The addr passed in is the starting buffer address,
and length is each buffer length. A number of buffers to add with can be
specified, in which case addr is incremented by length for each addition,
and each buffer increments the buffer ID specified.
No validation is done of the buffer ID. If the application provides
buffers within the same group with identical buffer IDs, then it'll have
a hard time telling which buffer ID was used. The only restriction is
that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
maximum ID that can be used.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200309202327.GA8813@embeddedor
Signed-off-by: Kees Cook <keescook@chromium.org>
After more careful studying, Paul informs me that we cannot rely on
ordering of RCU callbacks in the way that the the tagged commit did.
The current construct looks like this:
void C(struct rcu_head *rhp)
{
do_something(rhp);
call_rcu(&p->rh, B);
}
call_rcu(&p->rh, A);
call_rcu(&p->rh, C);
and we're relying on ordering between A and B, which isn't guaranteed.
Make this explicit instead, and have a work item issue the rcu_barrier()
to ensure that A has run before we manually execute B.
While thorough testing never showed this issue, it's dependent on the
per-cpu load in terms of RCU callbacks. The updated method simplifies
the code as well, and eliminates the need to maintain an rcu_head in
the fileset data.
Fixes: c1e2148f8e ("io_uring: free fixed_file_data after RCU grace period")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Here are 4 small driver core / debugfs patches for 5.6-rc3
They are:
- debugfs api cleanup now that all callers for
debugfs_create_regset32() have been fixed up. This was
waiting until after the -rc1 merge as these fixes came in
through different trees
- driver core sync state fixes based on reports of minor issues
found in the feature
All of these have been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXmS2Lg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylvNgCfbnALILZh05QJPCfZv/seNFcFYLIAnRNAzxAU
mTPqUqTp5+WMXSzGigMa
=NyIX
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and debugfs fixes from Greg KH:
"Here are four small driver core / debugfs patches for 5.6-rc3:
- debugfs api cleanup now that all debugfs_create_regset32() callers
have been fixed up. This was waiting until after the -rc1 merge as
these fixes came in through different trees
- driver core sync state fixes based on reports of minor issues found
in the feature
All of these have been in linux-next with no reported issues"
* tag 'driver-core-5.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
driver core: Skip unnecessary work when device doesn't have sync_state()
driver core: Add dev_has_sync_state()
driver core: Call sync_state() even if supplier has no consumers
debugfs: remove return value of debugfs_create_regset32()
After FS_IOC_REMOVE_ENCRYPTION_KEY removes a key, it syncs the
filesystem and tries to get and put all inodes that were unlocked by the
key so that unused inodes get evicted via fscrypt_drop_inode().
Normally, the inodes are all clean due to the sync.
However, after the filesystem is sync'ed, userspace can modify and close
one of the files. (Userspace is *supposed* to close the files before
removing the key. But it doesn't always happen, and the kernel can't
assume it.) This causes the inode to be dirtied and have i_count == 0.
Then, fscrypt_drop_inode() failed to consider this case and indicated
that the inode can be dropped, causing the write to be lost.
On f2fs, other problems such as a filesystem freeze could occur due to
the inode being freed while still on f2fs's dirty inode list.
Fix this bug by making fscrypt_drop_inode() only drop clean inodes.
I've written an xfstest which detects this bug on ext4, f2fs, and ubifs.
Fixes: b1c0ec3599 ("fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl")
Cc: <stable@vger.kernel.org> # v5.4+
Link: https://lore.kernel.org/r/20200305084138.653498-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5j8gkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphtKEADIid1/6xG6EO965jKjR1G3e7pnA7M6Ek01
T0svGLMSYtPV9aRERiWDWdyCE01C0kjwWvmpiTCmWr0sm3bJYBB+NaDXkCtwa1IW
uFPMNDpeCQijQI1sImbeP2yN2ufGY5r7Y9RCMU7+iKgcao3pFaR136y7UfBHykJ8
Iyp/sir5FRHlEzrGyoXOe1j131BZrDGCa+cuPyAOlr75abN+TDazJAv05MGBQVfI
wc4hOHy0+D07juXP3ZD8UptoLTXPNk+tcAIqAEIaEuPxmRxq1lOfnM506rWyp2sy
XZrQhUblkL8nqfqXASYGQcY/DaNxhEvbzn86MaCKm4qf12uCiP0/DS3hFY/32lAt
VX9eOYenX1zTRLQoRNwvVHT4+m+Splp7IpICFK9bSGk1jp3rbclSXmWITqSWkOgi
C45wAAmWw4lzrbxcEDfBAns/lcwsrPwHn12WdM9ofk2I1jTDubO47c/oFEzEn0w/
IixdKeMVnifNoytP9XFcUcotNzc/NPiPvMNgCkNm59kUHfXMXx6HHyTLO/JUzjZ9
B/s2LkC23EksjEGC3gQiQxighyvNCsN0Wv9L7InaCjJY5IpcOoL495fnPCPfaOaW
7c6xrkRxvHN8bSsKmESywcFjtBv23OtlTfbma7hjdByaGkW/M62qdT6DOiQcoiX/
Ts7YOMtPdQ==
=aukx
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.6-2020-03-07' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Here are a few io_uring fixes that should go into this release. This
contains:
- Removal of (now) unused io_wq_flush() and associated flag (Pavel)
- Fix cancelation lockup with linked timeouts (Pavel)
- Fix for potential use-after-free when freeing percpu ref for fixed
file sets
- io-wq cancelation fixups (Pavel)"
* tag 'io_uring-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
io_uring: fix lockup with timeouts
io_uring: free fixed_file_data after RCU grace period
io-wq: remove io_wq_flush and IO_WQ_WORK_INTERNAL
io-wq: fix IO_WQ_WORK_NO_CANCEL cancellation
There is a recipe to deadlock the kernel: submit a timeout sqe with a
linked_timeout (e.g. test_single_link_timeout_ception() from liburing),
and SIGKILL the process.
Then, io_kill_timeouts() takes @ctx->completion_lock, but the timeout
isn't flagged with REQ_F_COMP_LOCKED, and will try to double grab it
during io_put_free() to cancel the linked timeout. Probably, the same
can happen with another io_kill_timeout() call site, that is
io_commit_cqring().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5iXe8ACgkQxWXV+ddt
WDvWGg/+LFP+Y8Qz6xHTl3vXuGJKjCr7X/MIi69r2N0JFoCUeXyOdxeSlOuNCfhb
HiLZzfA5TYoptsdLJAXQLy7nPKFCQcc+J19Mbt2+aebpdGqfgN+YZEGkltfKL8Ao
xjOGu5HROFFpNTtnwa1dYOQkyVuZ8oafuJxwVJ8T28fxepRvBbi5jmy3lb1ypL3W
NoIPBe+67g5z/W0ATFmBMF7cCbvS5gsEGWKpbbjh7r8ZHJkhUaxVU7YdxPqlXrAO
ejZfiJUwi8rTGm0zd8A5TX/wsxSeBEXolvh91k5tatTljjzROHa028KRg2voUZIW
C5/7X+Z2C3gzuT0o7TGLBOR6CkVhkSutDV8/QIE6hDjZ/aCMNi0mIFco1hG8jjd1
jQfjemjj7PWuVEnZ6EuVSoHSXjZvBvX66of40YhTQEtSaJpcZU4jP26+8cXENN6+
6WbWcQpEQbT0cp0YKWhWvAIwGMf0jmWESISeFMEaF0eQd8BtzrH1qYcs3JTmXvHC
XmC47hoEJLhjQkAgQ4oNa5PZQzR1wEfW/4FPdqlADOR2frE1wDiKdrpN/dkAYHdQ
edNlo9u0+bRWCP40p04i2IUX/aUAc+me9QxiZwxT3Fw0g5QBSE2035Ly4spvT8NZ
gIvwnq1KGxmtrJSo5Lpkv4bjHYbByYMOiGJUMOTCIEdqajFI224=
=06pr
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One fixup for DIO when in use with the new checksums, a missed case
where the checksum size was still assuming u32"
* tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix RAID direct I/O reads with alternate csums
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAl5igvYTHGpsYXl0b25A
a2VybmVsLm9yZwAKCRAADmhBGVaCFdbFD/9ZP3XDY+ngnN5nsSYS4QuzudlncnZL
ceRLD5YykNPLOAesr7DWI8EDky+IFL5w4wRHVxAbOeHpj3haySLefV9vsM/G6sm4
CiHdikx7uls184r5WYK3jfB19UF3UIePUjTnAtxOpemjkLv58Z15nPNGGQv9lkFJ
dJbCk1kdwaEA3LYEyXiGC/ianaxLtiqBy+C0d581OZn3ty551c8vmF0Ziz5tcuot
aObPE3f8sYNxDuTDZcseRxvXUfMS1Qj/tMxeDDIXryX71zIsFbQ6PMPUNHGHGit/
uoeuprDy90mLqGuEEuUfVaXjn8zEPFlW8IHy1OJ4fFNQ0X/HYa2/CFTA2BiVrpfM
1lVYKWuMz+mCq9i8wzF/+ikQ9QVMG2cSb0i4kyuAb+RBP+PDjNTbTLjFeEIJVz6O
yN9MUXWH5XS8liFq2F5VbITwpSJEk7vxiTGDT1zU38HXFdrxL0FRC60TKhkplLzO
9xsj9jUBV/sD5ohwq9Ga+kcXOB/KA/9iW3TMfBApq7oWIxaEfW7rQ6A/O5tuF/hX
q2mwrRoEx6tpCy77KFBLT89iF0gzV3xzadwWcnpDkFC7x2OkMmZPPr2nWeJS6qbN
hPOD1fiWW/NXMXs7foQ9HZ7HdbQMDI7olnf1sjkh4pq2MKDWsJLvNB4fYwZUxhpn
8K4B+9yfIofvpg==
=H/ky
-----END PGP SIGNATURE-----
Merge tag 'filelock-v5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking fixes from Jeff Layton:
"Just a couple of late-breaking patches for the file locking code. The
second patch (from yangerkun) fixes a rather nasty looking potential
use-after-free that should go to stable.
The other patch could technically wait for 5.7, but it's fairly
innocuous so I figured we might as well take it"
* tag 'filelock-v5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: fix a potential use-after-free problem when wakeup a waiter
fcntl: Distribute switch variables for initialization
The percpu refcount protects this structure, and we can have an atomic
switch in progress when exiting. This makes it unsafe to just free the
struct normally, and can trigger the following KASAN warning:
BUG: KASAN: use-after-free in percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
Read of size 1 at addr ffff888181a19a30 by task swapper/0/0
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-rc4+ #5747
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
<IRQ>
dump_stack+0x76/0xa0
print_address_description.constprop.0+0x3b/0x60
? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
__kasan_report.cold+0x1a/0x3d
? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
rcu_core+0x370/0x830
? percpu_ref_exit+0x50/0x50
? rcu_note_context_switch+0x7b0/0x7b0
? run_rebalance_domains+0x11d/0x140
__do_softirq+0x10a/0x3e9
irq_exit+0xd5/0xe0
smp_apic_timer_interrupt+0x86/0x200
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:default_idle+0x26/0x1f0
Fix this by punting the final exit and free of the struct to RCU, then
we know that it's safe to do so. Jann suggested the approach of using a
double rcu callback to achieve this. It's important that we do a nested
call_rcu() callback, as otherwise the free could be ordered before the
atomic switch, even if the latter was already queued.
Reported-by: syzbot+e017e49c39ab484ac87a@syzkaller.appspotmail.com
Suggested-by: Jann Horn <jannh@google.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
'16306a61d3b7 ("fs/locks: always delete_block after waiting.")' add the
logic to check waiter->fl_blocker without blocked_lock_lock. And it will
trigger a UAF when we try to wakeup some waiter:
Thread 1 has create a write flock a on file, and now thread 2 try to
unlock and delete flock a, thread 3 try to add flock b on the same file.
Thread2 Thread3
flock syscall(create flock b)
...flock_lock_inode_wait
flock_lock_inode(will insert
our fl_blocked_member list
to flock a's fl_blocked_requests)
sleep
flock syscall(unlock)
...flock_lock_inode_wait
locks_delete_lock_ctx
...__locks_wake_up_blocks
__locks_delete_blocks(
b->fl_blocker = NULL)
...
break by a signal
locks_delete_block
b->fl_blocker == NULL &&
list_empty(&b->fl_blocked_requests)
success, return directly
locks_free_lock b
wake_up(&b->fl_waiter)
trigger UAF
Fix it by remove this logic, and this patch may also fix CVE-2019-19769.
Cc: stable@vger.kernel.org
Fixes: 16306a61d3 ("fs/locks: always delete_block after waiting.")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Before this patch, if gfs2_ail1_flush gets an error from function
gfs2_ail1_start_one (which comes indirectly from generic_writepages)
the file system is withdrawn, but without any explanation why.
This patch adds an error message if gfs2_ail1_flush gets an error
from gfs2_ail1_start_one.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When get an error in the middle of reading an inode, some fields in the
inode might be still not initialized. And then the evict_inode path may
access those fields via iput().
To fix, this makes sure that inode fields are initialized.
Reported-by: syzbot+9d82b8de2992579da5d0@syzkaller.appspotmail.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/871rqnreqx.fsf@mail.parknet.co.jp
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per boot, machine wide, unique inode identifier.
This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
KCSAN find inode->i_disksize could be accessed concurrently.
BUG: KCSAN: data-race in ext4_mark_iloc_dirty / ext4_write_end
write (marked) to 0xffff8b8932f40090 of 8 bytes by task 66792 on cpu 0:
ext4_write_end+0x53f/0x5b0
ext4_da_write_end+0x237/0x510
generic_perform_write+0x1c4/0x2a0
ext4_buffered_write_iter+0x13a/0x210
ext4_file_write_iter+0xe2/0x9b0
new_sync_write+0x29c/0x3a0
__vfs_write+0x92/0xa0
vfs_write+0xfc/0x2a0
ksys_write+0xe8/0x140
__x64_sys_write+0x4c/0x60
do_syscall_64+0x8a/0x2a0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
read to 0xffff8b8932f40090 of 8 bytes by task 14414 on cpu 1:
ext4_mark_iloc_dirty+0x716/0x1190
ext4_mark_inode_dirty+0xc9/0x360
ext4_convert_unwritten_extents+0x1bc/0x2a0
ext4_convert_unwritten_io_end_vec+0xc5/0x150
ext4_put_io_end+0x82/0x130
ext4_writepages+0xae7/0x16f0
do_writepages+0x64/0x120
__writeback_single_inode+0x7d/0x650
writeback_sb_inodes+0x3a4/0x860
__writeback_inodes_wb+0xc4/0x150
wb_writeback+0x43f/0x510
wb_workfn+0x3b2/0x8a0
process_one_work+0x39b/0x7e0
worker_thread+0x88/0x650
kthread+0x1d4/0x1f0
ret_from_fork+0x35/0x40
The plain read is outside of inode->i_data_sem critical section
which results in a data race. Fix it by adding READ_ONCE().
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Link: https://lore.kernel.org/r/1582556566-3909-1-git-send-email-hqjagain@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Al Viro recently reworked the way file system parameters are handled
Update super.c to work with it in linux-next 20200203.
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds the Kconfig and Makefile for exfat.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds the implementation of nls operations for exfat.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds the implementation of misc operations for exfat.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds the implementation of exfat cache.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds the implementation of bitmap operations for exfat.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds the implementation of fat entry operations for exfat.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds the implementation of file operations for exfat.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds the implementation of directory operations for exfat.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds the implementation of inode operations for exfat.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds the implementation of superblock operations for exfat.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds in-memory and on-disk structures and headers.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Although convert_initialized_extent() can potentially return an error
code with a negative value, its returned value is assigned to an
unsigned variable containing a block count in ext4_ext_map_blocks() and
then returned to that function's caller. The code currently works,
though the way this happens is obscure. The code would be more
readable if it followed the error handling convention used elsewhere
in ext4_ext_map_blocks().
This patch does not address any known test failure or bug report - it's
simply a cleanup. It also addresses a nearby coding standard issue.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200218202656.21561-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Improve comments in jbd2_journal_commit_transaction() to describe why
we don't need to clear the buffer_mapped bit for freeing file mapping
buffers whose page mapping is NULL.
Link: https://lore.kernel.org/r/20200217112706.20085-1-yi.zhang@huawei.com
Fixes: c96dceeabf ("jbd2: do not clear the BH_Mapped flag when forgetting a metadata buffer")
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200213160648.GA7054@embeddedor
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
list_for_each_entry_rcu() has built-in RCU and lock checking.
Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Link: https://lore.kernel.org/r/20200213152558.7070-1-madhuparnabhowmik10@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There are no forward references for ext4_split_extent() in extents.c,
so delete its unnecessary declaration.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200212162141.22381-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The EXT4_EOFBLOCKS_FL inode flag is used to indicate whether a file
contains unwritten blocks past i_size. It's set when ext4_fallocate
is called with the KEEP_SIZE flag to extend a file with an unwritten
extent. However, this flag hasn't been useful functionally since
March, 2012, when a decision was made to remove it from ext4.
All traces of EXT4_EOFBLOCKS_FL were removed from e2fsprogs version
1.42.2 by commit 010dc7b90d97 ("e2fsck: remove EXT4_EOFBLOCKS_FL flag
handling") at that time. Now that enough time has passed to make
e2fsprogs versions containing this modification common, this patch now
removes the code associated with EXT4_EOFBLOCKS_FL from the kernel as
well.
This change has two implications. First, because pre-1.42.2 e2fsck
versions only look for a problem if EXT4_EOFBLOCKS_FL is set, and
because that bit will never be set by newer kernels containing this
patch, old versions of e2fsck won't have a compatibility problem with
files created by newer kernels.
Second, newer kernels will not clear EXT4_EOFBLOCKS_FL inode flag bits
belonging to a file written by an older kernel. If set, it will remain
in that state until the file is deleted. Because e2fsck versions since
1.42.2 don't check the flag at all, no adverse effect is expected.
However, pre-1.42.2 e2fsck versions that do check the flag may report
that it is set when it ought not to be after a file has been truncated
or had its unwritten blocks written. In this case, the old version of
e2fsck will offer to clear the flag. No adverse effect would then
occur whether the user chooses to clear the flag or not.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200211210216.24960-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since commit "b1b4705d54ab ext4: introduce direct I/O read using
iomap infrastructure", we can easily make ext4 support iopoll
method, just use iomap_dio_iopoll().
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20200207120758.2411-1-xiaoguang.wang@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Writeback errors can leave buffer in not up-to-date state when there
are errors during background writes. Force buffer up-to-date while
marking it dirty.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20191224190940.157952-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The 'pgoff' displayed by the tracepoints wasn't a pgoff at all; it
was a byte offset from the start of the file. We already emit that in
the form of the 'offset', so we can just remove pgoff. That means we
can remove 'page' as an argument to the tracepoint, and rename this
type of tracepoint from being a page class to being a range class.
Fixes: 0b1b213fcf ("xfs: event tracing support")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This just prepares the ring for having lists of buffers associated with
it, that the application can provide for SQEs to consume instead of
providing their own.
The buffers are organized by group ID.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
First it changes io-wq interfaces. It replaces {get,put}_work() with
free_work(), which guaranteed to be called exactly once. It also enforces
free_work() callback to be non-NULL.
io_uring follows the changes and instead of putting a submission reference
in io_put_req_async_completion(), it will be done in io_free_work(). As
removes io_get_work() with corresponding refcount_inc(), the ref balance
is maintained.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When executing non-linked hashed work, io_worker_handle_work()
will lock-unlock wqe->lock to update hash, and then immediately
lock-unlock to get next work. Optimise this case and do
lock/unlock only once.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are 2 optimisations:
- Now, io_worker_handler_work() do io_assign_current_work() twice per
request, and each one adds lock/unlock(worker->lock) pair. The first is
to reset worker->cur_work to NULL, and the second to set a real work
shortly after. If there is a dependant work, set it immediately, that
effectively removes the extra NULL'ing.
- And there is no use in taking wqe->lock for linked works, as they are
not hashed now. Optimise it out.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is a preparation patch, it adds some helpers and makes
the next patches cleaner.
- extract io_impersonate_work() and io_assign_current_work()
- replace @next label with nested do-while
- move put_work() right after NULL'ing cur_work.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If after dropping the submission reference req->refs == 1, the request
is done, because this one is for io_put_work() and will be dropped
synchronously shortly after. In this case it's safe to steal a next
work from the request.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There will be no use for @nxt in the handlers, and it's doesn't work
anyway, so purge it
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The rule is simple, any async handler gets a submission ref and should
put it at the end. Make them all follow it, and so more consistent.
This is a preparation patch, and as io_wq_assign_next() currently won't
ever work, this doesn't care to use io_put_req_find_next() instead of
io_put_req().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
refcount_inc_not_zero() -> refcount_inc() fix.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl5dnvEACgkQiiy9cAdy
T1FaWAv/XnyYfYh6H4fhtgtfNxW9xt9mkHo/AohHcf2rk2erqjVz0lHVe7SuS9C5
EpDYnijZKa//aiIV6VzDymPaMrXQZ+oCAExAzLPmWZnLeZ65Q02K2P1F3KvURdue
4nLjuOyzyG4YYkoBi4wKneu1Ji377m9L6BpSfM+MzPScCOl8OV/vv/nBRY1N6gIY
Rreq5iipRaDhifsaOgiA501sUu7mvpPEHNpluCtFmY4iTHQzYqjWZ5ZGXr2xz63n
5VV8KWWn/p3nhJGt7L/1aynws59AdEd5GNZ5FbDQHokx9n3MMnyl4QGDzUehnhlY
Ym6n50QA5QMn9I9NLg8I2aD6z4vNIj9kZxersoHduf4UsA9CyPaucUIyV81mt683
AZIqtz8H21fgJXOQ3nv4uNc8Yyt1SGQfFDo1EfphwLl6LaE8rx3CFEnVoNLM+jqb
nyRB/NxLtDWVQhYM8Bg/TP7iMqknHtarfZirv48LFdXLlhb83+qpSSHy0zVy9vli
y/0B7rEI
=zLW4
-----END PGP SIGNATURE-----
Merge tag '5.6-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Five small cifs/smb3 fixes, two for stable (one for a reconnect
problem and the other fixes a use case when renaming an open file)"
* tag '5.6-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Use #define in cifs_dbg
cifs: fix rename() by ensuring source handle opened with DELETE bit
cifs: add missing mount option to /proc/mounts
cifs: fix potential mismatch of UNC paths
cifs: don't leak -EAGAIN for stat() during reconnect
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.
To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.
fs/fcntl.c: In function ‘send_sigio_to_task’:
fs/fcntl.c:738:20: warning: statement will never be executed [-Wswitch-unreachable]
738 | kernel_siginfo_t si;
| ^~
[1] https://bugs.llvm.org/show_bug.cgi?id=44916
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
As Lasse pointed out, "Looking at fs/erofs/decompress.c,
the return value from LZ4_decompress_safe_partial is only
checked for negative value to catch errors. ... So if
I understood it correctly, if there is bad data whose
uncompressed size is much less than it should be, it can
leave part of the output buffer untouched and expose the
previous data as the file content. "
Let's fix it now.
Cc: Lasse Collin <lasse.collin@tukaani.org>
Fixes: 7fc45dbc93 ("staging: erofs: introduce generic decompression backend")
[ Gao Xiang: v5.3+, I will manually backport this to stable later. ]
Link: https://lore.kernel.org/r/20200226081008.86348-3-gaoxiang25@huawei.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
As Lasse pointed out, "EROFS uses LZ4_decompress_safe_partial
for both partial and full blocks. Thus when it is decoding a
full block, it doesn't know if the LZ4 decoder actually decoded
all the input. The real uncompressed size could be bigger than
the value stored in the file system metadata.
Using LZ4_decompress_safe instead of _safe_partial when
decompressing a full block would help to detect errors."
So it's reasonable to use _safe in case of potential corrupted
images and it might have some speed gain as well although
I didn't observe much difference.
Note that legacy compressor (< 5.3, no LZ4_0PADDING) could
encode extra data in a pcluster, which is excluded as well.
Cc: Lasse Collin <lasse.collin@tukaani.org>
Fixes: 0ffd71bcc3 ("staging: erofs: introduce LZ4 decompression inplace")
[ Gao Xiang: v5.3+, I will manually backport this to stable later. ]
Link: https://lore.kernel.org/r/20200226081008.86348-2-gaoxiang25@huawei.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
XArray has friendly APIs and it will replace the old radix
tree in the near future.
This convert makes use of __xa_cmpxchg when inserting on
a just inserted item by other thread. In detail, instead
of totally looking up again as what we did for the old
radix tree, it will try to legitimize the current in-tree
item in the XArray therefore more effective.
In addition, naming is rather a challenge for non-English
speaker like me. The basic idea of workstn is to provide
a runtime sparse array with items arranged in the physical
block number order. Such items (was called workgroup) can be
used to record compress clusters or for later new features.
However, both workgroup and workstn seem not good names from
whatever point of view, so I'd like to rename them as pslot
and managed_pslots to stand for physical slots. This patch
handles the second as a part of the radix tree convert.
Cc: Matthew Wilcox <willy@infradead.org>
Link: https://lore.kernel.org/r/20200220024642.91529-1-gaoxiang25@huawei.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
btrfs_lookup_and_bind_dio_csum() does pointer arithmetic which assumes
32-bit checksums. If using a larger checksum, this leads to spurious
failures when a direct I/O read crosses a stripe. This is easy
to reproduce:
# mkfs.btrfs -f --checksum blake2 -d raid0 /dev/vdc /dev/vdd
...
# mount /dev/vdc /mnt
# cd /mnt
# dd if=/dev/urandom of=foo bs=1M count=1 status=none
# dd if=foo of=/dev/null bs=1M iflag=direct status=none
dd: error reading 'foo': Input/output error
# dmesg | tail -1
[ 135.821568] BTRFS warning (device vdc): csum failed root 5 ino 257 off 421888 ...
Fix it by using the actual checksum size.
Fixes: 1e25a2e3ca ("btrfs: don't assume ordered sums to be 4 bytes")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Let the low-level attr code only allocate the needed buffer size
for xfs_attrmulti_attr_get instead of allocating the upper bound
at the top of the call chain.
Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
No need to allocate the max size if we can just allocate the easily
known actual ACL size.
Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the round_down macro, and use the size of the uint32 type we
use in the callback that fills the buffer to make the code a little
more clear - the size of it is always the same as int for platforms
that Linux runs on.
Suggested-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The attrlist cursor only exists as part of an attr list context, so
embedd the structure instead of pointing to it. Also give it a proper
xfs_ prefix and remove the obsolete typedef.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we use the on-disk flags field also for the interface to the
lower level attr routines we can use the XFS_ATTR_INCOMPLETE definition
from the on-disk format directly instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The ATTR_* flags have a long IRIX history, where they a userspace
interface, the on-disk format and an internal interface. We've split
out the on-disk interface to the XFS_ATTR_* values, but despite (or
because?) of that the flag have still been a mess. Switch the
internal interface to pass the on-disk XFS_ATTR_* flags for the
namespace and the Linux XATTR_* flags for the actual flags instead.
The ATTR_* values that are actually used are move to xfs_fs.h with a
new XFS_IOC_* prefix to not conflict with the userspace version that
has the same name and must have the same value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove superflous braces, elses after return statements and use a goto
label to merge common error handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move the function to xfs_acl.c and provide a proper stub for the
!CONFIG_XFS_POSIX_ACL case. Lift the flags check to the caller as it
nicely fits in there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Lift the common code to copy the cursor from and to user space into
xfs_ioc_attr_list. Note that this means we copy in twice now as
the cursor is in the middle of the conaining structure, but we never
touch the memory for the original copy. Doing so keeps the cursor
handling isolated in the common helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Lift the buffer allocation from the two callers into xfs_ioc_attr_list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Lift the flags and bufsize checks from both callers into the common code
in xfs_ioc_attr_list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The version taking the context structure is the main interface to list
attributes, so drop the _int postfix.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The old xfs_attr_list code is only used by the attrlist by handle
ioctl. Move it to xfs_ioctl.c with its user. Also move the
attrlist and attrlist_ent structure to xfs_fs.h, as they are exposed
user ABIs. They are used through libattr headers with the same name
by at least xfsdump. Also document this relation so that it doesn't
require a research project to figure out.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Replace a single use macro containing open-coded variants of
standard helpers with direct calls to the standard helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Replace the alist char pointer with a void buffer given that different
callers use it in different ways. Use the chance to remove the typedef
and reduce the indentation of the struct definition so that it doesn't
overflow 80 char lines all over.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Factor out a helper that compares an on-disk attr vs the name, length and
flags specified in struct xfs_da_args.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
op_flags with the XFS_DA_OP_* flags is the usual place for in-kernel
only flags, so move the notime flag there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use a NULL args->value as the indicator to lazily allocate a buffer
instead, and let the caller always free args->value instead of
duplicating the cleanup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We can just pass down the Linux convention of a zero valuelen to just
query for the existance of an attribute to the low-level code instead.
The use in the legacy xfs_attr_list code only used by the ioctl
interface was already dead code, as the callers check that the flag
is not present.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The inode can easily be derived from the args structure. Also
don't bother with else statements after early returns.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of converting from one style of arguments to another in
xfs_attr_set, pass the structure from higher up in the call chain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of converting from one style of arguments to another in
xfs_attr_set, pass the structure from higher up in the call chain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The xattr values are blobs and should not be typed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All the callers already check the length when allocating the
in-kernel xattrs buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All callers provide a valid name pointer, remove the redundant check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Add a new helper to handle a single attr multi ioctl operation that
can be shared between the native and compat ioctl implementation.
There is a slight change in behaviour in that we don't break out of the
loop when copying in the attribute name fails. The previous behaviour
was rather inconsistent here as it continued for any other kind of
error, and that we don't clear the flags in the structure returned
to userspace, a behavior only introduced as a bug fix in the last
merge window.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Simplify the user copy code by using strndup_user. This means that we
now do one memory allocation per operation instead of one per ioctl,
but memory allocations are cheap compared to the actual file system
operations. Also the error for an invalid path is now EINVAL or EFAULT
instead of the previous odd and undocumented ERANGE.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Merge the ioctl handlers just like the low-level xfs_attr_set function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The Linux xattr and acl APIs use a single call for set and remove.
Modify the high-level XFS API to match that and let xfs_attr_set handle
removing attributes as well. With a little bit of reordering this
removes a lot of code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Replace the ATTR_INCOMPLETE flag with a new boolean field in struct
xfs_attr_list_context.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
While the flags field in the ABI and the on-disk format allows for
multiple namespace flags, an attribute can only exist in a single
namespace at a time. Hence asking to list attributes that exist
in multiple namespaces simultaneously is a logically invalid
request and will return no results. Reject this case early with
-EINVAL.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The collapse range operation uses a unique transaction and ilock
cycle for the hole punch and each extent shift iteration of the
overall operation. While the hole punch is safe as a separate
operation due to the iolock, cycling the ilock after each extent
shift is risky w.r.t. concurrent operations, similar to insert range.
To avoid this problem, make collapse range atomic with respect to
ilock. Hold the ilock across the entire operation, replace the
individual transactions with a single rolling transaction sequence
and finish dfops on each iteration to perform pending frees and roll
the transaction. Remove the unnecessary quota reservation as
collapse range can only ever merge extents (and thus remove extent
records and potentially free bmap blocks). The dfops call
automatically relogs the inode to keep it moving in the log. This
guarantees that nothing else can change the extent mapping of an
inode while a collapse range operation is in progress.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The insert range operation uses a unique transaction and ilock cycle
for the extent split and each extent shift iteration of the overall
operation. While this works, it is risks racing with other
operations in subtle ways such as COW writeback modifying an extent
tree in the middle of a shift operation.
To avoid this problem, make insert range atomic with respect to
ilock. Hold the ilock across the entire operation, replace the
individual transactions with a single rolling transaction sequence
and relog the inode to keep it moving in the log. This guarantees
that nothing else can change the extent mapping of an inode while
an insert range operation is in progress.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The insert range operation currently splits the extent at the target
offset in a separate transaction and lock cycle from the one that
shifts extents. In preparation for reworking insert range into an
atomic operation, lift the code into the caller so it can be easily
condensed to a single rolling transaction and lock cycle and
eliminate the helper. No functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Sparse reports a warning at xfs_ail_check()
warning: context imbalance in xfs_ail_check() - unexpected unlock
The root cause is the missing annotation at xfs_ail_check()
Add the missing __must_hold(&ailp->ail_lock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In xfs_da3_path_shift() "blk" can be assigned to state->path.blk[-1] if
state->path.active is 1 (which is a valid state) when it tries to add an
entry to a single dir leaf block and then to shift forward to see if
there's a sibling block that would be a better place to put the new
entry. This causes a UBSAN warning given negative array indices are
undefined behavior in C. In practice the warning is entirely harmless
given that "blk" is never dereferenced in this case, but it is still
better to fix up the warning and slightly improve the code.
UBSAN: Undefined behaviour in fs/xfs/libxfs/xfs_da_btree.c:1989:14
index -1 is out of range for type 'xfs_da_state_blk_t [5]'
Call trace:
dump_backtrace+0x0/0x2c8
show_stack+0x20/0x2c
dump_stack+0xe8/0x150
__ubsan_handle_out_of_bounds+0xe4/0xfc
xfs_da3_path_shift+0x860/0x86c [xfs]
xfs_da3_node_lookup_int+0x7c8/0x934 [xfs]
xfs_dir2_node_addname+0x2c8/0xcd0 [xfs]
xfs_dir_createname+0x348/0x38c [xfs]
xfs_create+0x6b0/0x8b4 [xfs]
xfs_generic_create+0x12c/0x1f8 [xfs]
xfs_vn_mknod+0x3c/0x4c [xfs]
xfs_vn_create+0x34/0x44 [xfs]
do_last+0xd4c/0x10c8
path_openat+0xbc/0x2f4
do_filp_open+0x74/0xf4
do_sys_openat2+0x98/0x180
__arm64_sys_openat+0xf8/0x170
do_el0_svc+0x170/0x240
el0_sync_handler+0x150/0x250
el0_sync+0x164/0x180
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use printk_ratelimit() to limit the amount of messages printed from
xfs_discard_page. Without that a failing device causes a large
number of errors that doesn't really help debugging the underling
issue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use printk_ratelimit() to limit the amount of messages printed from
xfs_buf_ioerror_alert. Without that a failing device causes a large
number of errors that doesn't really help debugging the underling
issue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove the XFS wrappers for converting from and to the kuid/kgid types.
Mostly this means switching to VFS i_{u,g}id_{read,write} helpers, but
in a few spots the calls to the conversion functions is open coded.
To match the use of sb->s_user_ns in the helpers and other file systems,
sb->s_user_ns is also used in the quota code. The ACL code already does
the conversion in a grotty layering violation in the VFS xattr code,
so it keeps using init_user_ns for the identity mapping.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the Linux inode i_uid/i_gid members everywhere and just convert
from/to the scalar value when reading or writing the on-disk inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of only synchronizing the uid/gid values in xfs_setup_inode,
ensure that they always match to prepare for removing the icdinode
fields.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If xfs_buf_get_map can't allocate enough memory for the buffer it's
trying to create, it'll cough up an error about not being able to
allocate "pagesn". That's not particularly helpful (and if we're really
out of memory the message is very spammy) so change the message to tell
us how many pages were actually requested, and ratelimit it too.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We recently used fuzz(hydra) to test XFS and automatically generate
tmp.img(XFS v5 format, but some metadata is wrong)
xfs_repair information(just one AG):
agf_freeblks 0, counted 3224 in ag 0
agf_longest 536874136, counted 3224 in ag 0
sb_fdblocks 613, counted 3228
Test as follows:
mount tmp.img tmpdir
cp file1M tmpdir
sync
In 4.19-stable, sync will stuck, the reason is:
xfs_mountfs
xfs_check_summary_counts
if ((!xfs_sb_version_haslazysbcount(&mp->m_sb) ||
XFS_LAST_UNMOUNT_WAS_CLEAN(mp)) &&
!xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS))
return 0; -->just return, incore sb_fdblocks still be 613
xfs_initialize_perag_data
cp file1M tmpdir -->ok(write file to pagecache)
sync -->stuck(write pagecache to disk)
xfs_map_blocks
xfs_iomap_write_allocate
while (count_fsb != 0) {
nimaps = 0;
while (nimaps == 0) { --> endless loop
nimaps = 1;
xfs_bmapi_write(..., &nimaps) --> nimaps becomes 0 again
xfs_bmapi_write
xfs_bmap_alloc
xfs_bmap_btalloc
xfs_alloc_vextent
xfs_alloc_fix_freelist
xfs_alloc_space_available -->fail(agf_freeblks is 0)
In linux-next, sync not stuck, cause commit c2b3164320 ("xfs:
use the latest extent at writeback delalloc conversion time") remove
the above while, dmesg is as follows:
[ 55.250114] XFS (loop0): page discard on page ffffea0008bc7380, inode 0x1b0c, offset 0.
Users do not know why this page is discard, the better soultion is:
1. Like xfs_repair, make sure sb_fdblocks is equal to counted
(xfs_initialize_perag_data did this, who is not called at this mount)
2. Add agf verify, if fail, will tell users to repair
This patch use the second soultion.
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: Ren Xudong <renxudong1@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Prior to commit df732b29c8 ("xfs: call xlog_state_release_iclog with
l_icloglock held"), xlog_state_release_iclog() always performed a
locked check of the iclog error state before proceeding into the
sync state processing code. As of this commit, part of
xlog_state_release_iclog() was open-coded into
xfs_log_release_iclog() and as a result the locked error state check
was lost.
The lockless check still exists, but this doesn't account for the
possibility of a race with a shutdown being performed by another
task causing the iclog state to change while the original task waits
on ->l_icloglock. This has reproduced very rarely via generic/475
and manifests as an assert failure in __xlog_state_release_iclog()
due to an unexpected iclog state.
Restore the locked error state check in xlog_state_release_iclog()
to ensure that an iclog state update via shutdown doesn't race with
the iclog release state processing code.
Fixes: df732b29c8 ("xfs: call xlog_state_release_iclog with l_icloglock held")
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Clang warns:
fs/io_uring.c:4178:6: warning: variable 'mask' is used uninitialized
whenever 'if' condition is false [-Wsometimes-uninitialized]
if (def->pollin)
^~~~~~~~~~~
fs/io_uring.c:4182:2: note: uninitialized use occurs here
mask |= POLLERR | POLLPRI;
^~~~
fs/io_uring.c:4178:2: note: remove the 'if' if its condition is always
true
if (def->pollin)
^~~~~~~~~~~~~~~~
fs/io_uring.c:4154:15: note: initialize the variable 'mask' to silence
this warning
__poll_t mask, ret;
^
= 0
1 warning generated.
io_op_defs has many definitions where pollin is not set so mask indeed
might be uninitialized. Initialize it to zero and change the next
assignment to |=, in case further masks are added in the future to avoid
missing changing the assignment then.
Fixes: d7718a9d25 ("io_uring: use poll driven retry for files that support it")
Link: https://github.com/ClangBuiltLinux/linux/issues/916
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io-wq cares about IO_WQ_WORK_UNBOUND flag only while enqueueing, so
it's useless setting it for a next req of a link. Thus, removed it
from io_prep_linked_timeout(), and inline the function.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After __io_queue_sqe() ended up in io_queue_async_work(), it's already
known that there is no @nxt req, so skip the check and return from the
function.
Also, @nxt initialisation now can be done just before
io_put_req_find_next(), as there is no jumping until it's checked.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently io_uring tries any request in a non-blocking manner, if it can,
and then retries from a worker thread if we get -EAGAIN. Now that we have
a new and fancy poll based retry backend, use that to retry requests if
the file supports it.
This means that, for example, an IORING_OP_RECVMSG on a socket no longer
requires an async thread to complete the IO. If we get -EAGAIN reading
from the socket in a non-blocking manner, we arm a poll handler for
notification on when the socket becomes readable. When it does, the
pending read is executed directly by the task again, through the io_uring
task work handlers. Not only is this faster and more efficient, it also
means we're not generating potentially tons of async threads that just
sit and block, waiting for the IO to complete.
The feature is marked with IORING_FEAT_FAST_POLL, meaning that async
pollable IO is fast, and that poll<link>other_op is fast as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a pollin/pollout field to the request table, and have commands that
we can safely poll for properly marked.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For poll requests, it's not uncommon to link a read (or write) after
the poll to execute immediately after the file is marked as ready.
Since the poll completion is called inside the waitqueue wake up handler,
we have to punt that linked request to async context. This slows down
the processing, and actually means it's faster to not use a link for this
use case.
We also run into problems if the completion_lock is contended, as we're
doing a different lock ordering than the issue side is. Hence we have
to do trylock for completion, and if that fails, go async. Poll removal
needs to go async as well, for the same reason.
eventfd notification needs special case as well, to avoid stack blowing
recursion or deadlocks.
These are all deficiencies that were inherited from the aio poll
implementation, but I think we can do better. When a poll completes,
simply queue it up in the task poll list. When the task completes the
list, we can run dependent links inline as well. This means we never
have to go async, and we can remove a bunch of code associated with
that, and optimizations to try and make that run faster. The diffstat
speaks for itself.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Store the io_kiocb in the private field instead of the poll entry, this
is in preparation for allowing multiple waitqueues.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
@hash_map is unsigned long, but BIT_ULL() is used for manipulations.
BIT() is a better match as it returns exactly unsigned long value.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IO_WQ_WORK_CB is used only for linked timeouts, which will be armed
before the work setup (i.e. mm, override creds, etc). The setup
shouldn't take long, so it's ok to arm it a bit later and get rid
of IO_WQ_WORK_CB.
Make io-wq call work->func() only once, callbacks will handle the rest.
i.e. the linked timeout handler will do the actual issue. And as a
bonus, it removes an extra indirect call.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Deduplicate call to io_cqring_fill_event(), plain and easy
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add support for splice(2).
- output file is specified as sqe->fd, so it's handled by generic code
- hash_reg_file handled by generic code as well
- len is 32bit, but should be fine
- the fd_in is registered file, when SPLICE_F_FD_IN_FIXED is set, which
is a splice flag (i.e. sqe->splice_flags).
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Preparation without functional changes. Adds io_get_file(), that allows
to grab files not only into req->file.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make do_splice(), so other kernel parts can reuse it
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->in_async is not really needed, it only prevents propagation of
@nxt for fast not-blocked submissions. Remove it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_prep_async_worker() called io_wq_assign_next() do many useless checks:
io_req_work_grab_env() was already called during prep, and @do_hashed
is not ever used. Add io_prep_next_work() -- simplified version, that
can be called io-wq.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Many operations define custom work.func before getting into an io-wq.
There are several points against:
- it calls io_wq_assign_next() from outside io-wq, that may be confusing
- sync context would go unnecessary through io_req_cancelled()
- prototypes are quite different, so work!=old_work looks strange
- makes async/sync responsibilities fuzzy
- adds extra overhead
Don't call generic path and io-wq handlers from each other, but use
helpers instead
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't drop an early reference, hang on to it and let the caller drop
it. This makes it behave more like "regular" requests.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the -EAGAIN happens because of a static condition, then a poll
or later retry won't fix it. We must call it again from blocking
condition. Play it safe and ensure that any -EAGAIN condition from read
or write must retry from async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_wq_flush() is buggy, during cancelation of a flush, the associated
work may be passed to the caller's (i.e. io_uring) @match callback. That
callback is expecting it to be embedded in struct io_kiocb. Cancelation
of internal work probably doesn't make a lot of sense to begin with.
As the flush helper is no longer used, just delete it and the associated
work flag.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When converting and moving nfsroot.txt to nfsroot.rst the references to
the old text file was not updated to match the change, fix this.
Fixes: f9a9349846 ("Documentation: nfsroot.txt: convert to ReST")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20200212181332.520545-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
To cancel a work, io-wq sets IO_WQ_WORK_CANCEL and executes the
callback. However, IO_WQ_WORK_NO_CANCEL works will just execute and may
return next work, which will be ignored and lost.
Cancel the whole link.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl5cMPoACgkQ8vlZVpUN
gaNYmgf/WX4/jMSYQu2fICudCqLr5fkLqsybvYGZGei3F8BaJ90zohQAQybNznWS
iyF0JzrOp37b/o0haz7KfDr7xVB3lAVsKu9Bglq+zL8mc9IkPmjhCXuLbknUtOUw
j3aVdntt4d6S3szbtP4PIZxNqh+/4KJDS2soWvuNWRpYMOv2yoMClptWWQtsimAt
3fYpxasSz0Jrhtbuf+I1oID++wOycDT3RKiko5tpLlQiFVoKBzfou+0ZdkC4+UIl
KvcpMBm1ijdGAaN9jfb2L2KCY5UdSvmeVui3sMXtHBEpKMJl2QsClylR1wGfgBKi
+YMEsjBONxKo3kH2DaPJaU6LEm8JuQ==
=rszH
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Two more bug fixes (including a regression) for 5.6"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()
jbd2: fix data races at struct journal_head
If sbi->s_flex_groups_allocated is zero and the first allocation fails
then this code will crash. The problem is that "i--" will set "i" to
-1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1
is type promoted to unsigned and becomes UINT_MAX. Since UINT_MAX
is more than zero, the condition is true so we call kvfree(new_groups[-1]).
The loop will carry on freeing invalid memory until it crashes.
Fixes: 7c990728b9 ("ext4: fix potential race between s_flex_groups online resizing and access")
Reviewed-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
journal_head::b_transaction and journal_head::b_next_transaction could
be accessed concurrently as noticed by KCSAN,
LTP: starting fsync04
/dev/zero: Can't open blockdev
EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
==================================================================
BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2]
write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70:
__jbd2_journal_refile_buffer+0xdd/0x210 [jbd2]
__jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569
jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2]
(inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034
kjournald2+0x13b/0x450 [jbd2]
kthread+0x1cd/0x1f0
ret_from_fork+0x27/0x50
read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68:
jbd2_write_access_granted+0x1b2/0x250 [jbd2]
jbd2_write_access_granted at fs/jbd2/transaction.c:1155
jbd2_journal_get_write_access+0x2c/0x60 [jbd2]
__ext4_journal_get_write_access+0x50/0x90 [ext4]
ext4_mb_mark_diskspace_used+0x158/0x620 [ext4]
ext4_mb_new_blocks+0x54f/0xca0 [ext4]
ext4_ind_map_blocks+0xc79/0x1b40 [ext4]
ext4_map_blocks+0x3b4/0x950 [ext4]
_ext4_get_block+0xfc/0x270 [ext4]
ext4_get_block+0x3b/0x50 [ext4]
__block_write_begin_int+0x22e/0xae0
__block_write_begin+0x39/0x50
ext4_write_begin+0x388/0xb50 [ext4]
generic_perform_write+0x15d/0x290
ext4_buffered_write_iter+0x11f/0x210 [ext4]
ext4_file_write_iter+0xce/0x9e0 [ext4]
new_sync_write+0x29c/0x3b0
__vfs_write+0x92/0xa0
vfs_write+0x103/0x260
ksys_write+0x9d/0x130
__x64_sys_write+0x4c/0x60
do_syscall_64+0x91/0xb05
entry_SYSCALL_64_after_hwframe+0x49/0xbe
5 locks held by fsync04/25724:
#0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260
#1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4]
#2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
#3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4]
#4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2]
irq event stamp: 1407125
hardirqs last enabled at (1407125): [<ffffffff980da9b7>] __find_get_block+0x107/0x790
hardirqs last disabled at (1407124): [<ffffffff980da8f9>] __find_get_block+0x49/0x790
softirqs last enabled at (1405528): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c
softirqs last disabled at (1405521): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0
Reported by Kernel Concurrency Sanitizer on:
CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
The plain reads are outside of jh->b_state_lock critical section which result
in data races. Fix them by adding pairs of READ|WRITE_ONCE().
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5ZXkgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprZqEACOvhiprH9Q75Pp+ZwQknM3xGyJRWI3Mbj9
ZOyTVTK0qhTeaq6rN4MLSYevXOh+L68x5WRt1YJ1UnQRE0i8+ZQyZczqLKxxl8gF
trhbYDXjvXIWr9zvdtiL01PKKu4Vjjp6eZAomrbxCTFku0qn76fo9wDgGPRGL+Kx
lNO/6QvCXr9EjDniEUhlQsxTad5xc4sL0cnL4s2i7RlTCYtW4WJXJMC/4Gkg69j+
W5GBZyjJDa8Sj3pEbLjtDtA4ooE9VMaldb7ZvR62ONUVwGpftPsbN7UhVlhyhpW+
8v4ZEf07CxB246+hj7oL0RvEW3+/nB2hym1ySMXyBzpbx4O1JOUG7hQtNgdLRbCZ
27IOg2O36qbUKM1hUwn7Qm3XAfBPQdFpVmqE2+E9MEOKzigLzhRP6Bu5d9x9VQGh
JDxsm3B8PRHFJVAasiYu0p7mlx/+BCLjB84UrMB3I9UCBuVfk4mtmuwZX+mcK2PR
pV1xJlEMYKme3cz2/u6uB8p3Nq6ipE1nSVrI6AnfEvJbQ9sFL61KaG4wHKPvtb0y
mlNgc4seSjiWcBR2/84561a4CSmlXAn9dWMIGdHFFA43mTPYGc5omTcM8FwcEDkW
cTFGB8sFukcTNmOw62HUHYI1vPpowX6apV08lEQrScz7GiK5piTYqTFNneqEzcwZ
3bIMisH3Gg==
=WheR
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Fix for a race with IOPOLL used with SQPOLL (Xiaoguang)
- Only show ->fdinfo if procfs is enabled (Tobias)
- Fix for a chain with multiple personalities in the SQEs
- Fix for a missing free of personality idr on exit
- Removal of the spin-for-work optimization
- Fix for next work lookup on request completion
- Fix for non-vec read/write result progation in case of links
- Fix for a fileset references on switch
- Fix for a recvmsg/sendmsg 32-bit compatability mode
* tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
io_uring: fix 32-bit compatability with sendmsg/recvmsg
io_uring: define and set show_fdinfo only if procfs is enabled
io_uring: drop file set ref put/get on switch
io_uring: import_single_range() returns 0/-ERROR
io_uring: pick up link work on submit reference drop
io-wq: ensure work->task_pid is cleared on init
io-wq: remove spin-for-work optimization
io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL
io_uring: fix personality idr leak
io_uring: handle multiple personalities in link chains
There remains no more code in the kernel using pids_ns->proc_mnt,
therefore remove it from the kernel.
The big benefit of this change is that one of the most error prone and
tricky parts of the pid namespace implementation, maintaining kernel
mounts of proc is removed.
In addition removing the unnecessary complexity of the kernel mount
fixes a regression that caused the proc mount options to be ignored.
Now that the initial mount of proc comes from userspace, those mount
options are again honored. This fixes Android's usage of the proc
hidepid option.
Reported-by: Alistair Strachan <astrachan@google.com>
Fixes: e94591d0d9 ("proc: Convert proc_mount to use mount_ns.")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Two fixes in this pull request:
* Revert the initial decision to silently ignore IOCB_NOWAIT for
asynchronous direct IOs to sequential zone files. Instead, return an
error to the user to signal that the feature is not supported (from
Christoph)
* A fix to zonefs Kconfig to select FS_IOMAP to avoid build failures if
no other file system already selected this option (from Johannes).
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCXljJWAAKCRDdoc3SxdoY
dmztAP9Sj74cHVTxac+HoDKwf6DYWfjPWonT5tO4wc8q0PBDOgEAhKzHQJZNqJvd
a0BrEf/t6RLWDgsi75cB/U6HsiGkiA0=
=+maQ
-----END PGP SIGNATURE-----
Merge tag 'zonefs-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fixes from Damien Le Moal:
"Two fixes in here:
- Revert the initial decision to silently ignore IOCB_NOWAIT for
asynchronous direct IOs to sequential zone files. Instead, return
an error to the user to signal that the feature is not supported
(from Christoph)
- A fix to zonefs Kconfig to select FS_IOMAP to avoid build failures
if no other file system already selected this option (from
Johannes)"
* tag 'zonefs-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: select FS_IOMAP
zonefs: fix IOCB_NOWAIT handling
The mptcp conflict was overlapping additions.
The SMC conflict was an additional and removal happening at the same
time.
Signed-off-by: David S. Miller <davem@davemloft.net>
We must set MSG_CMSG_COMPAT if we're in compatability mode, otherwise
the iovec import for these commands will not do the right thing and fail
the command with -EINVAL.
Found by running the test suite compiled as 32-bit.
Cc: stable@vger.kernel.org
Fixes: aa1fa28fc7 ("io_uring: add support for recvmsg()")
Fixes: 0fa03c624d ("io_uring: add support for sendmsg()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently it either returns -E... or puts (nd->path.mnt,dentry)
into *path and returns 0. Make it return ERR_PTR(-E...) or
dentry; adjust the caller. Fewer arguments and it's easier
to keep track of *path contents that way.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All callers of follow_managed() follow it on success with the same steps -
d_backing_inode(path->dentry) is calculated and stored into some struct inode *
variable and, in all but one case, an unsigned variable (nd->seq to be) is
zeroed. The single exception is lookup_fast() and there zeroing is correct
thing to do - not doing it is a pointless microoptimization.
Add a wrapper for follow_managed() that would do that combination.
It's mostly a vehicle for code massage - it will be changing quite a bit,
and the current calling conventions are by no means final. Right now it
takes path, nameidata and (as out params) inode and seq, similar to
__follow_mount_rcu(). Which will soon get folded into it...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink".
As it is, we might or might not have LOOKUP_FOLLOW in op->intent
in that case - that depends upon having O_NOFOLLOW in open flags.
It doesn't matter, since we won't be checking it in that case -
do_last() bails out earlier.
However, making sure it's not set (i.e. acting as if we had an explicit
O_NOFOLLOW) makes the behaviour more explicit and allows to reorder the
check for O_CREAT | O_EXCL in do_last() with the call of step_into()
immediately following it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Only the address of ->total_link_count and the flags.
And fix an off-by-one is ELOOP detection - make it
consistent with symlink following, where we check if
the pre-increment value has reached 40, rather than
check the post-increment one.
[kudos to Christian Brauner for spotted braino]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1) no instances of ->d_automount() have ever made use of the "return
ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's
a rudiment of plans that got superseded before the thing went into
the tree. Despite the comment in follow_automount(), autofs has
never done that.
2) if there's no ->d_automount() in dentry_operations, filesystems
should not set DCACHE_NEED_AUTOMOUNT in the first place. None have
ever done so...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Protection against automount/automount races (two threads hitting the same
referral point at the same time) is based upon do_add_mount() prevention of
identical overmounts - trying to overmount the root of mounted tree with
the same tree fails with -EBUSY. It's unreliable (the other thread might've
mounted something on top of the automount it has triggered) *and* causes
no end of headache for follow_automount() and its caller, since
finish_automount() behaves like do_new_mount() - if the mountpoint to be is
overmounted, it mounts on top what's overmounting it. It's not only wrong
(we want to go into what's overmounting the automount point and quietly
discard what we planned to mount there), it introduces the possibility of
original parent mount getting dropped. That's what 8aef188452 (VFS: Fix
vfsmount overput on simultaneous automount) deals with, but it can't do
anything about the reliability of conflict detection - if something had
been overmounted the other thread's automount (e.g. that other thread
having stepped into automount in mount(2)), we don't get that -EBUSY and
the result is
referral point under automounted NFS under explicit overmount
under another copy of automounted NFS
What we need is finish_automount() *NOT* digging into overmounts - if it
finds one, it should just quietly discard the thing it was asked to mount.
And don't bother with actually crossing into the results of finish_automount() -
the same loop that calls follow_automount() will do that just fine on the
next iteration.
IOW, instead of calling lock_mount() have finish_automount() do it manually,
_without_ the "move into overmount and retry" part. And leave crossing into
the results to the caller of follow_automount(), which simplifies it a lot.
Moral: if you end up with a lot of glue working around the calling conventions
of something, perhaps these calling conventions are simply wrong...
Fixes: 8aef188452 (VFS: Fix vfsmount overput on simultaneous automount)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Using f2fs_trylock_op() in f2fs_write_compressed_pages() to avoid potential
deadlock like we did in f2fs_write_single_data_page().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, we can not distinguish the exact location of messages,
when there are more than one places printing same message.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In Struct compress_data, chksum field was never used, remove it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Unable to handle kernel NULL pointer dereference at virtual address 00000000
PC is at f2fs_free_dic+0x60/0x2c8
LR is at f2fs_decompress_pages+0x3c4/0x3e8
f2fs_free_dic+0x60/0x2c8
f2fs_decompress_pages+0x3c4/0x3e8
__read_end_io+0x78/0x19c
f2fs_post_read_work+0x6c/0x94
process_one_work+0x210/0x48c
worker_thread+0x2e8/0x44c
kthread+0x110/0x120
ret_from_fork+0x10/0x18
In f2fs_free_dic(), we can not use f2fs_put_page(,1) to release dic->tpages[i],
as the page's mapping is NULL.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When the compressed data of a cluster doesn't end on a page boundary,
the remainder of the last page must be zeroed in order to avoid leaking
uninitialized memory to disk.
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There could be a scenario where f2fs_sync_meta_pages() will not
ensure that all F2FS_DIRTY_META pages are submitted for IO. Thus,
resulting in the below panic in do_checkpoint() -
f2fs_bug_on(sbi, get_pages(sbi, F2FS_DIRTY_META) &&
!f2fs_cp_error(sbi));
This can happen in a low-memory condition, where shrinker could
also be doing the writepage operation (stack shown below)
at the same time when checkpoint is running on another core.
schedule
down_write
f2fs_submit_page_write -> by this time, this page in page cache is tagged
as PAGECACHE_TAG_WRITEBACK and PAGECACHE_TAG_DIRTY
is cleared, due to which f2fs_sync_meta_pages()
cannot sync this page in do_checkpoint() path.
f2fs_do_write_meta_page
__f2fs_write_meta_page
f2fs_write_meta_page
shrink_page_list
shrink_inactive_list
shrink_node_memcg
shrink_node
kswapd
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There is a race condition that we may miss to wait for all node pages
writeback, fix it.
- fsync() - shrink
- f2fs_do_sync_file
- __write_node_page
- set_page_writeback(page#0)
: remove DIRTY/TOWRITE flag
- f2fs_fsync_node_pages
: won't find page #0 as TOWRITE flag was removeD
- f2fs_wait_on_node_pages_writeback
: wont' wait page #0 writeback as it was not in fsync_node_list list.
- f2fs_add_fsync_node_entry
Fixes: 50fa53eccf ("f2fs: fix to avoid broken of dnode block list")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In Aug 2018 NeilBrown noticed
commit 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code and interface")
"Some ->next functions do not increment *pos when they return NULL...
Note that such ->next functions are buggy and should be fixed.
A simple demonstration is
dd if=/proc/swaps bs=1000 skip=1
Choose any block size larger than the size of /proc/swaps. This will
always show the whole last line of /proc/swaps"
/proc/swaps output was fixed recently, however there are lot of other
affected files, and one of them is related to pstore subsystem.
If .next function does not change position index, following .show function
will repeat output related to current position index.
There are at least 2 related problems:
- read after lseek beyond end of file, described above by NeilBrown
"dd if=<AFFECTED_FILE> bs=1000 skip=1" will generate whole last list
- read after lseek on in middle of last line will output expected rest of
last line but then repeat whole last line once again.
If .show() function generates multy-line output (like
pstore_ftrace_seq_show() does ?) following bash script cycles endlessly
$ q=;while read -r r;do echo "$((++q)) $r";done < AFFECTED_FILE
Unfortunately I'm not familiar enough to pstore subsystem and was unable
to find affected pstore-related file on my test node.
If .next function does not change position index, following .show function
will repeat output related to current position index.
Cc: stable@vger.kernel.org
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code ...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Link: https://lore.kernel.org/r/4e49830d-4c88-0171-ee24-1ee540028dad@virtuozzo.com
[kees: with robustness tweak from Joel Fernandes <joelaf@google.com>]
Signed-off-by: Kees Cook <keescook@chromium.org>
Follow the pattern used with other *_show_fdinfo functions and only
define and use io_uring_show_fdinfo and its helper functions if
CONFIG_PROC_FS is set.
Fixes: 87ce955b24 ("io_uring: add ->show_fdinfo() for the io_uring file descriptor")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Function leaf_dealloc was not allocating enough journal space for
revokes. Before, it allocated 'l_blocks' revokes, but it needs one
more for the revoke of the dinode that is modified. This patch adds
the needed revoke entry to function leaf_dealloc.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, journal replays could stomp on log flushes
and each other because both log flushes and journal replays used
the same sd_log_bio. Function gfs2_log_flush prevents other log
flushes from interfering by taking the sd_log_flush_lock rwsem
during the flush. However, it does not protect against journal
replays. This patch allows the journal replay to take the same
sd_log_flush_lock rwsem so use of the sd_log_bio is not stomped.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, function gfs2_releasepage would free any bd
elements that had been used for the page being released. However,
those bd elements may still be queued to the sd_log_revokes list,
in which case we cannot free them until the revoke has been issued.
This patch adds additional checks for bds that are still being
used for revokes.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Function gfs2_log_flush() had a few places where it tried to withdraw
from the file system when errors were encountered. The problem is,
it should delay those withdraws until the log flush lock is no longer
held.
This patch creates a new function just for delayed withdraws for
situations like this. If errors=panic was specified on mount, we
still want to do it the old fashioned way because the panic it does
not help to delay in that situation.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function do_xmote would try to sync out the glock
dirty data by calling the appropriate glops function XXX_go_sync()
but it did not check for a good return code. If the sync was not
possible due to an io error or whatever, do_xmote would continue on
and call go_inval and release the glock to other cluster nodes.
When those nodes go to replay the journal, they may already be holding
glocks for the journal records that should have been synced, but were
not due to the ignored error.
This patch introduces proper error code checking to the go_sync
family of glops functions.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, run_queue would demote glocks based on whether
there are any more holders. But if the glock has pending revokes that
haven't been written to the media, giving up the glock might end in
file system corruption if the revokes never get written due to
io errors, node crashes and fences, etc. In that case, another node
will replay the metadata blocks associated with the glock, but
because the revoke was never written, it could replay that block
even though the glock had since been granted to another node who
might have made changes.
This patch changes the logic in run_queue so that it never demotes
a glock until its count of pending revokes reaches zero.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, gfs2_logd continually tried to flush its journal
log, after the file system is withdrawn. We don't want to write anything
to the journal, lest we add corruption. Best course of action is to
drain the ail1 into the ail2 list (via gfs2_ail1_empty) then drain the
ail2 list with a new function, ail2_drain.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_ail1_start_one would return any
errors it received from write_cache_pages (except -EBUSY) but it did
not withdraw. Since function gfs2_ail1_flush just checks for the bad
return code and loops, the loop might potentially never end.
This patch adds some logic to allow it to exit the loop and withdraw
properly when errors are received from write_cache_pages.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, if gfs2_ail_empty_gl saw there was nothing on
the ail list, it would return and not flush the log. The problem
is that there could still be a revoke for the rgrp sitting on the
sd_log_le_revoke list that's been recently taken off the ail list.
But that revoke still needs to be written, and the rgrp_go_inval
still needs to call log_flush_wait to ensure the revokes are all
properly written to the journal before we relinquish control of
the glock to another node. If we give the glock to another node
before we have this knowledge, the node might crash and its journal
replayed, in which case the missing revoke would allow the journal
replay to replay the rgrp over top of the rgrp we already gave to
another node, thus overwriting its changes and corrupting the
file system.
This patch makes gfs2_ail_empty_gl still call gfs2_log_flush rather
than returning.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function do_xmote just assumed all the writes
submitted to the journal were finished and successful, and it
called the go_unlock function to release the dlm lock. But if
they're not, and a revoke failed to make its way to the journal,
a journal replay on another node will cause corruption if we
let the go_inval function continue and tell dlm to release the
glock to another node. This patch adds a couple checks for errors
in do_xmote after the calls to go_sync and go_inval. If an error
is found, we cannot withdraw yet, because the withdraw itself
uses glocks to make the file system read-only. Instead, we flag
the error. Later, asserts should cause another node to replay
the journal before continuing, thus protecting rgrp and dinode
glocks and maintaining the integrity of the metadata. Note that
we only need to do this for journaled glocks. System glocks
should be able to progress even under withdrawn conditions.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_end_log_write would detect any IO
errors writing to the journal and put out an appropriate message,
but it never set a withdrawing condition. Eventually, the log daemon
would see the error and determine it was time to withdraw, but in
the meantime, other processes could continue running as if nothing
bad ever happened. The biggest consequence is that __gfs2_glock_put
would BUG() when it saw that there were still unwritten items.
This patch sets the WITHDRAWING status as soon as an IO error is
detected, and that way, the BUG will be avoided so the file system
can be properly withdrawn and unmounted.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_write_revokes would call
gfs2_ail1_empty, then traverse the sd_ail1_list looking for
transactions that had bds which were no longer queued to a glock.
And if it found some, it would try to issue revokes for them, up to
a predetermined maximum. There were two problems with how it did
this. First was the fact that gfs2_ail1_empty moves transactions
which have nothing remaining on the ail1 list from the sd_ail1_list
to the sd_ail2_list, thus making its traversal of sd_ail1_list
miss them completely, and therefore, never issue revokes for them.
Second was the fact that there were three traversals (or partial
traversals) of the sd_ail1_list, each of which took and then
released the sd_ail_lock lock: First inside gfs2_ail1_empty,
second to determine if there are any revokes to be issued, and
third to actually issue them. All this taking and releasing of the
sd_ail_lock meant other processes could modify the lists and the
conditions in which we're working.
This patch simplies the whole process by adding a new parameter
to function gfs2_ail1_empty, max_revokes. For normal calls, this
is passed in as 0, meaning we don't want to issue any revokes.
For function gfs2_write_revokes, we pass in the maximum number
of revokes we can, thus allowing gfs2_ail1_empty to add the
revokes where needed. This simplies the code, allows for a single
holding of the sd_ail_lock, and allows gfs2_ail1_empty to add
revokes for all the necessary bd items without missing any.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function check_journal_clean would give messages
related to journal recovery. That's fine for mount time, but when a
node withdraws and forces replay that way, we don't want all those
distracting and misleading messages. This patch adds a new parameter
to make those messages optional.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, the rgrp_go_inval and inode_go_inval functions each
checked if there were any items left on the ail count (by way of a
count), and if so, did a withdraw. But the withdraw code now uses
glocks when changing the file system to read-only status. So we can
not have glock functions withdrawing or a hang will likely result:
The glocks can't be serviced by the work_func if the work_func is
busy doing its own withdraw.
This patch removes the checks from the go_inval functions and adds
a centralized check in do_xmote to warn about the problem and not
withdraw, but flag the error so it's eventually caught when the logd
daemon eventually runs.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
When a node withdraws from a file system, it often leaves its journal
in an incomplete state. This is especially true when the withdraw is
caused by io errors writing to the journal. Before this patch, a
withdraw would try to write a "shutdown" record to the journal, tell
dlm it's done with the file system, and none of the other nodes
know about the problem. Later, when the problem is fixed and the
withdrawn node is rebooted, it would then discover that its own
journal was incomplete, and replay it. However, replaying it at this
point is almost guaranteed to introduce corruption because the other
nodes are likely to have used affected resource groups that appeared
in the journal since the time of the withdraw. Replaying the journal
later will overwrite any changes made, and not through any fault of
dlm, which was instructed during the withdraw to release those
resources.
This patch makes file system withdraws seen by the entire cluster.
Withdrawing nodes dequeue their journal glock to allow recovery.
The remaining nodes check all the journals to see if they are
clean or in need of replay. They try to replay dirty journals, but
only the journals of withdrawn nodes will be "not busy" and
therefore available for replay.
Until the journal replay is complete, no i/o related glocks may be
given out, to ensure that the replay does not cause the
aforementioned corruption: We cannot allow any journal replay to
overwrite blocks associated with a glock once it is held.
The "live" glock which is now used to signal when a withdraw
occurs. When a withdraw occurs, the node signals its withdraw by
dequeueing the "live" glock and trying to enqueue it in EX mode,
thus forcing the other nodes to all see a demote request, by way
of a "1CB" (one callback) try lock. The "live" glock is not
granted in EX; the callback is only just used to indicate a
withdraw has occurred.
Note that all nodes in the cluster must wait for the recovering
node to finish replaying the withdrawing node's journal before
continuing. To this end, it checks that the journals are clean
multiple times in a retry loop.
Also note that the withdraw function may be called from a wide
variety of situations, and therefore, we need to take extra
precautions to make sure pointers are valid before using them in
many circumstances.
We also need to take care when glocks decide to withdraw, since
the withdraw code now uses glocks.
Also, before this patch, if a process encountered an error and
decided to withdraw, if another process was already withdrawing,
the second withdraw would be silently ignored, which set it free
to unlock its glocks. That's correct behavior if the original
withdrawer encounters further errors down the road. But if
secondary waiters don't wait for the journal replay, unlocking
glocks will allow other nodes to use them, despite the fact that
the journal containing those blocks is being replayed. The
replay needs to finish before our glocks are released to other
nodes. IOW, secondary withdraws need to wait for the first
withdraw to finish.
For example, if an rgrp glock is unlocked by a process that didn't
wait for the first withdraw, a journal replay could introduce file
system corruption by replaying a rgrp block that has already been
granted to a different cluster node.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add a helper to change the owner of sysfs objects.
This function will be used to correctly account for kobject ownership
changes, e.g. when moving network devices between network namespaces.
This mirrors how a kobject is added through driver core which in its guts is
done via kobject_add_internal() which in summary creates the main directory via
create_dir(), populates that directory with the groups associated with the
ktype of the kobject (if any) and populates the directory with the basic
attributes associated with the ktype of the kobject (if any). These are the
basic steps that are associated with adding a kobject in sysfs.
Any additional properties are added by the specific subsystem itself (not by
driver core) after it has registered the device. So for the example of network
devices, a network device will e.g. register a queue subdirectory under the
basic sysfs directory for the network device and than further subdirectories
within that queues subdirectory. But that is all specific to network devices
and they call the corresponding sysfs functions to do that directly when they
create those queue objects. So anything that a subsystem adds outside of what
driver core does must also be changed by it (That's already true for removal of
files it created outside of driver core.) and it's the same for ownership
changes.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add helpers to change the owner of sysfs groups.
This function will be used to correctly account for kobject ownership
changes, e.g. when moving network devices between network namespaces.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to change the owner of a sysfs link.
This function will be used to correctly account for kobject ownership
changes, e.g. when moving network devices between network namespaces.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add helpers to change the owner of a sysfs files.
This function will be used to correctly account for kobject ownership
changes, e.g. when moving network devices between network namespaces.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This time, the set of changes for the EFI subsystem is much larger than
usual. The main reasons are:
- Get things cleaned up before EFI support for RISC-V arrives, which will
increase the size of the validation matrix, and therefore the threshold to
making drastic changes,
- After years of defunct maintainership, the GRUB project has finally started
to consider changes from the distros regarding UEFI boot, some of which are
highly specific to the way x86 does UEFI secure boot and measured boot,
based on knowledge of both shim internals and the layout of bootparams and
the x86 setup header. Having this maintenance burden on other architectures
(which don't need shim in the first place) is hard to justify, so instead,
we are introducing a generic Linux/UEFI boot protocol.
Summary of changes:
- Boot time GDT handling changes (Arvind)
- Simplify handling of EFI properties table on arm64
- Generic EFI stub cleanups, to improve command line handling, file I/O,
memory allocation, etc.
- Introduce a generic initrd loading method based on calling back into
the firmware, instead of relying on the x86 EFI handover protocol or
device tree.
- Introduce a mixed mode boot method that does not rely on the x86 EFI
handover protocol either, and could potentially be adopted by other
architectures (if another one ever surfaces where one execution mode
is a superset of another)
- Clean up the contents of struct efi, and move out everything that
doesn't need to be stored there.
- Incorporate support for UEFI spec v2.8A changes that permit firmware
implementations to return EFI_UNSUPPORTED from UEFI runtime services at
OS runtime, and expose a mask of which ones are supported or unsupported
via a configuration table.
- Various documentation updates and minor code cleanups (Heinrich)
- Partial fix for the lack of by-VA cache maintenance in the decompressor
on 32-bit ARM. Note that these patches were deliberately put at the
beginning so they can be used as a stable branch that will be shared with
a PR containing the complete fix, which I will send to the ARM tree.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEnNKg2mrY9zMBdeK7wjcgfpV0+n0FAl5S7WYACgkQwjcgfpV0
+n1jmQgAmwV3V8pbgB4mi4P2Mv8w5Zj5feUe6uXnTR2AFv5nygLcTzdxN+TU/6lc
OmZv2zzdsAscYlhuUdI/4t4cXIjHAZI39+UBoNRuMqKbkbvXCFscZANLxvJjHjZv
FFbgUk0DXkF0BCEDuSLNavidAv4b3gZsOmHbPfwuB8xdP05LbvbS2mf+2tWVAi2z
ULfua/0o9yiwl19jSS6iQEPCvvLBeBzTLW7x5Rcm/S0JnotzB59yMaeqD7jO8JYP
5PvI4WM/l5UfVHnZp2k1R76AOjReALw8dQgqAsT79Q7+EH3sNNuIjU6omdy+DFf4
tnpwYfeLOaZ1SztNNfU87Hsgnn2CGw==
=pDE3
-----END PGP SIGNATURE-----
Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi into efi/core
Pull EFI updates for v5.7 from Ard Biesheuvel:
This time, the set of changes for the EFI subsystem is much larger than
usual. The main reasons are:
- Get things cleaned up before EFI support for RISC-V arrives, which will
increase the size of the validation matrix, and therefore the threshold to
making drastic changes,
- After years of defunct maintainership, the GRUB project has finally started
to consider changes from the distros regarding UEFI boot, some of which are
highly specific to the way x86 does UEFI secure boot and measured boot,
based on knowledge of both shim internals and the layout of bootparams and
the x86 setup header. Having this maintenance burden on other architectures
(which don't need shim in the first place) is hard to justify, so instead,
we are introducing a generic Linux/UEFI boot protocol.
Summary of changes:
- Boot time GDT handling changes (Arvind)
- Simplify handling of EFI properties table on arm64
- Generic EFI stub cleanups, to improve command line handling, file I/O,
memory allocation, etc.
- Introduce a generic initrd loading method based on calling back into
the firmware, instead of relying on the x86 EFI handover protocol or
device tree.
- Introduce a mixed mode boot method that does not rely on the x86 EFI
handover protocol either, and could potentially be adopted by other
architectures (if another one ever surfaces where one execution mode
is a superset of another)
- Clean up the contents of struct efi, and move out everything that
doesn't need to be stored there.
- Incorporate support for UEFI spec v2.8A changes that permit firmware
implementations to return EFI_UNSUPPORTED from UEFI runtime services at
OS runtime, and expose a mask of which ones are supported or unsupported
via a configuration table.
- Various documentation updates and minor code cleanups (Heinrich)
- Partial fix for the lack of by-VA cache maintenance in the decompressor
on 32-bit ARM. Note that these patches were deliberately put at the
beginning so they can be used as a stable branch that will be shared with
a PR containing the complete fix, which I will send to the ARM tree.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Unlike the other core import helpers, import_single_range() returns 0 on
success, not the length imported. This means that links that depend on
the result of non-vec based IORING_OP_{READ,WRITE} that were added for
5.5 get errored when they should not be.
Fixes: 3a6820f2bb ("io_uring: add non-vectored read/write commands")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If work completes inline, then we should pick up a dependent link item
in __io_queue_sqe() as well. If we don't do so, we're forced to go async
with that item, which is suboptimal.
This also fixes an issue with io_put_req_find_next(), which always looks
up the next work item. That should only be done if we're dropping the
last reference to the request, to prevent multiple lookups of the same
work item.
Outside of being a fix, this also enables a good cleanup series for 5.7,
where we never have to pass 'nxt' around or into the work handlers.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lockdep complains about a chain:
sb_internal#2 --> &ei->xattr_sem#2 --> fs_reclaim
and shrink_dentry_list -> ext2_evict_inode -> ext2_xattr_delete_inode ->
down_write(ei->xattr_sem) creating a locking cycle in the reclaim path.
This is however a false positive because when we are in
ext2_evict_inode() we are the only holder of the inode reference and
nobody else should touch xattr_sem of that inode. So we cannot ever
block on acquiring the xattr_sem in the reclaim path.
Silence the lockdep warning by using down_write_trylock() in
ext2_xattr_delete_inode() to not create false locking dependency.
Reported-by: "J. R. Okajima" <hooanon05g@gmail.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Zonefs makes use of iomap internally, so it should also select iomap in
Kconfig.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
IOCB_NOWAIT can't just be ignored as it breaks applications expecting
it not to block. Just refuse the operation as applications must handle
that (e.g. by falling back to a thread pool).
Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
We use ->task_pid for exit cancellation, but we need to ensure it's
cleared to zero for io_req_work_grab_env() to do the right thing. Take
a suggestion from Bart and clear the whole thing, just setting the
function passed in. This makes it more future proof as well.
Fixes: 36282881a7 ("io-wq: add io_wq_cancel_pid() to cancel based on a specific pid")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a potential mem leak when pstore_init_fs failed,
since the pstore compression maybe unlikey to initialized
successfully. We must clean up the allocation once this
unlikey issue happens.
Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Link: https://lore.kernel.org/r/1581068800-13817-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
An NFS client that mounts multiple exports from the same NFS
server with higher NFSv4 versions disabled (i.e. 4.2) and without
forcing a specific NFS version results in fscache index cookie
collisions and the following messages:
[ 570.004348] FS-Cache: Duplicate cookie detected
Each nfs_client structure should have its own fscache index cookie,
so add the minorversion to nfs_server_key.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200145
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If userspace passes an nfs_mount_data struct in the data argument of
mount(2), then nfs23_parse_monolithic() or nfs4_parse_monolithic()
will allocate memory for ctx->nfs_server.hostname. This needs to be
freed in nfs_parse_source(), which also allocates memory for
ctx->nfs_server.hostname, otherwise a leak will occur.
Reported-by: syzbot+193c375dcddb4f345091@syzkaller.appspotmail.com
Fixes: f2aedb713c ("NFS: Add fs_context support.")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Hard-coding the fstype causes "nfs4" mounts to appear as "nfs",
which breaks scripts that do "umount -at nfs4".
Reported-by: Patrick Steinhardt <ps@pks.im>
Fixes: f2aedb713c ("NFS: Add fs_context support.")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Andres reports that buffered IO seems to suck up more cycles than we
would like, and he narrowed it down to the fact that the io-wq workers
will briefly spin for more work on completion of a work item. This was
a win on the networking side, but apparently some other cases take a
hit because of it. Remove the optimization to avoid burning more CPU
than we have to for disk IO.
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After making ext4 support iopoll method:
let ext4_file_operations's iopoll method be iomap_dio_iopoll(),
we found fio can easily hang in fio_ioring_getevents() with below fio
job:
rm -f testfile; sync;
sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread
-rw=write -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1
-bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting
with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.
There are two issues that results in this hang, one reason is that
when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio
does not use io_uring_enter to get completed events, it relies on
kernel io_sq_thread to poll for completed events.
Another reason is that there is a race: when io_submit_sqes() in
io_sq_thread() submits a batch of sqes, variable 'inflight' will
record the number of submitted reqs, then io_sq_thread will poll for
reqs which have been added to poll_list. But note, if some previous
reqs have been punted to io worker, these reqs will won't be in
poll_list timely. io_sq_thread() will only poll for a part of previous
submitted reqs, and then find poll_list is empty, reset variable
'inflight' to be zero. If app just waits these deferred reqs and does
not wake up io_sq_thread again, then hang happens.
For app that entirely relies on io_sq_thread to poll completed requests,
let io_iopoll_req_issued() wake up io_sq_thread properly when adding new
element to poll_list, and when io_sq_thread prepares to sleep, check
whether poll_list is empty again, if not empty, continue to poll.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All other uses of cifs_dbg use defines so change this one.
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
To rename a file in SMB2 we open it with the DELETE access and do a
special SetInfo on it. If the handle is missing the DELETE bit the
server will fail the SetInfo with STATUS_ACCESS_DENIED.
We currently try to reuse any existing opened handle we have with
cifs_get_writable_path(). That function looks for handles with WRITE
access but doesn't check for DELETE, making rename() fail if it finds
a handle to reuse. Simple reproducer below.
To select handles with the DELETE bit, this patch adds a flag argument
to cifs_get_writable_path() and find_writable_file() and the existing
'bool fsuid_only' argument is converted to a flag.
The cifsFileInfo struct only stores the UNIX open mode but not the
original SMB access flags. Since the DELETE bit is not mapped in that
mode, this patch stores the access mask in cifs_fid on file open,
which is accessible from cifsFileInfo.
Simple reproducer:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#define E(s) perror(s), exit(1)
int main(int argc, char *argv[])
{
int fd, ret;
if (argc != 3) {
fprintf(stderr, "Usage: %s A B\n"
"create&open A in write mode, "
"rename A to B, close A\n", argv[0]);
return 0;
}
fd = openat(AT_FDCWD, argv[1], O_WRONLY|O_CREAT|O_SYNC, 0666);
if (fd == -1) E("openat()");
ret = rename(argv[1], argv[2]);
if (ret) E("rename()");
ret = close(fd);
if (ret) E("close()");
return ret;
}
$ gcc -o bugrename bugrename.c
$ ./bugrename /mnt/a /mnt/b
rename(): Permission denied
Fixes: 8de9e86c67 ("cifs: create a helper to find a writeable handle by path name")
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
We were not displaying the mount option "signloosely" in /proc/mounts
for cifs mounts which some users found confusing recently
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Ensure that full_path is an UNC path that contains '\\' as delimiter,
which is required by cifs_build_devname().
The build_path_from_dentry_optional_prefix() function may return a
path with '/' as delimiter when using SMB1 UNIX extensions, for
example.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
If from cifs_revalidate_dentry_attr() the SMB2/QUERY_INFO call fails with an
error, such as STATUS_SESSION_EXPIRED, causing the session to be reconnected
it is possible we will leak -EAGAIN back to the application even for
system calls such as stat() where this is not a valid error.
Fix this by re-trying the operation from within cifs_revalidate_dentry_attr()
if cifs_get_inode_info*() returns -EAGAIN.
This fixes stat() and possibly also other system calls that uses
cifs_revalidate_dentry*().
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
Rework the flushing of proc to use a list of directory inodes that
need to be flushed.
The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.
This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.
This flushing remains an optimization. The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died. When unused the shrinker will
eventually be able to remove these dentries.
There is a case in de_thread where proc_flush_pid can be
called early for a given pid. Which winds up being
safe (if suboptimal) as this is just an optiimization.
Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.
So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid. Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place. This removes a small race
where a dentry could recreated.
As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.
The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.
v2: Initialize pid->inodes. I somehow failed to get that
initialization into the initial version of the patch. A boot
failure was reported by "kernel test robot <lkp@intel.com>", and
failure to initialize that pid->inodes matches all of the reported
symptoms.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This just keeps everything tidier, and allows for using flags like
SLAB_TYPESAFE_BY_RCU where slabs are not always cleared before reuse.
I don't see reuse without reinitializing happening with the proc_inode
but I had a false alarm while reworking flushing of proc dentries and
indoes when a process dies that caused me to tidy this up.
The code is a little easier to follow and reason about this
way so I figured the changes might as well be kept.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The function d_prune_aliases has the problem that it will only prune
aliases thare are completely unused. It will not remove aliases for
the dcache or even think of removing mounts from the dcache. For that
behavior d_invalidate is needed.
To use d_invalidate replace d_prune_aliases with d_find_alias followed
by d_invalidate and dput.
For completeness the directory and the non-directory cases are
separated because in theory (although not in currently in practice for
proc) directories can only ever have a single dentry while
non-directories can have hardlinks and thus multiple dentries.
As part of this separation use d_find_any_alias for directories
to spare d_find_alias the extra work of doing that.
Plus the differences between d_find_any_alias and d_find_alias makes
it clear why the directory and non-directory code and not share code.
To make it clear these routines now invalidate dentries rename
proc_prune_siblings_dache to proc_invalidate_siblings_dcache, and rename
proc_sys_prune_dcache proc_sys_invalidate_dcache.
V2: Split the directory and non-directory cases. To make this
code robust to future changes in proc.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
We somehow never free the idr, even though we init it for every ctx.
Free it when the rest of the ring data is freed.
Fixes: 071698e13a ("io_uring: allow registering credentials")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we have a chain of requests and they don't all use the same
credentials, then the head of the chain will be issued with the
credentails of the tail of the chain.
Ensure __io_queue_sqe() overrides the credentials, if they are different.
Once we do that, we can clean up the creds handling as well, by only
having io_submit_sqe() do the lookup of a personality. It doesn't need
to assign it, since __io_queue_sqe() now always does the right thing.
Fixes: 75c6a03904 ("io_uring: support using a registered personality for commands")
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The UEFI spec rev 2.8 permits firmware implementations to support only
a subset of EFI runtime services at OS runtime (i.e., after the call to
ExitBootServices()), so let's take this into account in the drivers that
rely specifically on the availability of the EFI variable services.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5SdTAACgkQxWXV+ddt
WDtOPQ/+PGbN2bfZSf0DBUcOC+IOdiCYWHmaO6NQZzQb4CTjuumsDErSTzbg77OQ
gBqRBfcaWnyZZIabHe0JjctQ//buycP8yGb8VevgVdD4qBnb5sM/pvBqT4kO9Omy
6K3W7RhcP8gejOxGJb4nuiUBdYRW3KTji8QT3TMv/njuqn0p8n1NgzVlz0m6+xtO
IfJOqThvvG9LBytsQq/pvqaoXID/06lXRyM42XbsKFyygv10vp69xz5Skdl1XkGk
pyymzLlLDkLorRuDkjnzIvOqNFDAYiSFJafHupSE4SlfscRYfTwCV0uXE+NX4bdL
piTr5169ALoIRyTHU37YNNCPZXNGHnmn5Mtf4o8Q0ps0MKz1+sIjFRAxrdolLHjw
iYcocXU9L5wtHUBouTBzAsnuWJnizJNExHYb/3MHnHZYDu371l8wOGL1AHzRXEm/
qkxTPS3V2JpaFkFvTmwEQbl8PzDgpxpPDDBQUcEBZn3Vb9AFX38Fo/5OiGOnpnTd
9wVKXK2S6vHz/xpfcT/3SOtDUljSPJMUXUVkVcdz+OfUWvn6icXzR0t4keXVqZv+
INSOHzXb+iCuIb+NX7VZTN9oq0GH58aA+ApIo5beNbwJ6EzFotGE8dQmK3AfQ6pU
Pod0gzW5Hqj9Q2AfWYTZjWP+Og+dF2+bQFYSRK8NjcKjcrvUEVU=
=Asj0
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"These are fixes that were found during testing with help of error
injection, plus some other stable material.
There's a fixup to patch added to rc1 causing locking in wrong context
warnings, tests found one more deadlock scenario. The patches are
tagged for stable, two of them now in the queue but we'd like all
three released at the same time.
I'm not happy about fixes to fixes in such a fast succession during
rcs, but I hope we found all the fallouts of commit 28553fa992
('Btrfs: fix race between shrinking truncate and fiemap')"
* tag 'for-5.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof
Btrfs: fix btrfs_wait_ordered_range() so that it waits for all ordered extents
btrfs: fix bytes_may_use underflow in prealloc error condtition
btrfs: handle logged extent failure properly
btrfs: do not check delayed items are empty for single transaction cleanup
btrfs: reset fs_root to NULL on error in open_ctree
btrfs: destroy qgroup extent records on transaction abort
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl5R8vEACgkQ8vlZVpUN
gaMkPQf/QpimFVWvW+y2u9wOCl4pS38fog3SEbaCMcmCjndUfgLd9zf43GetFUfD
DYbxmzotu+WEqHH83H6c+Cr/9tmhxrH5njhydxlzucocqyxdWmdWKe5cNz3ECJ6Z
c4B1HFux+w/AfSGs73AU1K9APHlc/yXnZhgHpjLON6mP0Ata9lRZkmxwe9RnSWEn
186U1/kWe6sHNyOe1iQJC1QOPSauqY8SQDTZr5QSHLEyO7M/eJje+bplocor6JnJ
HTsKHdP1dNQaQzZxup4QgvZ33vAfgsgwIFtJKhF4ps+2NsILJzH5FfYW+dHTpnqe
INuJM5kPkkUuNnQqCfFDOvmaDGwjqQ==
=i1ka
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"More miscellaneous ext4 bug fixes (all stable fodder)"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix mount failure with quota configured as module
jbd2: fix ocfs2 corrupt when clearing block group bits
ext4: fix race between writepages and enabling EXT4_EXTENTS_FL
ext4: rename s_journal_flag_rwsem to s_writepages_rwsem
ext4: fix potential race between s_flex_groups online resizing and access
ext4: fix potential race between s_group_info online resizing and access
ext4: fix potential race between online resizing and write operations
ext4: add cond_resched() to __ext4_find_entry()
ext4: fix a data race in EXT4_I(inode)->i_disksize
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5RXt4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprPZEACevRyIjhdEmD9eyXlixw1O6zs/dHR4QVf6
RuuNoX1Ssxmf4zHBPcifBVenoUhIviJ/hBACdYNuPz+YWdx3FO/BF8FFv656ssHr
xhj8sC/8vz+fnwKyb/Lwt56NdRc8Ddtw6iWsF4po650n7JItq8BmDkHT/y3SJI0Z
L1UrUX4TxXEDfKsW2gbNCNIPjaiDSErJFP6FT1pcUZwLmF3zyJC6btR21AaAJbRC
CwatdbBg9K1SnvArn/NMd16C0p1LVBt3P2clagC90zlkCyb2vANN+YTnbo7KCsX7
XmssosPu5lamJQdsTNNxH7DHVUh/lZg9CEhUpy2ctXYSf1a6Ak6Y3qktCM5VW7FX
x+6aZdJj0UDdA+MvdcHZWjxKfJFmbS2iRjTfbTXpyLX/1qFmvI9ww9xzgP68iK8s
guxLxOQoCDx102SNKGmffcKY2C+yl3HHGRZATxy9C85WSvz7bwtvcbWwT/x13UxO
TWa8ghe0N4jfJ3sNfADZ0Dtehrj8ryslrRc0XS6y7v3m7MqOABkz7texH006j43G
FW23kqMyYJTlm+JIEIly9C5MSd4nFU0gyfBtMKGMBHF2JHgZez1LkDEiC2B5O1he
m9IAhGgFzgOuTFwJxwLcutDNUv4GyK6dMdLl+DzAv0hthSHjVsT2vb06X99NMenq
nzMADXHvxQ==
=euzH
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.6-2020-02-22' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Here's a small collection of fixes that were queued up:
- Remove unnecessary NULL check (Dan)
- Missing io_req_cancelled() call in fallocate (Pavel)
- Put the cleanup check for aux data in the right spot (Pavel)
- Two fixes for SQPOLL (Stefano, Xiaoguang)"
* tag 'io_uring-5.6-2020-02-22' of git://git.kernel.dk/linux-block:
io_uring: fix __io_iopoll_check deadlock in io_sq_thread
io_uring: prevent sq_thread from spinning when it should stop
io_uring: fix use-after-free by io_cleanup_req()
io_uring: remove unnecessary NULL checks
io_uring: add missing io_req_cancelled()
Since commit a3a0e43fd7 ("io_uring: don't enter poll loop if we have
CQEs pending"), if we already events pending, we won't enter poll loop.
In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if app has
been terminated and don't reap pending events which are already in cq
ring, and there are some reqs in poll_list, io_sq_thread will enter
__io_iopoll_check(), and find pending events, then return, this loop
will never have a chance to exit.
I have seen this issue in fio stress tests, to fix this issue, let
io_sq_thread call io_iopoll_getevents() with argument 'min' being zero,
and remove __io_iopoll_check().
Fixes: a3a0e43fd7 ("io_uring: don't enter poll loop if we have CQEs pending")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hardware registers of devices under control of power management cannot
be accessed at all times. If such a device is suspended, register
accesses may lead to undefined behavior, like reading bogus values, or
causing exceptions or system lock-ups.
Extend struct debugfs_regset32 with an optional field to let device
drivers specify the device the registers in the set belong to. This
allows debugfs_show_regset32() to make sure the device is resumed while
its registers are being read.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Niklas Söderlund <niklas.soderlund@ragnatech.se>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When CONFIG_QFMT_V2 is configured as a module, the test in
ext4_feature_set_ok() fails and so mount of filesystems with quota or
project features fails. Fix the test to use IS_ENABLED macro which
works properly even for modules.
Link: https://lore.kernel.org/r/20200221100835.9332-1-jack@suse.cz
Fixes: d65d87a074 ("ext4: improve explanation of a mount failure caused by a misconfigured kernel")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running
on it, the following warning in ext4_add_complete_io() can be hit:
WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120
Here's a minimal reproducer (not 100% reliable) (root isn't required):
while true; do
sync
done &
while true; do
rm -f file
touch file
chattr -e file
echo X >> file
chattr +e file
done
The problem is that in ext4_writepages(), ext4_should_dioread_nolock()
(which only returns true on extent-based files) is checked once to set
the number of reserved journal credits, and also again later to select
the flags for ext4_map_blocks() and copy the reserved journal handle to
ext4_io_end::handle. But if EXT4_EXTENTS_FL is being concurrently set,
the first check can see dioread_nolock disabled while the later one can
see it enabled, causing the reserved handle to unexpectedly be NULL.
Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races
related to doing so as well, fix this by synchronizing changing
EXT4_EXTENTS_FL with ext4_writepages() via the existing
s_writepages_rwsem (previously called s_journal_flag_rwsem).
This was originally reported by syzbot without a reproducer at
https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf,
but now that dioread_nolock is the default I also started seeing this
when running syzkaller locally.
Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org
Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com
Fixes: 6b523df4fb ("ext4: use transaction reservation for extent conversion in ext4_end_io")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
In preparation for making s_journal_flag_rwsem synchronize
ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA
flags (rather than just JOURNAL_DATA as it does currently), rename it to
s_writepages_rwsem.
Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
During an online resize an array of s_flex_groups structures gets replaced
so it can get enlarged. If there is a concurrent access to the array and
this memory has been reused then this can lead to an invalid memory access.
The s_flex_group array has been converted into an array of pointers rather
than an array of structures. This is to ensure that the information
contained in the structures cannot get out of sync during a resize due to
an accessor updating the value in the old structure after it has been
copied but before the array pointer is updated. Since the structures them-
selves are no longer copied but only the pointers to them this case is
mitigated.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
This is necessary because unless userspace explicitly requests fstype
"nfs4" (either via "mount -t nfs4" or by calling the "mount.nfs4" helper
directly), the fstype will default to "nfs".
This was fine on older kernels because the super_block->s_type was set
via mount_info->nfs_mod->nfs_fs, which was set when parsing the mount
options and subsequently passed in the "type" argument of sget().
After commit f2aedb713c ("NFS: Add fs_context support."), sget_fc(),
which has no "type" argument, is called instead. In sget_fc(), the
super_block->s_type is set via fs_context->fs_type, which was set when
the filesystem context was initially created.
Reported-by: Patrick Steinhardt <ps@pks.im>
Fixes: f2aedb713c ("NFS: Add fs_context support.")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Because there are likely to be several sysctls in a row on the
same superblock cache the super_block after the count has
been raised and don't deactivate it until we are processing
another super_block.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This patch drops 'cur_mm' before calling cond_resched(), to prevent
the sq_thread from spinning even when the user process is finished.
Before this patch, if the user process ended without closing the
io_uring fd, the sq_thread continues to spin until the
'sq_thread_idle' timeout ends.
In the worst case where the 'sq_thread_idle' parameter is bigger than
INT_MAX, the sq_thread will spin forever.
Fixes: 6c271ce2f1 ("io_uring: add submission polling")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While logging the prealloc extents of an inode during a fast fsync we call
btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
holding a read lock on a leaf of the inode's root (not the log root, the
fs/subvol root), and then that function locks the file range in the inode's
iotree. This can lead to a deadlock when:
* the fsync is ranged
* the file has prealloc extents beyond eof
* writeback for a range different from the fsync range starts
during the fsync
* the size of the file is not sector size aligned
Because when finishing an ordered extent we lock first a file range and
then try to COW the fs/subvol tree to insert an extent item.
The following diagram shows how the deadlock can happen.
CPU 1 CPU 2
btrfs_sync_file()
--> for range [0, 1MiB)
--> inode has a size of
1MiB and has 1 prealloc
extent beyond the
i_size, starting at offset
4MiB
flushes all delalloc for the
range [0MiB, 1MiB) and waits
for the respective ordered
extents to complete
--> before task at CPU 1 locks the
inode, a write into file range
[1MiB, 2MiB + 1KiB) is made
--> i_size is updated to 2MiB + 1KiB
--> writeback is started for that
range, [1MiB, 2MiB + 4KiB)
--> end offset rounded up to
be sector size aligned
btrfs_log_dentry_safe()
btrfs_log_inode_parent()
btrfs_log_inode()
btrfs_log_changed_extents()
btrfs_log_prealloc_extents()
--> does a search on the
inode's root
--> holds a read lock on
leaf X
btrfs_finish_ordered_io()
--> locks range [1MiB, 2MiB + 4KiB)
--> end offset rounded up
to be sector size aligned
--> tries to cow leaf X, through
insert_reserved_file_extent()
--> already locked by the
task at CPU 1
btrfs_truncate_inode_items()
--> gets an i_size of
2MiB + 1KiB, which is
not sector size
aligned
--> tries to lock file
range [2MiB, (u64)-1)
--> the start range
is rounded down
from 2MiB + 1K
to 2MiB to be sector
size aligned
--> but the subrange
[2MiB, 2MiB + 4KiB) is
already locked by
task at CPU 2 which
is waiting to get a
write lock on leaf X
for which we are
holding a read lock
*** deadlock ***
This results in a stack trace like the following, triggered by test case
generic/561 from fstests:
[ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
[ 2779.979536] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2779.990136] kworker/u8:6 D 0 247 2 0x80004000
[ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[ 2779.990466] Call Trace:
[ 2779.990491] ? __schedule+0x384/0xa30
[ 2779.990521] schedule+0x33/0xe0
[ 2779.990616] btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
[ 2779.990632] ? remove_wait_queue+0x60/0x60
[ 2779.990730] btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
[ 2779.990782] btrfs_search_slot+0x510/0x1000 [btrfs]
[ 2779.990869] btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
[ 2779.990944] __btrfs_drop_extents+0x161/0x1060 [btrfs]
[ 2779.990987] ? mark_held_locks+0x6d/0xc0
[ 2779.990994] ? __slab_alloc.isra.49+0x99/0x100
[ 2779.991060] ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
[ 2779.991145] insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
[ 2779.991222] ? start_transaction+0xdd/0x5c0 [btrfs]
[ 2779.991291] btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
[ 2779.991405] btrfs_work_helper+0xaa/0x720 [btrfs]
[ 2779.991432] process_one_work+0x26d/0x6a0
[ 2779.991460] worker_thread+0x4f/0x3e0
[ 2779.991481] ? process_one_work+0x6a0/0x6a0
[ 2779.991489] kthread+0x103/0x140
[ 2779.991499] ? kthread_create_worker_on_cpu+0x70/0x70
[ 2779.991515] ret_from_fork+0x3a/0x50
(...)
[ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
[ 2780.027480] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2780.030035] fsstress D 0 17375 17373 0x00004000
[ 2780.030038] Call Trace:
[ 2780.030044] ? __schedule+0x384/0xa30
[ 2780.030052] schedule+0x33/0xe0
[ 2780.030075] lock_extent_bits+0x20c/0x320 [btrfs]
[ 2780.030094] ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
[ 2780.030098] ? rcu_read_lock_sched_held+0x59/0xa0
[ 2780.030102] ? remove_wait_queue+0x60/0x60
[ 2780.030122] btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
[ 2780.030151] ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
[ 2780.030165] ? btrfs_search_slot+0x379/0x1000 [btrfs]
[ 2780.030195] btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
[ 2780.030202] ? do_raw_spin_unlock+0x49/0xc0
[ 2780.030215] ? btrfs_get_num_csums+0x10/0x10 [btrfs]
[ 2780.030239] btrfs_log_inode+0xf83/0x1124 [btrfs]
[ 2780.030251] ? __mutex_unlock_slowpath+0x45/0x2a0
[ 2780.030275] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
[ 2780.030282] ? dget_parent+0xa1/0x370
[ 2780.030309] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[ 2780.030329] btrfs_sync_file+0x3f3/0x490 [btrfs]
[ 2780.030339] do_fsync+0x38/0x60
[ 2780.030343] __x64_sys_fdatasync+0x13/0x20
[ 2780.030345] do_syscall_64+0x5c/0x280
[ 2780.030348] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
[ 2780.030361] Code: Bad RIP value.
[ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
[ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
[ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
[ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
[ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
So fix this by making btrfs_truncate_inode_items() not lock the range in
the inode's iotree when the target root is a log root, since it's not
needed to lock the range for log roots as the protection from the inode's
lock and log_mutex are all that's needed.
Fixes: 28553fa992 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During an online resize an array of pointers to s_group_info gets replaced
so it can get enlarged. If there is a concurrent access to the array in
ext4_get_group_info() and this memory has been reused then this can lead to
an invalid memory access.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Balbir Singh <sblbir@amazon.com>
Cc: stable@kernel.org
During an online resize an array of pointers to buffer heads gets
replaced so it can get enlarged. If there is a racing block
allocation or deallocation which uses the old array, and the old array
has gotten reused this can lead to a GPF or some other random kernel
memory getting modified.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.edu
Reported-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
We need to allow some glocks to be enqueued, dequeued, promoted, and demoted
when we're withdrawn. For example, to maintain metadata integrity, we should
disallow the use of inode and rgrp glocks when withdrawn. Other glocks, like
iopen or the transaction glocks may be safely used because none of their
metadata goes through the journal. So in general, we should disallow all
glocks with an address space, and allow all the others. One exception is:
we need to allow our active journal to be demoted so others may recover it.
Allowing glocks after withdraw gives us the ability to take appropriate
action (in a following patch) to have our journal properly replayed by
another node rather than just abandoning the current transactions and
pretending nothing bad happened, leaving the other nodes free to modify
the blocks we had in our journal, which may result in file system
corruption.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This prepares the way for allowing the pid part of proc to use this
dcache pruning code as well.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
I about to need and use the same functionality for pid based
inodes and there is no point in adding a second field when
this field is already here and serving the same purporse.
Just give the field a generic name so it is clear that
it is no longer sysctl specific.
Also for good measure initialize sibling_inodes when
proc_inode is initialized.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
We tested a soft lockup problem in linux 4.19 which could also
be found in linux 5.x.
When dir inode takes up a large number of blocks, and if the
directory is growing when we are searching, it's possible the
restart branch could be called many times, and the do while loop
could hold cpu a long time.
Here is the call trace in linux 4.19.
[ 473.756186] Call trace:
[ 473.756196] dump_backtrace+0x0/0x198
[ 473.756199] show_stack+0x24/0x30
[ 473.756205] dump_stack+0xa4/0xcc
[ 473.756210] watchdog_timer_fn+0x300/0x3e8
[ 473.756215] __hrtimer_run_queues+0x114/0x358
[ 473.756217] hrtimer_interrupt+0x104/0x2d8
[ 473.756222] arch_timer_handler_virt+0x38/0x58
[ 473.756226] handle_percpu_devid_irq+0x90/0x248
[ 473.756231] generic_handle_irq+0x34/0x50
[ 473.756234] __handle_domain_irq+0x68/0xc0
[ 473.756236] gic_handle_irq+0x6c/0x150
[ 473.756238] el1_irq+0xb8/0x140
[ 473.756286] ext4_es_lookup_extent+0xdc/0x258 [ext4]
[ 473.756310] ext4_map_blocks+0x64/0x5c0 [ext4]
[ 473.756333] ext4_getblk+0x6c/0x1d0 [ext4]
[ 473.756356] ext4_bread_batch+0x7c/0x1f8 [ext4]
[ 473.756379] ext4_find_entry+0x124/0x3f8 [ext4]
[ 473.756402] ext4_lookup+0x8c/0x258 [ext4]
[ 473.756407] __lookup_hash+0x8c/0xe8
[ 473.756411] filename_create+0xa0/0x170
[ 473.756413] do_mkdirat+0x6c/0x140
[ 473.756415] __arm64_sys_mkdirat+0x28/0x38
[ 473.756419] el0_svc_common+0x78/0x130
[ 473.756421] el0_svc_handler+0x38/0x78
[ 473.756423] el0_svc+0x8/0xc
[ 485.755156] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [tmp:5149]
Add cond_resched() to avoid soft lockup and to provide a better
system responding.
Link: https://lore.kernel.org/r/20200215080206.13293-1-luoshijie1@huawei.com
Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by
KCSAN,
BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4]
write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127:
ext4_write_end+0x4e3/0x750 [ext4]
ext4_update_i_disksize at fs/ext4/ext4.h:3032
(inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046
(inlined by) ext4_write_end at fs/ext4/inode.c:1287
generic_perform_write+0x208/0x2a0
ext4_buffered_write_iter+0x11f/0x210 [ext4]
ext4_file_write_iter+0xce/0x9e0 [ext4]
new_sync_write+0x29c/0x3b0
__vfs_write+0x92/0xa0
vfs_write+0x103/0x260
ksys_write+0x9d/0x130
__x64_sys_write+0x4c/0x60
do_syscall_64+0x91/0xb47
entry_SYSCALL_64_after_hwframe+0x49/0xbe
read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37:
ext4_writepages+0x10ac/0x1d00 [ext4]
mpage_map_and_submit_extent at fs/ext4/inode.c:2468
(inlined by) ext4_writepages at fs/ext4/inode.c:2772
do_writepages+0x5e/0x130
__writeback_single_inode+0xeb/0xb20
writeback_sb_inodes+0x429/0x900
__writeback_inodes_wb+0xc4/0x150
wb_writeback+0x4bd/0x870
wb_workfn+0x6b4/0x960
process_one_work+0x54c/0xbe0
worker_thread+0x80/0x650
kthread+0x1e0/0x200
ret_from_fork+0x27/0x50
Reported by Kernel Concurrency Sanitizer on:
CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G W O L 5.5.0-next-20200204+ #5
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
Workqueue: writeback wb_workfn (flush-7:0)
Since only the read is operating as lockless (outside of the
"i_data_sem"), load tearing could introduce a logic bug. Fix it by
adding READ_ONCE() for the read and WRITE_ONCE() for the write.
Signed-off-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
The __compat_only_sysfs_link_entry_to_kobj function creates a symlink
to a kobject but doesn't provide an option to change the symlink file
name.
This patch adds a wrapper function compat_only_sysfs_link_entry_to_kobj
that extends the __compat_only_sysfs_link_entry_to_kobj functionality
which allows function caller to customize the symlink name.
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
[mpe: Fix compile error when CONFIG_SYSFS=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191211160910.21656-3-sourabhjain@linux.ibm.com
io_cleanup_req() should be called before req->io is freed, and so
shouldn't be after __io_free_req() -> __io_req_aux_free(). Also,
it will be ignored for in io_free_req_many(), which use
__io_req_aux_free().
Place cleanup_req() into __io_req_aux_free().
Fixes: 99bc4c3853 ("io_uring: fix iovec leaks")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In btrfs_wait_ordered_range() once we find an ordered extent that has
finished with an error we exit the loop and don't wait for any other
ordered extents that might be still in progress.
All the users of btrfs_wait_ordered_range() expect that there are no more
ordered extents in progress after that function returns. So past fixes
such like the ones from the two following commits:
ff612ba784 ("btrfs: fix panic during relocation after ENOSPC before
writeback happens")
28aeeac1dd ("Btrfs: fix panic when starting bg cache writeout after
IO error")
don't work when there are multiple ordered extents in the range.
Fix that by making btrfs_wait_ordered_range() wait for all ordered extents
even after it finds one that had an error.
Link: https://github.com/kdave/btrfs-progs/issues/228#issuecomment-569777554
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I hit the following warning while running my error injection stress
testing:
WARNING: CPU: 3 PID: 1453 at fs/btrfs/space-info.h:108 btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
RIP: 0010:btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
Call Trace:
btrfs_free_reserved_data_space+0x4f/0x70 [btrfs]
__btrfs_prealloc_file_range+0x378/0x470 [btrfs]
elfcorehdr_read+0x40/0x40
? elfcorehdr_read+0x40/0x40
? btrfs_commit_transaction+0xca/0xa50 [btrfs]
? dput+0xb4/0x2a0
? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
? btrfs_sync_file+0x30e/0x420 [btrfs]
? do_fsync+0x38/0x70
? __x64_sys_fdatasync+0x13/0x20
? do_syscall_64+0x5b/0x1b0
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens if we fail to insert our reserved file extent. At this
point we've already converted our reservation from ->bytes_may_use to
->bytes_reserved. However once we break we will attempt to free
everything from [cur_offset, end] from ->bytes_may_use, but our extent
reservation will overlap part of this.
Fix this problem by adding ins.offset (our extent allocation size) to
cur_offset so we remove the actual remaining part from ->bytes_may_use.
I validated this fix using my inject-error.py script
python inject-error.py -o should_fail_bio -t cache_save_setup -t \
__btrfs_prealloc_file_range \
-t insert_reserved_file_extent.constprop.0 \
-r "-5" ./run-fsstress.sh
where run-fsstress.sh simply mounts and runs fsstress on a disk.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're allocating a logged extent we attempt to insert an extent
record for the file extent directly. We increase
space_info->bytes_reserved, because the extent entry addition will call
btrfs_update_block_group(), which will convert the ->bytes_reserved to
->bytes_used. However if we fail at any point while inserting the
extent entry we will bail and leave space on ->bytes_reserved, which
will trigger a WARN_ON() on umount. Fix this by pinning the space if we
fail to insert, which is what happens in every other failure case that
involves adding the extent entry.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_assert_delayed_root_empty() will check if the delayed root is
completely empty, but this is a filesystem-wide check. On cleanup we
may have allowed other transactions to begin, for whatever reason, and
thus the delayed root is not empty.
So remove this check from cleanup_one_transation(). This however can
stay in btrfs_cleanup_transaction(), because it checks only after all of
the transactions have been properly cleaned up, and thus is valid.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While running my error injection script I hit a panic when we tried to
clean up the fs_root when freeing the fs_root. This is because
fs_info->fs_root == PTR_ERR(-EIO), which isn't great. Fix this by
setting fs_info->fs_root = NULL; if we fail to read the root.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We clean up the delayed references when we abort a transaction but we
leave the pending qgroup extent records behind, leaking memory.
This patch destroys the extent records when we destroy the delayed refs
and makes sure ensure they're gone before releasing the transaction.
Fixes: 3368d001ba ("btrfs: qgroup: Record possible quota-related extent for qgroup.")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
[ Rebased to latest upstream, remove to_qgroup() helper, use
rbtree_postorder_for_each_entry_safe() wrapper ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Andrei Vagin reported that commit 0ddad21d3e ("pipe: use exclusive
waits when reading or writing") broke one of the CRIU tests. He even
has a trivial reproducer:
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main()
{
int p[2];
pid_t p1, p2;
int status;
if (pipe(p) == -1)
return 1;
p1 = fork();
if (p1 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
p2 = fork();
if (p2 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
sleep(1);
close(p[1]);
wait(&status);
wait(&status);
return 0;
}
and the problem - once he points it out - is obvious. We use these nice
exclusive waits, but when the last writer goes away, it then needs to
wake up _every_ reader (and conversely, the last reader disappearing
needs to wake every writer, of course).
In fact, when going through this, we had several small oddities around
how to wake things. We did in fact wake every reader when we changed
the size of the pipe buffers. But that's entirely pointless, since that
just acts as a possible source of new space - no new data to read.
And when we change the size of the buffer, we don't need to wake all
writers even when we add space - that case acts just as if somebody made
space by reading, and any writer that finds itself not filling it up
entirely will wake the next one.
On the other hand, on the exit path, we tried to limit the wakeups with
the proper poll keys etc, which is entirely pointless, because at that
point we obviously need to wake up everybody. So don't do that: just
wake up everybody - but only do that if the counts changed to zero.
So fix those non-IO wakeups to be more proper: space change doesn't add
any new data, but it might make room for writers, so it wakes up a
writer. And the actual changes to reader/writer counts should wake up
everybody, since everybody is affected (ie readers will all see EOF if
the writers have gone away, and writers will all get EPIPE if all
readers have gone away).
Fixes: 0ddad21d3e ("pipe: use exclusive waits when reading or writing")
Reported-and-tested-by: Andrei Vagin <avagin@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "kmsg" pointer can't be NULL and we have already dereferenced it so
a check here would be useless.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Downgrade the eCryptfs maintenance status to "Odd Fixes"
- Change my email address
- Fix a couple memory leaks in error paths
- Stability improvement to avoid a needless BUG_ON()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKvuQkp28KvPJn/fVfL7QslYSS/0FAl5K+wkACgkQfL7QslYS
S/1vrA//TIaQWadfWeXbdd8Kj+zEsd6BDZEiGq56w5WtuYe+XcAKdXITBLXXVZR1
/MHR4h16BpkmyRbKojvGjqECLf8W8XeZC30gEXJEM7PVbFvLvM92ZKUT/K3ax7Tt
HBQs8L0MM2BDWFDyfQ9vWnPakIvy1pHS4iHQeJgtcBfqcMS+u6fYmpoCIbWZ9UfA
7ilS931CjFxgRvKdwMRp4C08fUs7awqYcZxkDMjp3uAOSp0YFtp3/EiPrNAe1+ot
/PuwzV69a7raqRurdBUApj9QMY1VmKheY80Wgl4C0aS5lc+JUiOpclKDRW174W51
UFI9CanRcqNig7KY+eTenJWDS8l9nEK40nQWKHQZ5JT+kitWMyvjrXh6MwIRFvKE
N+7gSWZWumQknKSg3y1AVkEZK+mHw7lRJ8ziLCxrR28EkS3TTvw2mpp80ed0jC/v
SCUwj4rqP1l7wR7xa4vlYSXOLznK/DRBSrud7bbQlZOTjVXZAeNv2LOe8qLu8C0b
j+/Mhf6OKUd5c7Pjl+7YD1/DcKsUlayyVn0k2HX/V17QK1a5Kp5R+6mbvyFclq1T
nQTgu5prq7BiS49GGY/JePO0wKCi2+A3T87rj2Y1yfGjYNgcR/zsQa2LuPtP92Tt
5aAHpP+lC9CoJohYjM/cRncq1rsOnBchQRz+i/VgVtUqJ9vfgXM=
=kXnY
-----END PGP SIGNATURE-----
Merge tag 'ecryptfs-5.6-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull eCryptfs fixes from Tyler Hicks:
- downgrade the eCryptfs maintenance status to "Odd Fixes"
- change my email address
- fix a couple memory leaks in error paths
- stability improvement to avoid a needless BUG_ON()
* tag 'ecryptfs-5.6-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
ecryptfs: replace BUG_ON with error handling code
eCryptfs: Replace deactivated email address
MAINTAINERS: eCryptfs: Update maintainer address and downgrade status
ecryptfs: fix a memory leak bug in ecryptfs_init_messaging()
ecryptfs: fix a memory leak bug in parse_tag_1_packet()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5K0xIACgkQxWXV+ddt
WDvihhAAjTdEr8Gbt4Beu+F9I2j+1+x/qdVADsrJErBJ5s1J0prXI27Vx5auHMMD
qJViTGTTjwlIo4jFI+AtlrzcE07eIMN1zzjVb2o8efrpkFDTfgT+dAkSVR8zAX20
nycsQ93QebCrlRGpPuXsniczy+QTRXWBY+vPXe1UbfGrzb3q+1pBJgfdq/7pkt/V
0RgB78JDVWUhw+YBjLuV0Y7tjmWlANGvofT34A34CR23i/JoIjVx7GfeQyILBmHE
iuCwzMgAdIKlRpjG5WJeK60bW0Dkjmbr3HDyJexiXSa4u2UwgUlT8UsITbVCr48t
OrfYxb7aWmwEqa5fbZzIiyTCWDKSxiQ5X0XW8e6ehdkh+s9DWmNp48xt9OogpYcv
8qni+vMgjrOLZwj//DgHpr2CYrISxJ+vhCvaOLMK++8TMIH/EQZEyaYkKn5Osg/8
iSwI9b2IF8LmnnvWhkbs1p4Jn2Ca4fMhlN2D5xflwSK/lbhgJk5E6P+4h6pQ9MvG
U5bWR2mqW1xT+lCO7AD+MBq+VbF/BZik0TjuNssoYJJtx/VbTbpydrHW65SvZvQK
QT7MTHekkbtIR+SGYl66n1Sc4RG7znWNwjfoyVEbSblLIczfzDDsl1O/9ldqbia4
BoPWghEO8Wy543UTH9DyX+oMzB2Pwb/TMa4eQyNhbQMqcwXHv5Q=
=nlEW
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"This is the fix for sleeping in a locked section bug reported by Dave
Jones, caused by a patch dependence in development and pulled
branches.
I picked the existing patch over the fixup that Filipe sent, as it's a
bit more generic fix. I've verified it with a specific test case, some
rsync stress and one round of fstests"
* tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't set path->leave_spinning for truncate
The only time we actually leave the path spinning is if we're truncating
a small amount and don't actually free an extent, which is not a common
occurrence. We have to set the path blocking in order to add the
delayed ref anyway, so the first extent we find we set the path to
blocking and stay blocking for the duration of the operation. With the
upcoming file extent map stuff there will be another case that we have
to have the path blocking, so just swap to blocking always.
Note: this patch also fixes a warning after 28553fa992 ("Btrfs: fix
race between shrinking truncate and fiemap") got merged that inserts
extent locks around truncation so the path must not leave spinning locks
after btrfs_search_slot.
[70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
[70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
[70.794863] 5 locks held by rsync/1141:
[70.794876] #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
[70.795030] #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
[70.795051] #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
[70.795124] #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795203] #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
[70.795362] Call Trace:
[70.795374] dump_stack+0x71/0xa0
[70.795445] ___might_sleep.part.96.cold.106+0xa6/0xb6
[70.795459] kmem_cache_alloc+0x1d3/0x290
[70.795471] alloc_extent_state+0x22/0x1c0
[70.795544] __clear_extent_bit+0x3ba/0x580
[70.795557] ? _raw_spin_unlock_irq+0x24/0x30
[70.795569] btrfs_truncate_inode_items+0x339/0xe50
[70.795647] btrfs_evict_inode+0x269/0x540
[70.795659] ? dput.part.38+0x29/0x460
[70.795671] evict+0xcd/0x190
[70.795682] __dentry_kill+0xd6/0x180
[70.795754] dput.part.38+0x2ad/0x460
[70.795765] do_renameat2+0x3cb/0x540
[70.795777] __x64_sys_rename+0x1c/0x20
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 28553fa992 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5JQQUACgkQxWXV+ddt
WDv+fRAAh2VeACdVgGPCqH3kBphdBzqT2x9EUWZK++AWTr0Upg4Mz1mvD2h1WX6L
Yd/cuG0hIy2AFl1fEwrOp9d64PGFRnVcs0IS0C9GC/kH9ZzrSuz3pcfni3uxCr0N
t+/Pghxb654fLPd8eho9Pnc9bi/EPPRZ+mmwup4A9MyC7Bdfzf821ykrhqwam54V
LWRUanVfeJOuxNbgF2Hjnr9ZI2fALsFfh5+q1VCeNsm76Le67RGVi7V6i9Gxj8eE
3zOH/XM1Jzft3cLinfimIUC0BQKBuFXSKnYun5zP47Y5VHgxYCcfBBAD71uGgs52
mxaz1vLGE6GZo631zBBXvM6BalEHyQROP8Daman9NiJDctkvICXzv/bwi4k5ZHSF
i45MANAtBb10S/AYitscqmGR7sYHZtKBkgJHjH+0YqKgcu3OWInqNVuW77kZEguV
/UjcdjAmxirvyAaxOdfXtBfuZKBU1kL+dZwg4OeqzbDoBsUAtkEG+3oIcjkGnP9e
6QuQx5XYKmI1tugcAz0uj5ryXW6UtIFVy+Fgxht+XMSNRR5CiUpu7Ndm/ZrKx5RS
/hv2KZot+/p7IA9J0YvZbYwwvw28YxR4PgGMOKYhqKnTVHR6eaFTDNKiaQRkJaXr
Nimk6TQrpc9Qu/rcoeyuQajxCRhwSqNAJH97MJPMDz39xolEys8=
=i1Ct
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two races fixed, memory leak fix, sysfs directory fixup and two new
log messages:
- two fixed race conditions: extent map merging and truncate vs
fiemap
- create the right sysfs directory with device information and move
the individual device dirs under it
- print messages when the tree-log is replayed at mount time or
cannot be replayed on remount"
* tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: sysfs, move device id directories to UUID/devinfo
btrfs: sysfs, add UUID/devinfo kobject
Btrfs: fix race between shrinking truncate and fiemap
btrfs: log message when rw remount is attempted with unclean tree-log
btrfs: print message when tree-log replay starts
Btrfs: fix race between using extent maps and merging them
btrfs: ref-verify: fix memory leaks
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl5IHmgACgkQiiy9cAdy
T1Hv9gv/QzgtpGUkK+8mLkv6BCoc0vChvr6s3xIxuSBOnAdhe6km1NSwVek7lUxU
Liupsw+mLOt4TO8hCsHKEIfR4QKnPp22UF0nUj6bYWMJH2+zNNwlcV1TF01xGW8t
J7jTKNd0MYOC1jQ+NLX4pwAz85Sp6yw4iYxg+PBpgPjEOUlGFttVK7U22nDMR1cz
hlNY8KfJ3CiAOQEGIu+lIpQAo2/WcvUUGKjDwOlIRUwsodPJdy5XCromPa4wktuS
OcDcrehfs16QfYW6K7rFSvRvcHUJ9nOf/Ht31rm9kD3DL8nN3MSV2vVWzszq+b4y
LFm/6Z0/rjYE9+W+/w0H5jFGatjGpChrGzzGTaTjqIEAh9xNFGmfd96ZpixGZpIL
SaJD8ab0RBUJ5Q0DAmBJHcaWCuN1ELbGu2ZFJNiJROH53epaRtyi54ryVv9XHNsh
RQTbmmnbTPPqlv4S0Rd++Z6H+eYBSU1Vbm7lmZzK84XHQ0w2ttnU+grVsxoUdPlb
530Yvu5z
=IQBE
-----END PGP SIGNATURE-----
Merge tag '5.6-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four small CIFS/SMB3 fixes. One (the EA overflow fix) for stable"
* tag '5.6-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: make sure we do not overflow the max EA buffer size
cifs: enable change notification for SMB2.1 dialect
cifs: Fix mode output in debugging statements
cifs: fix mount option display for sec=krb5i
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl5IrGgACgkQ8vlZVpUN
gaMi2Qf+NuXFQ6xpxjw90ZmkOXbsqbFVAIx3yjzW5WKIVyHrhG19P8NuNaCbUajP
WEUzrY/LMmcODd46zKcCR61jOygKCJ5NXyF8qTEFltiMuqugvwmTv49ofRxX+Zlf
YkavQGo+cPs5p1xzo+bKqhA29zhc4PApd3bGpjgUqDlfQxBmm2vNpkbXKj/Bk6CL
k9c1EodygpTO45Kb3Y/JvPcbTLOOu0hPnrCemtb1Vc9tm0j6I+g5RtQofUag+1BX
FOR5Z2EUwByb7F+TTxO1Jse0oFjcxH9J/VqeKi2Fbzucdq3NrRypiFj0XPFBRPFn
g7ONbaoIGiK+02pCNFuwbY0WmLQVfg==
=T3j1
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Miscellaneous ext4 bug fixes (all stable fodder)"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: improve explanation of a mount failure caused by a misconfigured kernel
jbd2: do not clear the BH_Mapped flag when forgetting a metadata buffer
jbd2: move the clearing of b_modified flag to the journal_unmap_buffer()
ext4: add cond_resched() to ext4_protect_reserved_inode
ext4: fix checksum errors with indexed dirs
ext4: fix support for inode sizes > 1024 bytes
ext4: simplify checking quota limits in ext4_statfs()
ext4: don't assume that mmp_nodename/bdevname have NUL
fallocate_finish() is missing cancellation check. Add it.
It's safe to do that, as only flags setup and sqe fields copy are done
before it gets into __io_fallocate().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If CONFIG_QFMT_V2 is not enabled, but CONFIG_QUOTA is enabled, when a
user tries to mount a file system with the quota or project quota
enabled, the kernel will emit a very confusing messsage:
EXT4-fs warning (device vdc): ext4_enable_quotas:5914: Failed to enable quota tracking (type=0, err=-3). Please run e2fsck to fix.
EXT4-fs (vdc): mount failed
We will now report an explanatory message indicating which kernel
configuration options have to be enabled, to avoid customer/sysadmin
confusion.
Link: https://lore.kernel.org/r/20200215012738.565735-1-tytso@mit.edu
Google-Bug-Id: 149093531
Fixes: 7c319d3285 ("ext4: make quota as first class supported feature")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Stable Bugfixes:
- Fix DMA scatter-gather list mapping imbalance
Other Fixes:
- Fix directory verifier races
- Fix races between open and dentry revalidation
- Fix revalidation of dentries with delegations
- Fix "cachethis" setting for writes
- Fix delegation and delegation cred pinning
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl5HEa4ACgkQ18tUv7Cl
QOvwAA/9FAIwiE5UU+OrbpJ47IKFay2lfu6ywWVt0Q9i/tDeTRawa9aZDNKmKuYk
rViVzRhIrmdkhpSVeTl3BdV/U9oXmNQ9ubFxWkQ0TI1j/tkTotT4dA0/kjeampZJ
dZeW+NtRnFxYtfyV6fV/Bp/GB1RE9tAr1hTzHYEpNfnKxmAgkOeIXsVebuWKvUxz
XCeRED30Ekt2n1JQqb1m3VAMpYmprF2i9TErAVR8ILt0dkkJD8nNlPogrZwMjzuD
GtB7BSJ7rIDHQomuF3HVrz+SpMhM530mpsZiVGYCKjChJWaPm12+AZGPtv3UcpAf
/M2VTa+z8MJDwirzjWogo3OvYAqZGI6KUG0o2t5dKmguQoi1RrDlPbFMLXlxow2e
5gi/qKFbnBOt6iIV9FffMb50jpnShBE6hQS4/rb2inGmAKYgEXcL+ouBY6308soS
fnWJSViTQfGbF9uVxGeOYl766zEqlNleclfjKBCNXMC3HOfIslIBpilMDwCoOHeB
MqYSqmnwAH9ZrqjNmR3Qh2RlcOxN8CTUMhkUW2KLMAgnEAd9WINaYu4u2Y50Fe6j
yDa6YEAmaNSGVYVofh8qA1R1Zi4tnKlg1/WFdy/SnEQKi8mQqXASGiVu2bXjhlF1
22cHVXNIjp1PMijZKnJtimuxChygj9xxgzw4Kstd1WCLm2VirAc=
=Ase9
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client bugfixes from Anna Schumaker:
"The only stable fix this time is the DMA scatter-gather list bug fixed
by Chuck.
The rest fix up races and refcounting issues that have been found
during testing.
Stable fix:
- fix DMA scatter-gather list mapping imbalance
The rest:
- fix directory verifier races
- fix races between open and dentry revalidation
- fix revalidation of dentries with delegations
- fix "cachethis" setting for writes
- fix delegation and delegation cred pinning"
* tag 'nfs-for-5.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4: Ensure the delegation cred is pinned when we call delegreturn
NFSv4: Ensure the delegation is pinned in nfs_do_return_delegation()
NFSv4.1 make cachethis=no for writes
xprtrdma: Fix DMA scatter-gather list mapping imbalance
NFSv4: Fix revalidation of dentries with delegations
NFSv4: Fix races between open and dentry revalidation
NFS: Fix up directory verifier races
of server path canonicalization patch that went into -rc1 and a fixup
for noacl mount option that got broken by the conversion to the new
mount API in 5.5.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl5GzToTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi6tKB/9VZ5DIkd5m3UyKh+rgw565LddPiJTU
zIngCUV2EwyRyTwgOCMzG/AsCxcjQDS3RzEZsCCIpAsT09vAJxP0uFFm8o3LE0xX
NX8F/0cHc14lqpI39dzB3HiBETx3XIf3eNUpSJphtwKqmn3MOhAASPK9xDZdICNh
VziVMOogns4QEYTxY+aA5DoKmkOuuolCVKUZj7/fyFXX90NGhLVBsIfRIuIE2O0v
xbtitsGqqZtf+1Iar4gKRLIWCpwzETSBE9G+GJlxqUSDz2VTovDDhn6tBtu65m8d
APvN9+Le355ts9iFuep3SYkYQhNXRZ/ydxx3fOqTM60Tzi3cVTtO+Mk4
=YoNz
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.6-rc2' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
- make O_DIRECT | O_APPEND combination work better
- redo the server path canonicalization patch that went into -rc1
- fix the 'noacl' mount option that got broken by the conversion to the
new mount API in 5.5
* tag 'ceph-for-5.6-rc2' of https://github.com/ceph/ceph-client:
ceph: noacl mount option is effectively ignored
ceph: canonicalize server path in place
ceph: do not execute direct write in parallel if O_APPEND is specified
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5GzcYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgptCNEAC7F8fAOTKDtYxl7vNgoSVd4lUJAllyaYFk
gFaxuh6fQXqXnqQhPQxJxHCTqvREqfdbGItLmWp/g3r9bEYZouKflPoVTSdgeTjz
47JoIXhdcv4A26lGO2N+tAuRDk68dcyG31tHqHNcrHhFL+1hcs8pMMf++KD8i/VF
kooo56E7Wfh4170q5Jr2pRpVv5O2G/WgkKqbmHRfOXLvvBvCNUDYGfK8Gu8glp7T
auMNdRy7+3uia5oBre5zz8oqeSXS8Th6BWUPTJYEE3kl4rnZMnVsRSfCXDszUchP
yl59tDM83WR4Jc/uq6baStL/IAjK6y76O3mKIZPdjoxhL3YIzPHOTxiVzynWsFDh
abbRqMdmst7c7qYIaWA0S6jZE7uN8eR9huuo+EwKdzXQeKJIYBLo447ysHmaglq3
amzEJUBrw/attgpqakQ1nL6aj5D+XhthuDCYKK5/447ygdNUQfES1ZobzHZVMjx1
74ljXJ24Y/QsFfqntf1szf8mh3lxD3KO85OGgxjYj/iv7mtBCPALTLn3xbwNzJ15
Mrb2xY7n6rxO6gCEmJEJAAs2vGQFewS8MaXQu/fOcUWujRP4a5xNvaL4YDZ6epXh
wXHPTrNDxmF948r78EYxF6N7BqR/uRv+razMYyZ/OWVi4CLogwAOoTtfaHhbA04H
Jh2GUgotdA==
=6MrA
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.6-2020-02-14' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Here's a set of fixes for io_uring:
- Various fixes with cleanups from Pavel, fixing corner cases where
we're not correctly dealing with iovec cleanup.
- Clarify that statx/openat/openat2 don't accept fixed files
- Buffered raw device write EOPTNOTSUPP fix
- Ensure async workers grab current->fs
- A few task exit fixes with pending requests that grab the file
table
- send/recvmsg async load fix
- io-wq offline node setup fix
- CQ overflow flush in poll"
* tag 'io_uring-5.6-2020-02-14' of git://git.kernel.dk/linux-block: (21 commits)
io_uring: prune request from overflow list on flush
io-wq: don't call kXalloc_node() with non-online node
io_uring: retain sockaddr_storage across send/recvmsg async punt
io_uring: cancel pending async work if task exits
io-wq: add io_wq_cancel_pid() to cancel based on a specific pid
io-wq: make io_wqe_cancel_work() take a match handler
io_uring: fix openat/statx's filename leak
io_uring: fix double prep iovec leak
io_uring: fix async close() with f_op->flush()
io_uring: allow AT_FDCWD for non-file openat/openat2/statx
io_uring: grab ->fs as part of async preparation
io-wq: add support for inheriting ->fs
io_uring: retry raw bdev writes if we hit -EOPNOTSUPP
io_uring: add cleanup for openat()/statx()
io_uring: fix iovec leaks
io_uring: remove unused struct io_async_open
io_uring: flush overflowed CQ events in the io_uring_poll()
io_uring: statx/openat/openat2 don't support fixed files
io_uring: fix deferred req iovec leak
io_uring: fix 1-bit bitfields to be unsigned
...
In crypt_scatterlist, if the crypt_stat argument is not set up
correctly, the kernel crashes. Instead, by returning an error code
upstream, the error is handled safely.
The issue is detected via a static analysis tool written by us.
Fixes: 237fead619 (ecryptfs: fs/Makefile and fs/Kconfig)
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Signed-off-by: Tyler Hicks <code@tyhicks.com>
Replace a recently deactived email address with one that I'll be able to
personally control and keep alive.
Signed-off-by: Tyler Hicks <code@tyhicks.com>
RHBZ: 1752437
Before we add a new EA we should check that this will not overflow
the maximum buffer we have available to read the EAs back.
Otherwise we can get into a situation where the EAs are so big that
we can not read them back to the client and thus we can not list EAs
anymore or delete them.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
It was originally enabled only for SMB3 or later dialects, but
had requests to add it to SMB2.1 mounts as well given the
large number of systems at that dialect level.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: L Walsh <cifs@tlinx.org>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Carter reported an issue where he could produce a stall on ring exit,
when we're cleaning up requests that match the given file table. For
this particular test case, a combination of a few things caused the
issue:
- The cq ring was overflown
- The request being canceled was in the overflow list
The combination of the above means that the cq overflow list holds a
reference to the request. The request is canceled correctly, but since
the overflow list holds a reference to it, the final put won't happen.
Since the final put doesn't happen, the request remains in the inflight.
Hence we never finish the cancelation flush.
Fix this by removing requests from the overflow list if we're canceling
them.
Cc: stable@vger.kernel.org # 5.5
Reported-by: Carter Li 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure we don't release the delegation cred during the call to
nfs4_proc_delegreturn().
Fixes: ee05f45677 ("NFSv4: Fix races between open and delegreturn")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The call to nfs_do_return_delegation() needs to be taken without
any RCU locks. Add a refcount to make sure the delegation remains
pinned in memory until we're done.
Fixes: ee05f45677 ("NFSv4: Fix races between open and delegreturn")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Turning caching off for writes on the server should improve performance.
Fixes: fba83f3411 ("NFS: Pass "privileged" value to nfs4_init_sequence()")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit 904cdbd41d ("jbd2: clear dirty flag when revoking a buffer from
an older transaction") set the BH_Freed flag when forgetting a metadata
buffer which belongs to the committing transaction, it indicate the
committing process clear dirty bits when it is done with the buffer. But
it also clear the BH_Mapped flag at the same time, which may trigger
below NULL pointer oops when block_size < PAGE_SIZE.
rmdir 1 kjournald2 mkdir 2
jbd2_journal_commit_transaction
commit transaction N
jbd2_journal_forget
set_buffer_freed(bh1)
jbd2_journal_commit_transaction
commit transaction N+1
...
clear_buffer_mapped(bh1)
ext4_getblk(bh2 ummapped)
...
grow_dev_page
init_page_buffers
bh1->b_private=NULL
bh2->b_private=NULL
jbd2_journal_put_journal_head(jh1)
__journal_remove_journal_head(hb1)
jh1 is NULL and trigger oops
*) Dir entry block bh1 and bh2 belongs to one page, and the bh2 has
already been unmapped.
For the metadata buffer we forgetting, we should always keep the mapped
flag and clear the dirty flags is enough, so this patch pick out the
these buffers and keep their BH_Mapped flag.
Link: https://lore.kernel.org/r/20200213063821.30455-3-yi.zhang@huawei.com
Fixes: 904cdbd41d ("jbd2: clear dirty flag when revoking a buffer from an older transaction")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
There is no need to delay the clearing of b_modified flag to the
transaction committing time when unmapping the journalled buffer, so
just move it to the journal_unmap_buffer().
Link: https://lore.kernel.org/r/20200213063821.30455-2-yi.zhang@huawei.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
DIR_INDEX has been introduced as a compat ext4 feature. That means that
even kernels / tools that don't understand the feature may modify the
filesystem. This works because for kernels not understanding indexed dir
format, internal htree nodes appear just as empty directory entries.
Index dir aware kernels then check the htree structure is still
consistent before using the data. This all worked reasonably well until
metadata checksums were introduced. The problem is that these
effectively made DIR_INDEX only ro-compatible because internal htree
nodes store checksums in a different place than normal directory blocks.
Thus any modification ignorant to DIR_INDEX (or just clearing
EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch
and trigger kernel errors. So we have to be more careful when dealing
with indexed directories on filesystems with checksumming enabled.
1) We just disallow loading any directory inodes with EXT4_INDEX_FL when
DIR_INDEX is not enabled. This is harsh but it should be very rare (it
means someone disabled DIR_INDEX on existing filesystem and didn't run
e2fsck), e2fsck can fix the problem, and we don't want to answer the
difficult question: "Should we rather corrupt the directory more or
should we ignore that DIR_INDEX feature is not set?"
2) When we find out htree structure is corrupted (but the filesystem and
the directory should in support htrees), we continue just ignoring htree
information for reading but we refuse to add new entries to the
directory to avoid corrupting it more.
Link: https://lore.kernel.org/r/20200210144316.22081-1-jack@suse.cz
Fixes: dbe8944404 ("ext4: Calculate and verify checksums for htree nodes")
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
A recent commit, 9803387c55 ("ext4: validate the
debug_want_extra_isize mount option at parse time"), moved mount-time
checks around. One of those changes moved the inode size check before
the blocksize variable was set to the blocksize of the file system.
After 9803387c55 was set to the minimum allowable blocksize, which
in practice on most systems would be 1024 bytes. This cuased file
systems with inode sizes larger than 1024 bytes to be rejected with a
message:
EXT4-fs (sdXX): unsupported inode size: 4096
Fixes: 9803387c55 ("ext4: validate the debug_want_extra_isize mount option at parse time")
Link: https://lore.kernel.org/r/20200206225252.GA3673@mit.edu
Reported-by: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Coverity reports that conditions checking quota limits in ext4_statfs()
contain dead code. Indeed it is right and current conditions can be
simplified.
Link: https://lore.kernel.org/r/20200130111148.10766-1-jack@suse.cz
Reported-by: Coverity <scan-admin@coverity.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Normal, synchronous requests will have their args allocated on the stack.
After the FR_FINISHED bit is set by receiving the reply from the userspace
fuse server, the originating task may return and reuse the stack frame,
resulting in an Oops if the args structure is dereferenced.
Fix by setting a flag in the request itself upon initializing, indicating
whether it has an asynchronous ->end() callback.
Reported-by: Kyle Sanderson <kyle.leet@gmail.com>
Reported-by: Michael Stapelberg <michael+lkml@stapelberg.ch>
Fixes: 2b319d1f6f ("fuse: don't dereference req->args on finished request")
Cc: <stable@vger.kernel.org> # v5.4
Tested-by: Michael Stapelberg <michael+lkml@stapelberg.ch>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
A number of the debug statements output file or directory mode
in hex. Change these to print using octal.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
If a dentry was not initially looked up while we were holding a
delegation, then we do still need to revalidate that it still holds
the same name. If there are multiple hard links to the same file,
then all the hard links need validation.
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
[Anna: Put nfs_unset_verifier_delegated() under CONFIG_NFS_V4]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Originally it was planned to create device id directories under
UUID/devinfo, but it got under UUID/devices by mistake. We really want
it under definfo so the bare device node names are not mixed with device
ids and are easy to enumerate.
Fixes: 668e48af7a ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Create directory /sys/fs/btrfs/UUID/devinfo to hold devices directories
by the id (unlike /devices).
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When there is a fiemap executing in parallel with a shrinking truncate
we can end up in a situation where we have extent maps for which we no
longer have corresponding file extent items. This is generally harmless
and at the moment the only consequences are missing file extent items
representing holes after we expand the file size again after the
truncate operation removed the prealloc extent items, and stale
information for future fiemap calls (reporting extents that no longer
exist or may have been reallocated to other files for example).
Consider the following example:
1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
and a 1MiB prealloc extent at file offset 128KiB;
2) Task A starts doing a shrinking truncate of our inode to reduce it to
a size of 64KiB. Before it searches the subvolume tree for file
extent items to delete, it drops all the extent maps in the range
from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();
3) Task B starts doing a fiemap against our inode. When looking up for
the inode's extent maps in the range from 128KiB to (u64)-1, it
doesn't find any in the inode's extent map tree, since they were
removed by task A. Because it didn't find any in the extent map
tree, it scans the inode's subvolume tree for file extent items, and
it finds the 1MiB prealloc extent at file offset 128KiB, then it
creates an extent map based on that file extent item and adds it to
inode's extent map tree (this ends up being done by
btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
get_extent_skip_holes());
4) Task A then drops the prealloc extent at file offset 128KiB and
shrinks the 128KiB extent file offset 0 to a length of 64KiB. The
truncation operation finishes and we end up with an extent map
representing a 1MiB prealloc extent at file offset 128KiB, despite we
don't have any more that extent;
After this the two types of problems we have are:
1) Future calls to fiemap always report that a 1MiB prealloc extent
exists at file offset 128KiB. This is stale information, no longer
correct;
2) If the size of the file is increased, by a truncate operation that
increases the file size or by a write into a file offset > 64KiB for
example, we end up not inserting file extent items to represent holes
for any range between 128KiB and 128KiB + 1MiB, since the hole
expansion function, btrfs_cont_expand() will skip hole insertion for
any range for which an extent map exists that represents a prealloc
extent. This causes fsck to complain about missing file extent items
when not using the NO_HOLES feature.
The second issue could be often triggered by test case generic/561 from
fstests, which runs fsstress and duperemove in parallel, and duperemove
does frequent fiemap calls.
Essentially the problems happens because fiemap does not acquire the
inode's lock while truncate does, and fiemap locks the file range in the
inode's iotree while truncate does not. So fix the issue by making
btrfs_truncate_inode_items() lock the file range from the new file size
to (u64)-1, so that it serializes with fiemap.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A remount to a read-write filesystem is not safe when there's tree-log
to be replayed. Files that could be opened until now might be affected
by the changes in the tree-log.
A regular mount is needed to replay the log so the filesystem presents
the consistent view with the pending changes included.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no logged information about tree-log replay although this is
something that points to previous unclean unmount. Other filesystems
report that as well.
Suggested-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few cases where we allow an extent map that is in an extent map
tree to be merged with other extents in the tree. Such cases include the
unpinning of an extent after the respective ordered extent completed or
after logging an extent during a fast fsync. This can lead to subtle and
dangerous problems because when doing the merge some other task might be
using the same extent map and as consequence see an inconsistent state of
the extent map - for example sees the new length but has seen the old start
offset.
With luck this triggers a BUG_ON(), and not some silent bug, such as the
following one in __do_readpage():
$ cat -n fs/btrfs/extent_io.c
3061 static int __do_readpage(struct extent_io_tree *tree,
3062 struct page *page,
(...)
3127 em = __get_extent_map(inode, page, pg_offset, cur,
3128 end - cur + 1, get_extent, em_cached);
3129 if (IS_ERR_OR_NULL(em)) {
3130 SetPageError(page);
3131 unlock_extent(tree, cur, end);
3132 break;
3133 }
3134 extent_offset = cur - em->start;
3135 BUG_ON(extent_map_end(em) <= cur);
(...)
Consider the following example scenario, where we end up hitting the
BUG_ON() in __do_readpage().
We have an inode with a size of 8KiB and 2 extent maps:
extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
a previous transaction
extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
persisted but writeback started for it already. The extent map
is pinned since there's writeback and an ordered extent in
progress, so it can not be merged with extent map A yet
The following sequence of steps leads to the BUG_ON():
1) The ordered extent for extent B completes, the respective page gets its
writeback bit cleared and the extent map is unpinned, at that point it
is not yet merged with extent map A because it's in the list of modified
extents;
2) Due to memory pressure, or some other reason, the MM subsystem releases
the page corresponding to extent B - btrfs_releasepage() is called and
returns 1, meaning the page can be released as it's not dirty, not under
writeback anymore and the extent range is not locked in the inode's
iotree. However the extent map is not released, either because we are
not in a context that allows memory allocations to block or because the
inode's size is smaller than 16MiB - in this case our inode has a size
of 8KiB;
3) Task B needs to read extent B and ends up __do_readpage() through the
btrfs_readpage() callback. At __do_readpage() it gets a reference to
extent map B;
4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B
while holding the write lock on the inode's extent map tree - this
results in try_merge_map() being called and since it's possible to merge
extent map B with extent map A now (the extent map B was removed from
the list of modified extents), the merging begins - it sets extent map
B's start offset to 0 (was 4KiB), but before it increments the map's
length to 8KiB (4kb + 4KiB), task A is at:
BUG_ON(extent_map_end(em) <= cur);
The call to extent_map_end() sees the extent map has a start of 0
and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
the BUG_ON() is triggered.
So it's dangerous to modify an extent map that is in the tree, because some
other task might have got a reference to it before and still using it, and
needs to see a consistent map while using it. Generally this is very rare
since most paths that lookup and use extent maps also have the file range
locked in the inode's iotree. The fsync path is pretty much the only
exception where we don't do it to avoid serialization with concurrent
reads.
Fix this by not allowing an extent map do be merged if if it's being used
by tasks other then the one attempting to merge the extent map (when the
reference count of the extent map is greater than 2).
Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_ref_tree_mod(), 'ref' and 'ra' are allocated through kzalloc() and
kmalloc(), respectively. In the following code, if an error occurs, the
execution will be redirected to 'out' or 'out_unlock' and the function will
be exited. However, on some of the paths, 'ref' and 'ra' are not
deallocated, leading to memory leaks. For example, if 'action' is
BTRFS_ADD_DELAYED_EXTENT, add_block_entry() will be invoked. If the return
value indicates an error, the execution will be redirected to 'out'. But,
'ref' is not deallocated on this path, causing a memory leak.
To fix the above issues, deallocate both 'ref' and 'ra' before exiting from
the function when an error is encountered.
CC: stable@vger.kernel.org # 4.15+
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- Fix RWF_NOWAIT writes to properly return -EAGAIN
- Clean up an unused helper
- Update dax_writeback_mapping_range to not need a block_device argument
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEf41QbsdZzFdA8EfZHtKRamZ9iAIFAl5DH00ACgkQHtKRamZ9
iAIPUA/+KHTrnE4Pb69d1r/wmdNOQE1vCpZxN606ozfzAMIfCgzw6i7V82xpFUyZ
VdQ1wcCSYLoPTQ+y/0Bk7SQlLjBx58c0ShaOTHmE2IRdYwzKMOBtwqy5vxGYWT/i
k2b44Xdwc4x8KMdajKCZlZOWNIM4BnlhE6nqnyL7Zv7J4BC71IKTCgh0WrriKvKe
t7IkYK6fLWx9Y64UhcIoXL9jDp1r5N/pJCjKaSfpS7gH+iqz/M3NaAwRfDr8UuIY
aHUUoPHP/cbwakQnYZtpxGaP1dzkKQ1FG3nL74Lp2XUOulivSbGa0fmw/7Vs18Of
M2d8/yKWMh0pElPdoh/2ORhHAcMOsIUCx3HRxMm1x6g293BkE96ESpbN/s62oo7H
uND3AOqnE79jxd6AqECHyYXGpxwlHah5HXZdCjU5b6rmNbz9YNpHGK8STwpa2ReL
AnYpWlDPjUkSMAD/rzwR7T6xh4TlzYQa2y6QR3HyffftPg6Dm0g4I8pBi0PVZYBB
4Whg8dLsiK73KZsjhraPaSFFDT42Btd6BHKLrMLckoVoIoyt3EB4FPwMQjCwXgqI
WySehmiMfEqXWUEKsYxBf+j0+ASC5ewIKuP1Ziqxa9kBel7964my7ariGn/mEAbo
yJ6Iyn+wEeLE5VhImZKxrBNqNPo6mripT6omJmgl2G2HZt+JfK4=
=ZA5Y
-----END PGP SIGNATURE-----
Merge tag 'dax-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax fixes from Dan Williams:
"A fix for an xfstest failure and some and an update that removes an
fsdax dependency on block devices.
Summary:
- Fix RWF_NOWAIT writes to properly return -EAGAIN
- Clean up an unused helper
- Update dax_writeback_mapping_range to not need a block_device
argument"
* tag 'dax-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: pass NOWAIT flag to iomap_apply
dax: Get rid of fs_dax_get_by_host() helper
dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()
For the old mount API, the module parameters parseing function will
be called in ceph_mount() and also just after the default posix acl
flag set, so we can control to enable/disable it via the mount option.
But for the new mount API, it will call the module parameters
parseing function before ceph_get_tree(), so the posix acl will always
be enabled.
Fixes: 82995cc6c5 ("libceph, rbd, ceph: convert to use the new mount API")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
syzbot reported that 4fbc0c711b ("ceph: remove the extra slashes in
the server path") had caused a regression where an allocation could be
done under a spinlock -- compare_mount_options() is called by sget_fc()
with sb_lock held.
We don't really need the supplied server path, so canonicalize it
in place and compare it directly. To make this work, the leading
slash is kept around and the logic in ceph_real_mount() to skip it
is restored. CEPH_MSG_CLIENT_SESSION now reports the same (i.e.
canonicalized) path, with the leading slash of course.
Fixes: 4fbc0c711b ("ceph: remove the extra slashes in the server path")
Reported-by: syzbot+98704a51af8e3d9425a9@syzkaller.appspotmail.com
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
In O_APPEND & O_DIRECT mode, the data from different writers will
be possibly overlapping each other since they take the shared lock.
For example, both Writer1 and Writer2 are in O_APPEND and O_DIRECT
mode:
Writer1 Writer2
shared_lock() shared_lock()
getattr(CAP_SIZE) getattr(CAP_SIZE)
iocb->ki_pos = EOF iocb->ki_pos = EOF
write(data1)
write(data2)
shared_unlock() shared_unlock()
The data2 will overlap the data1 from the same file offset, the
old EOF.
Switch to exclusive lock instead when O_APPEND is specified.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
I have an experimental setup where almost every possible system
service (even early startup ones) runs in separate namespace, using a
dedicated, minimal file system. In process of minimizing the contents
of the file systems with regards to modules and firmware files, I
noticed that in my system, the firmware files are loaded from three
different mount namespaces, those of systemd-udevd, init and
systemd-networkd. The logic of the source namespace is not very clear,
it seems to depend on the driver, but the namespace of the current
process is used.
So, this patch tries to make things a bit clearer and changes the
loading of firmware files only from the mount namespace of init. This
may also improve security, though I think that using firmware files as
attack vector could be too impractical anyway.
Later, it might make sense to make the mount namespace configurable,
for example with a new file in /proc/sys/kernel/firmware_config/. That
would allow a dedicated file system only for firmware files and those
need not be present anywhere else. This configurability would make
more sense if made also for kernel modules and /sbin/modprobe. Modules
are already loaded from init namespace (usermodehelper uses kthreadd
namespace) except when directly loaded by systemd-udevd.
Instead of using the mount namespace of the current process to load
firmware files, use the mount namespace of init process.
Link: https://lore.kernel.org/lkml/bb46ebae-4746-90d9-ec5b-fce4c9328c86@gmail.com/
Link: https://lore.kernel.org/lkml/0e3f7653-c59d-9341-9db2-c88f5b988c68@gmail.com/
Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
Link: https://lore.kernel.org/r/20200123125839.37168-1-toiwoton@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
No one checks the return value of debugfs_create_regset32(), as it's not
needed, so make the return value void, so that no one tries to do so in
the future.
Link: https://lore.kernel.org/r/20191122104453.GA2017837@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We want to make sure that we revalidate the dentry if and only if
we've done an OPEN by filename.
In order to avoid races with remote changes to the directory on the
server, we want to save the verifier before calling OPEN. The exception
is if the server returned a delegation with our OPEN, as we then
know that the filename can't have changed on the server.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@gmail.com>
Tested-by: Benjamin Coddington <bcodding@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In order to avoid having our dentry revalidation race with an update
of the directory on the server, we need to store the verifier before
the RPC calls to LOOKUP and READDIR.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@gmail.com>
Tested-by: Benjamin Coddington <bcodding@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Fix display for sec=krb5i which was wrongly interleaved by cruid,
resulting in string "sec=krb5,cruid=<...>i" instead of
"sec=krb5i,cruid=<...>".
Fixes: 96281b9e46 ("smb3: for kerberos mounts display the credential uid used")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Before this patch function check_journal_clean was in ops_fstype.c.
This patch moves it to util.c so we can make use of it elsewhere
in a future patch.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
When a node fails, user space informs dlm of the node failure,
and dlm instructs gfs2 on the surviving nodes to perform journal
recovery. It does this by calling various callback functions in
lock_dlm.c. To mark its progress, it keeps generation numbers
and recover bits in a dlm "control" lock lvb, which is seen by
all nodes to determine which journals need to be replayed.
The gfs2 on all nodes get the same recovery requests from dlm,
so they all try to do the recovery, but only one will be
granted the exclusive lock on the journal. The others fail
with a "Busy" message on their "try lock."
However, when a node is withdrawn, it cannot safely do any
recovery or replay any journals. To make matters worse,
gfs2 might withdraw as a result of attempting recovery. For
example, this might happen if the device goes offline, or if
an hba fails. But in today's gfs2 code, it doesn't check for
being withdrawn at any step in the recovery process. What's
worse is that these callbacks from dlm have no return code,
so there is no way to indicate failure back to dlm. We can
send a "Recovery failed" uevent eventually, but that tells
user space what happened, not dlm's kernel code.
Before this patch, lock_dlm would perform its recovery steps but
ignore the result, and eventually it would still update its
generation number in the lvb, despite the fact that it may have
withdrawn or encountered an error. The other nodes would then
see the newer generation number in the lvb and conclude that
they don't need to do recovery because the generation number
is newer than the last one they saw. They think a different
node has already recovered the journal.
This patch adds checks to several of the callbacks used by dlm
in its recovery state machine so that the functions are ignored
and skipped if an io error has occurred or if the file system
is withdrawn. That prevents the lvb bits from being updated, and
therefore dlm and user space still see the need for recovery to
take place.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, all io errors received by the quota daemon or the
logd daemon would cause a complaint message to be issued, such as:
gfs2: fsid=dm-13.0: Error 10 writing to journal, jid=0
This patch changes it so that the error message is only issued the
first time the error is encountered.
Also, before this patch function gfs2_end_log_write did not set the
sd_log_error value, so log errors would not cause the file system to
be withdrawn. This patch sets the error code so the file system is
properly withdrawn if an io error is encountered writing to the journal.
WARNING: This change in function breaks check xfstests generic/441
and causes it to fail: io errors writing to the log should cause a
file system to be withdrawn, and no further operations are tolerated.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, gfs2 kept track of journal io errors in two
places sd_log_error and the SDF_AIL1_IO_ERROR flag in sd_flags.
This patch consolidates the two into sd_log_error so that it
reflects the first error encountered writing to the journal.
In future patches, we will take advantage of this by checking
this value rather than having to check both when reacting to
io errors.
In addition, this fixes a tight loop in unmount: If buffers
get on the ail1 list and an io error occurs elsewhere, the
ail1 list would never be cleared because they were always busy.
So unmount would hang, waiting for the ail1 list to empty.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, the rgrp code had a serious problem related to
how it managed buffer_heads for resource groups. The problem caused
file system corruption, especially in cases of journal replay.
When an rgrp glock was demoted to transfer ownership to a
different cluster node, do_xmote() first calls rgrp_go_sync and then
rgrp_go_inval, as expected. When it calls rgrp_go_sync, that called
gfs2_rgrp_brelse() that dropped the buffer_head reference count.
In most cases, the reference count went to zero, which is right.
However, there were other places where the buffers are handled
differently.
After rgrp_go_sync, do_xmote called rgrp_go_inval which called
gfs2_rgrp_brelse a second time, then rgrp_go_inval's call to
truncate_inode_pages_range would get rid of the pages in memory,
but only if the reference count drops to 0.
Unfortunately, gfs2_rgrp_brelse was setting bi->bi_bh = NULL.
So when rgrp_go_sync called gfs2_rgrp_brelse, it lost the pointer
to the buffer_heads in cases where the reference count was still 1.
Therefore, when rgrp_go_inval called gfs2_rgrp_brelse a second time,
it failed the check for "if (bi->bi_bh)" and thus failed to call
brelse a second time. Because of that, the reference count on those
buffers sometimes failed to drop from 1 to 0. And that caused
function truncate_inode_pages_range to keep the pages in page cache
rather than freeing them.
The next time the rgrp glock was acquired, the metadata read of
the rgrp buffers re-used the pages in memory, which were now
wrong because they were likely modified by the other node who
acquired the glock in EX (which is why we demoted the glock).
This re-use of the page cache caused corruption because changes
made by the other nodes were never seen, so the bitmaps were
inaccurate.
For some reason, the problem became most apparent when journal
replay forced the replay of rgrps in memory, which caused newer
rgrp data to be overwritten by the older in-core pages.
A big part of the problem was that the rgrp buffer were released
in multiple places: The go_unlock function would release them when
the glock was released rather than when the glock is demoted,
which is clearly wrong because our intent was to cache them until
the glock is demoted from SH or EX.
This patch attempts to clean up the mess and make one consistent
and centralized mechanism for managing the rgrp buffer_heads by
implementing several changes:
1. It eliminates the call to gfs2_rgrp_brelse() from rgrp_go_sync.
We don't want to release the buffers or zero the pointers when
syncing for the reasons stated above. It only makes sense to
release them when the glock is actually invalidated (go_inval).
And when we do, then we set the bh pointers to NULL.
2. The go_unlock function (which was only used for rgrps) is
eliminated, as we've talked about doing many times before.
The go_unlock function was called too early in the glock dq
process, and should not happen until the glock is invalidated.
3. It also eliminates the call to rgrp_brelse in gfs2_clear_rgrpd.
That will now happen automatically when the rgrp glocks are
demoted, and shouldn't happen any sooner or later than that.
Instead, function gfs2_clear_rgrpd has been modified to demote
the rgrp glocks, and therefore, free those pages, before the
remaining glocks are culled by gfs2_gl_hash_clear. This
prevents the gl_object from hanging around when the glocks are
culled.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch fixes a bug in which function gfs2_log_flush can get into
an infinite loop when a gfs2 file system is withdrawn. The problem
is the infinite loop "for (;;)" in gfs2_log_flush which would never
finish because the io error and subsequent withdraw prevented the
items from being taken off the ail list.
This patch tries to clean up the mess by allowing withdraw situations
to move not-in-flight buffer_heads to the ail2 list, where they will
be dealt with later.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
File system withdraws can be delayed when inconsistencies are
discovered when we cannot withdraw immediately, for example, when
critical spin_locks are held. But delaying the withdraw can cause
gfs2 to ignore the error and keep running for a short period of time.
For example, an rgrp glock may be dequeued and demoted while there
are still buffers that haven't been properly revoked, due to io
errors writing to the journal.
This patch introduces a new concept of a pending withdraw, which
means an inconsistency has been discovered and we need to withdraw
at the earliest possible opportunity. In these cases, we aren't
quite withdrawn yet, but we still need to not dequeue glocks and
other critical things. If we dequeue the glocks and the withdraw
results in our journal being replayed, the replay could overwrite
data that's been modified by a different node that acquired the
glock in the meantime.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
The gfs2_assert functions only print messages when the filesystem hasn't been
withdrawn yet, and they indicate whether or not they've printed something in
their return value. However, none of the callers use that information, so
simply return whether or not the assert has failed.
(The gfs2_assert functions are still backwards; they return false when an
assertion is true.)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Change the various gfs2_consist functions to return void.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
These arguments are always passed as 0, and they are never evaluated.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
In gfs2_rgrp_verify and compute_bitstructs, make sure to report errors before
withdrawing the filesystem: otherwise, when we withdraw first and withdraw is
configured to panic, we'll never get to the error reporting.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Split gfs2_lm_withdraw into a function that prints an error message and a
function that withdraws the filesystem.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
- fix randconfig to generate a sane .config
- rename hostprogs-y / always to hostprogs / always-y, which are
more natual syntax.
- optimize scripts/kallsyms
- fix yes2modconfig and mod2yesconfig
- make multiple directory targets ('make foo/ bar/') work
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl47NfMVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGRGwP/3AHO8P0wGEeFKs3ziSMjs2W7/Pj
lN08Kuxm0u3LnyEEcHVUveoi+xBYqvrw0RsGgYf5S8q0Mpep7MPqbfkDUxV/0Zkj
QP2CsvOTbjdBjH7q3ojkwLcDl0Pxu9mg3eZMRXZ2WQeNXuMRw6Bicoh7ElvB1Bv/
HC+j30i2Me3cf/riQGSAsstvlXyIR8RaerR8PfRGESTysiiN76+JcHTatJHhOJL9
O6XKkzo8/CXMYKKVF4Ae4NP+WFg6E96/pAPx0Rf47RbPX9UG35L9rkzTDnk70Ms6
OhKiu3hXsRX7mkqApuoTqjge4+iiQcKZxYmMXU1vGlIRzjwg19/4YFP6pDSCcnIu
kKb8KN4o4N41N7MFS3OLZWwISA8Vw6RbtwDZ3AghDWb7EHb9oNW42mGfcAPr1+wZ
/KH6RHTzaz+5q2MgyMY1NhADFrhIT9CvDM+UJECgbokblnw7PHAnPmbsuVak9ZOH
u9ojO1HpTTuIYO6N6v4K5zQBZF1N+RvkmBnhHd8j6SksppsCoC/G62QxgXhF2YK3
FQMpATCpuyengLxWAmPEjsyyPOlrrdu9UxqNsXVy5ol40+7zpxuHwKcQKCa9urJR
rcpbIwLaBcLhHU4BmvBxUk5aZxxGV2F0O0gXTOAbT2xhd6BipZSMhUmN49SErhQm
NC/coUmQX7McxMXh
=sv4U
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- fix randconfig to generate a sane .config
- rename hostprogs-y / always to hostprogs / always-y, which are more
natual syntax.
- optimize scripts/kallsyms
- fix yes2modconfig and mod2yesconfig
- make multiple directory targets ('make foo/ bar/') work
* tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: make multiple directory targets work
kconfig: Invalidate all symbols after changing to y or m.
kallsyms: fix type of kallsyms_token_table[]
scripts/kallsyms: change table to store (strcut sym_entry *)
scripts/kallsyms: rename local variables in read_symbol()
kbuild: rename hostprogs-y/always to hostprogs/always-y
kbuild: fix the document to use extra-y for vmlinux.lds
kconfig: fix broken dependency in randconfig-generated .config
Zonefs is a very simple file system exposing each zone of a zoned block
device as a file.
Unlike a regular file system with native zoned block device support
(e.g. f2fs or the on-going btrfs effort), zonefs does not hide the
sequential write constraint of zoned block devices to the user. As a
result, zonefs is not a POSIX compliant file system. Its goal is to
simplify the implementation of zoned block devices support in
applications by replacing raw block device file accesses with a richer
file based API, avoiding relying on direct block device file ioctls
which may be more obscure to developers.
One example of this approach is the implementation of LSM
(log-structured merge) tree structures (such as used in RocksDB and
LevelDB) on zoned block devices by allowing SSTables to be stored in a
zone file similarly to a regular file system rather than as a range of
sectors of a zoned device. The introduction of the higher level
construct "one file is one zone" can help reducing the amount of changes
needed in the application while at the same time allowing the use of
zoned block devices with various programming languages other than C.
Zonefs IO management implementation uses the new iomap generic code.
Zonefs has been successfully tested using a functional test suite
(available with zonefs userland format tool on github) and a prototype
implementation of LevelDB on top of zonefs.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCXj1y8QAKCRDdoc3SxdoY
dqozAP9J3t+Q95BgKgI5jP+XEtyYsPBTaVrvaSaViEnwtJLVoQD/ZQ1lTCZSE9OI
UkvWawkuFtLGfOxTqyA3eZrZi22Ttwk=
=YVvO
-----END PGP SIGNATURE-----
Merge tag 'zonefs-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull new zonefs file system from Damien Le Moal:
"Zonefs is a very simple file system exposing each zone of a zoned
block device as a file.
Unlike a regular file system with native zoned block device support
(e.g. f2fs or the on-going btrfs effort), zonefs does not hide the
sequential write constraint of zoned block devices to the user. As a
result, zonefs is not a POSIX compliant file system. Its goal is to
simplify the implementation of zoned block devices support in
applications by replacing raw block device file accesses with a richer
file based API, avoiding relying on direct block device file ioctls
which may be more obscure to developers.
One example of this approach is the implementation of LSM
(log-structured merge) tree structures (such as used in RocksDB and
LevelDB) on zoned block devices by allowing SSTables to be stored in a
zone file similarly to a regular file system rather than as a range of
sectors of a zoned device. The introduction of the higher level
construct "one file is one zone" can help reducing the amount of
changes needed in the application while at the same time allowing the
use of zoned block devices with various programming languages other
than C.
Zonefs IO management implementation uses the new iomap generic code.
Zonefs has been successfully tested using a functional test suite
(available with zonefs userland format tool on github) and a prototype
implementation of LevelDB on top of zonefs"
* tag 'zonefs-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: Add documentation
fs: New zonefs file system
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl49bNsACgkQiiy9cAdy
T1EGlQwArDJiHUV7W/WaoDZnusPPQqUT3ayqAHL0P8cDsjxLu3uNMkUISr0HdbxC
kqYahSTb+/BKQzoZhVe5wK3S8W6R8+wyaPJExRCL3brlIHVP/eC9uUjSgkT6QVDl
/vZCwxj7KmTK/S+ofji/XTl2f8f8BCw2biGVxwR2Jj5pwKI4wFIMFm7mDetTQRD4
bK0UR2Owiw4DpPXdwHlXPf9N06z0ETa1UdMXklIBgeK9B1eT1STD9q/iHJh3bLpO
klhbiq5eGRCcs9cBVTQcn6U+zGYBOcdJuhPGbAObEU+R2vNX06clydKlKy1oz1VL
4jbVVn9xuGZ9evFBC3h7Na1X7C3V28WcpfeRfFxZ157hNuQSNo5wiq0rF66EQ14U
hbmlx2S2ooyNKcnrj46SUw9zVLZ0xcx1Mw7kmoyHgI/vznW9fvV0Y2JXawJMPei5
VuQTgDLFsvnIIrUnrGBu2UXMzXghxLZ3SXJVKXuW3luvNRk82RAGHmIdty3OTgPp
DN9lhGvv
=F1qf
-----END PGP SIGNATURE-----
Merge tag '5.6-rc-smb3-plugfest-patches' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"13 cifs/smb3 patches, most from testing at the SMB3 plugfest this week:
- Important fix for multichannel and for modefromsid mounts.
- Two reconnect fixes
- Addition of SMB3 change notify support
- Backup tools fix
- A few additional minor debug improvements (tracepoints and
additional logging found useful during testing this week)"
* tag '5.6-rc-smb3-plugfest-patches' of git://git.samba.org/sfrench/cifs-2.6:
smb3: Add defines for new information level, FileIdInformation
smb3: print warning once if posix context returned on open
smb3: add one more dynamic tracepoint missing from strict fsync path
cifs: fix mode bits from dir listing when mounted with modefromsid
cifs: fix channel signing
cifs: add SMB3 change notification support
cifs: make multichannel warning more visible
cifs: fix soft mounts hanging in the reconnect code
cifs: Add tracepoints for errors on flush or fsync
cifs: log warning message (once) if out of disk space
cifs: fail i/o on soft mounts if sessionsetup errors out
smb3: fix problem with null cifs super block with previous patch
SMB3: Backup intent flag missing from some more ops
Pull vboxfs from Al Viro:
"This is the VirtualBox guest shared folder support by Hans de Goede,
with fixups for fs_parse folded in to avoid bisection hazards from
those API changes..."
* 'work.vboxsf' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Add VirtualBox guest shared folder (vboxsf) support
Jonas reports that he sometimes sees -97/-22 error returns from
sendmsg, if it gets punted async. This is due to not retaining the
sockaddr_storage between calls. Include that in the state we copy when
going async.
Cc: stable@vger.kernel.org # 5.3+
Reported-by: Jonas Bonn <jonas@norrbonn.se>
Tested-by: Jonas Bonn <jonas@norrbonn.se>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normally we cancel all work we track, but for untracked work we could
leave the async worker behind until that work completes. This is totally
fine, but does leave resources pending after the task is gone until that
work completes.
Cancel work that this task queued up when it goes away.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper that allows the caller to cancel work based on what mm
it belongs to. This allows io_uring to cancel work from a given
task or thread when it exits.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We want to use the cancel functionality for canceling based on not
just the work itself. Instead of matching on the work address
manually, allow a match handler to tell us if we found the right work
item or not.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
VirtualBox hosts can share folders with guests, this commit adds a
VFS driver implementing the Linux-guest side of this, allowing folders
exported by the host to be mounted under Linux.
This driver depends on the guest <-> host IPC functions exported by
the vboxguest driver.
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
One patch in the compat-ioctl series broke 32-bit rootfs for multiple
people testing on 64-bit kernels. Let's fix it in -rc1 before others
run into the same issue.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJePyKSAAoJEGCrR//JCVInFMEP/0fH0rI/M2xL4neSjp50qV7Q
M86gLTmatgNFLZGdC+xbaImaxpZHEMIF93JgOzN+0IRfdc7MbD23JEm5yzCZEc2I
NCAsmjcoA+mq8ntOl8o2J1pgLpVOe4BoBRvAGYFIM2kS63JXloCi9mD/3svDJF8C
WVPOskzoNT1pO9mRUKsAE740qdI86US/ksvrGOQQHVFVm3Iwm3srot4OBQYRaErw
bVEl8vqSOEKtXQwk3r7QcW9Bi83WZuqTbkLxqTtWm/U/8JreA82qH1/6zDbKj7d5
IH/J8D4FJ2LLRP5860bLj0vGl5mKBEJaou2S8ak7sRNu4K8DZJlTYu7YNMTgocQE
kbNylNqftKastVRGS1EPmOPdam8PiPlu3sxOY1X1TKO5+4jtb1WcqtFqmUkM5/Jk
C4iMDXNHXmHcC+ORrjchpIZs53nkfSAdeUzQ7xxwZlfaHD3DXnsFi5DHEZ9PWf9a
UlCOaWFLXRqdso30iobaaLa6JKa50Znjiqfsh6mbUuU9tmDnTffaHMy9eFhfMjqH
eP3acYXkAyK2toHuE6l5qaM6mOlYfrooCw/75CfQB0CR3e5tmXXLdFA319/OHRwX
m+tFby6Fw4KGGZWzXls54rZNa3ZmODj6N3Ymx1lKZfBVwblLNGHJYYVPP1ZUBCTS
43oommoGHO6/roOjLl3j
=uMS2
-----END PGP SIGNATURE-----
Merge tag 'compat-ioctl-fix' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull compat-ioctl fix from Arnd Bergmann:
"One patch in the compat-ioctl series broke 32-bit rootfs for multiple
people testing on 64-bit kernels. Let's fix it in -rc1 before others
run into the same issue"
* tag 'compat-ioctl-fix' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground:
compat_ioctl: fix FIONREAD on devices
Pull vfs file system parameter updates from Al Viro:
"Saner fs_parser.c guts and data structures. The system-wide registry
of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
the horror switch() in fs_parse() that would have to grow another case
every time something got added to that system-wide registry.
New syntax types can be added by filesystems easily now, and their
namespace is that of functions - not of system-wide enum members. IOW,
they can be shared or kept private and if some turn out to be widely
useful, we can make them common library helpers, etc., without having
to do anything whatsoever to fs_parse() itself.
And we already get that kind of requests - the thing that finally
pushed me into doing that was "oh, and let's add one for timeouts -
things like 15s or 2h". If some filesystem really wants that, let them
do it. Without somebody having to play gatekeeper for the variants
blessed by direct support in fs_parse(), TYVM.
Quite a bit of boilerplate is gone. And IMO the data structures make a
lot more sense now. -200LoC, while we are at it"
* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
tmpfs: switch to use of invalfc()
cgroup1: switch to use of errorfc() et.al.
procfs: switch to use of invalfc()
hugetlbfs: switch to use of invalfc()
cramfs: switch to use of errofc() et.al.
gfs2: switch to use of errorfc() et.al.
fuse: switch to use errorfc() et.al.
ceph: use errorfc() and friends instead of spelling the prefix out
prefix-handling analogues of errorf() and friends
turn fs_param_is_... into functions
fs_parse: handle optional arguments sanely
fs_parse: fold fs_parameter_desc/fs_parameter_spec
fs_parser: remove fs_parameter_description name field
add prefix to fs_context->log
ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
new primitive: __fs_parse()
switch rbd and libceph to p_log-based primitives
struct p_log, variants of warnf() et.al. taking that one instead
teach logfc() to handle prefices, give it saner calling conventions
get rid of cg_invalf()
...
Pull misc vfs updates from Al Viro:
- bmap series from cmaiolino
- getting rid of convolutions in copy_mount_options() (use a couple of
copy_from_user() instead of the __get_user() crap)
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
saner copy_mount_options()
fibmap: Reject negative block numbers
fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
ecryptfs: drop direct calls to ->bmap
cachefiles: drop direct usage of ->bmap method.
fs: Enable bmap() function to properly return errors
As in the previous patch, make openat*_prep() and statx_prep() handle
double preparation to avoid resource leakage.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Requests may be prepared multiple times with ->io allocated (i.e. async
prepared). Preparation functions don't handle it and forget about
previously allocated resources. This may happen in case of:
- spurious defer_check
- non-head (i.e. async prepared) request executed in sync (via nxt).
Make the handlers check, whether they already allocated resources, which
is true IFF REQ_F_NEED_CLEANUP is set.
Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
First, io_close() misses filp_close() and io_cqring_add_event(), when
f_op->flush is defined. That's because in this case it will
io_queue_async_work() itself not grabbing files, so the corresponding
chunk in io_close_finish() won't be executed.
Second, when submitted through io_wq_submit_work(), it will do
filp_close() and *_add_event() twice: first inline in io_close(),
and the second one in call to io_close_finish() from io_close().
The second one will also fire, because it was submitted async through
generic path, and so have grabbed files.
And the last nice thing is to remove this weird pilgrimage with checking
work/old_work and casting it to nxt. Just use a helper instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This passes it in to io-wq, so it assumes the right fs_struct when
executing async work that may need to do lookups.
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some work items need this for relative path lookup, make it available
like the other inherited credentials/mm/etc.
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For non-blocking issue, we set IOCB_NOWAIT in the kiocb. However, on a
raw block device, this yields an -EOPNOTSUPP return, as non-blocking
writes aren't supported. Turn this -EOPNOTSUPP into -EAGAIN, so we retry
from blocking context with IOCB_NOWAIT cleared.
Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Jens Axboe <axboe@kernel.dk>
openat() and statx() may have allocated ->open.filename, which should be
be put. Add cleanup handlers for them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allocated iovec is freed only in io_{read,write,send,recv)(), and just
leaves it if an error occured. There are plenty of such cases:
- cancellation of non-head requests
- fail grabbing files in __io_queue_sqe()
- set REQ_F_NOWAIT and returning in __io_queue_sqe()
Add REQ_F_NEED_CLEANUP, which will force such requests with custom
allocated resourses go through cleanup handlers on put.
Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In io_uring_poll() we must flush overflowed CQ events before to
check if there are CQ events available, to avoid missing events.
We call the io_cqring_events() that checks and flushes any overflow
and returns the number of CQ events available.
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All of these opcodes take a directory file descriptor. We can't easily
support fixed files for these operations, and the use case for that
probably isn't all that clear (or sensible) anyway.
Disable IOSQE_FIXED_FILE for these operations.
Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge thundering herd avoidance on pipe IO.
This would have been applied for 5.5 already, but got delayed because of
a user-space race condition in the GNU make jobserver code. Now that
there's a new GNU make 4.3 release, and most distributions seem to have
at least applied the (almost three year old) fix for the problem, let's
see if people notice.
And it might have been just bad random timing luck on my machine.
If you do hit the race condition, things will still work, but the
symptom is that you don't get nearly the expected parallelism when using
"make -j<N>".
The jobserver bug can definitely happen without this patch too, but
seems to be easier to trigger when we no longer wake up pipe waiters
unnecessarily.
* pipe-exclusive-wakeup:
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
My final cleanup patch for sys_compat_ioctl() introduced a regression on
the FIONREAD ioctl command, which is used for both regular and special
files, but only works on regular files after my patch, as I had missed
the warning that Al Viro put into a comment right above it.
Change it back so it can work on any file again by moving the implementation
to do_vfs_ioctl() instead.
Fixes: 77b9040195 ("compat_ioctl: simplify the implementation")
Reported-and-tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Reported-and-tested-by: youling257 <youling257@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXj182AAKCRDh3BK/laaZ
PIiWAQCprdMIBe0u9Rd9cqQYXClOI7PI9oIcpLmkIlHDuUWDgQD/Y4c1UMsN8yQY
d8cYZXMivKKhyY2nRitR1mbv0RPoGwE=
=8hFo
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
- Fix a regression introduced in v5.1 that triggers WARNINGs for some
fuse filesystems
- Fix an xfstest failure
- Allow overlayfs to be used on top of fuse/virtiofs
- Code and documentation cleanups
* tag 'fuse-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: use true,false for bool variable
Documentation: filesystems: convert fuse to RST
fuse: Support RENAME_WHITEOUT flag
fuse: don't overflow LLONG_MAX with end offset
fix up iter on short count in fuse_direct_io()
- Fix a bug in Abhi Das's journal head lookup improvements that can cause a
valid journal to be rejected.
- Fix an O_SYNC write handling bug reported by Christoph Hellwig.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl49TKgUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqG4RAAmXJfWI7UskM3ZRSVivbBwc0R3P9h
PgdgeXlr/k4mJ2I421/btln614l4nD+JuG/FplOvj2m3YCzeKj2W+OiSxqahmwGm
7yGb0sFHjx6wZcID37mig5aqo2MJYI0xTAR6rSI2tAA7E/B7KX8SfFIS2EWB2vLz
+rzCawqUGWKCiS/tZTe6JlB0Aeg20oy0Y00p3gewMN0ILNqy0w9kA+LVOGQYyUmW
rrad638czIXi7kugWvNo82vcoU140m3A6OTaINf3EaT8AmOtw+e4qyej+f82BiVS
RIXWKI+uRSfKFE+aYkwTxQn0BCMor63QIs3aaDXyLZBqnhTywRsckK6O7iBLFVDb
NQc86wxiHzDoWubXstV0lrlel5m5dHT0dUq8mVogj4kOtwOvejiDRopqtdGkdxA5
j8zFV6O8BRPUw5g7MS37n9myuTNRj5q3L2vmN2xZ6fZmHykorwbwLbw0WMeAQ/4i
pQh4NGE6h/lSPZIa/ZVCH8hwj2b41ZtAQK3k8G5IhNUhMHNkU8iynTbddoHFPsOF
67hj/aAycDx2NA6j/suS8PGt7PPU+VwwoDLKIG64bTHFTpwnRDNzZDt9wPIB9sGu
P4QlSQYPrLf/u8+TD8yv2VJsyAEMFl1UQWbxqf6EzLtKjRM/TIoQsdRUF7upqUnN
k5CM8FgmfYpIiFQ=
=+6PA
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
- Fix a bug in Abhi Das's journal head lookup improvements that can
cause a valid journal to be rejected.
- Fix an O_SYNC write handling bug reported by Christoph Hellwig.
* tag 'gfs2-for-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: fix O_SYNC write handling
gfs2: move setting current->backing_dev_info
gfs2: fix gfs2_find_jhead that returns uninitialized jhead with seq 0
Vasliy Averin noticed that "if seq_file .next function does not change
position index, read after some lseek can generate unexpected output."
and sent in this fix.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJePJJTAAoJEM9EDqnrzg2+S8YQAIYo2nEphfAjWufSYdPwVY0f
OQ3xAwX69gpzlQPwNTcTmoxQ/A/E8r2lwOmL5H7wYPlqUHtDjiS5EAhnCvtCIRS1
mHR35qsdlW0o4yKWQtYnfcuY6p3ucSASBaKYGO5NrKuVdsqASJT/aFTixVFmO8V2
eo8VZlhwYARTgGWJQpvIvSbQJRi6BDnE3kWimuRdq5TMnH0piMm6Vpef2P5vTaE3
L5D63iCuJbg7nBR4KS2Mc77Ou4lj3RCQGBQBz1hRGs0zJbqNrlExGaCE+FkabJ92
WWITYrh4DaA+wAhmkMbxh8g7DThZFkkK+fsNOZIlSVry3wFuZ1E+6UFUHVICrgxG
y99wGKV816SfbKD0DZ9cer6slMnNXRRNWzbXqOzey85UmNQSuCNZrVokM8TE1Pve
q7OOxoqg6N/S46gc+3nrv0kNZ9eg1rh4jyvCdtL4we1AgPPb6O+ihuxlrj9oFPlW
zo9wHn7Zt1xu+jtQJxR5tChpD5FmO1im0LOYeaJHv76zhOakRTcQ19OnQMUlQ/gk
ez+5T5LxfmYUHew6kTnGvvp76zZF5QkylsUVOLPke8KChXvLP06ib9Nuya7lBX7+
WtvLKVRNYMmwose1HsYjQo2DgD0p2fdJhEe61Q5VDFCBI8X8wkv7fV+Hm2ud3/hz
b9RfP2CnwJbCKzHTolKE
=tjs3
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs fix from Mike Marshall:
"Debugfs fix for orangefs.
Vasliy Averin noticed that 'if seq_file .next function does not change
position index, read after some lseek can generate unexpected output'
and sent in this fix"
* tag 'for-linus-5.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
help_next should increase position index
- Server-to-server copy code from Olga. To use it, client and
both servers must have support, the target server must be able
to access the source server over NFSv4.2, and the target
server must have the inter_copy_offload_enable module
parameter set.
- Improvements and bugfixes for the new filehandle cache,
especially in the container case, from Trond
- Also from Trond, better reporting of write errors.
- Y2038 work from Arnd.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl490mAVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+HkkP/33CsYXp0wvfNrfxCY3zHRxHpfw+
T9Ownxxw0RAJc/dRluC/2PIKJ20uVqtLrplU63bMBqJn84WF7OALq9twZ79a3fVF
mvdmnZbNq9B3ncKJlT7akkEelyJCRap7NgG/oTyubE8MlPl6gKpD8c+G7XdW/uN+
r0fprQz4rW4CYCBGSHq7HusEKqY4Gw+gbyAfJ6A79TMjF1ei51PG+9c8rkIsI5CO
1TQ3gY1gSJmGf2DoF86Q9WTVb+DvRTEs+t7QkxY/Vlo+QXY8CZyu+qSxN7i/F20m
gv2GrSpQMS9DEK/ZaG6cxaH+sM18Db4KLvcl3koL6lONHDR2OafSdKLyy0I60jhO
WfDSHhfDCrAdASTjNlTPrjBrdK3gafiaJVL9vy901ZJjPaNb3EH0nMQ5bEvOBECq
TCqPcQUcbku+qUVIcFwzSK1hXQFQHNh8WIuqXvNviZIzFDoipwsHVnQK02Owj89L
R2tbZue1O8voacg/9xw3tWAT7pI+SaBb0EvJuqRxBshiZEU8kKKtMchOwSECRDcu
k4lcqC5EFW7e4EzGlr6Wx8sI5lwCapva8ccjmPXX+R/vyM81oxWGB84GqWjjwubH
3Fcok23F9rW2IQJkqgPlNj/9hAjTn2+vM13UbfMlnchGNsQ2gbkc5CDGC/J6Wwpo
tHVristV9Gu5bJym
=FxLY
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Highlights:
- Server-to-server copy code from Olga.
To use it, client and both servers must have support, the target
server must be able to access the source server over NFSv4.2, and
the target server must have the inter_copy_offload_enable module
parameter set.
- Improvements and bugfixes for the new filehandle cache, especially
in the container case, from Trond
- Also from Trond, better reporting of write errors.
- Y2038 work from Arnd"
* tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linux: (55 commits)
sunrpc: expiry_time should be seconds not timeval
nfsd: make nfsd_filecache_wq variable static
nfsd4: fix double free in nfsd4_do_async_copy()
nfsd: convert file cache to use over/underflow safe refcount
nfsd: Define the file access mode enum for tracing
nfsd: Fix a perf warning
nfsd: Ensure sampling of the write verifier is atomic with the write
nfsd: Ensure sampling of the commit verifier is atomic with the commit
sunrpc: clean up cache entry add/remove from hashtable
sunrpc: Fix potential leaks in sunrpc_cache_unhash()
nfsd: Ensure exclusion between CLONE and WRITE errors
nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()
nfsd: Update the boot verifier on stable writes too.
nfsd: Fix stable writes
nfsd: Allow nfsd_vfs_write() to take the nfsd_file as an argument
nfsd: Fix a soft lockup race in nfsd_file_mark_find_or_create()
nfsd: Reduce the number of calls to nfsd_file_gc()
nfsd: Schedule the laundrette regularly irrespective of file errors
nfsd: Remove unused constant NFSD_FILE_LRU_RESCAN
nfsd: Containerise filecache laundrette
...
Stable bugfixes:
- Fix memory leaks and corruption in readdir # v2.6.37+
- Directory page cache needs to be locked when read # v2.6.37+
New features:
- Convert NFS to use the new mount API
- Add "softreval" mount option to let clients use cache if server goes down
- Add a config option to compile without UDP support
- Limit the number of inactive delegations the client can cache at once
- Improved readdir concurrency using iterate_shared()
Other bugfixes and cleanups:
- More 64-bit time conversions
- Add additional diagnostic tracepoints
- Check for holes in swapfiles, and add dependency on CONFIG_SWAP
- Various xprtrdma cleanups to prepare for 5.7's changes
- Several fixes for NFS writeback and commit handling
- Fix acls over krb5i/krb5p mounts
- Recover from premature loss of openstateids
- Fix NFS v3 chacl and chmod bug
- Compare creds using cred_fscmp()
- Use kmemdup_nul() in more places
- Optimize readdir cache page invalidation
- Lease renewal and recovery fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl48kMUACgkQ18tUv7Cl
QOs/bA/+KAHaee+1jWdgRS88CnNDfeokU2sGWuyXWrVTmiKZ+IjnIUIWqmeKhVyg
RTbaG4PGTIwiLDFibgzdnc3cTOQEgLnVGWWZ50Xh3b7ubock7+/4JHxqZS+/f3vf
yqwM0dZaXi5Kcx1kEJ+niBxuzkc9mFI+nHh+wLIlin/kaaUdLKu7mP3NXj2cmWxN
NoRaKc2gEvkPHhPSH4Z1DVXTHxvH2REFvt9APPUgfLfqcUVHV9b7V/wI/roiGWMn
53h6f38IdqoNQIpzMog/k/va67NLmEvUZOlpCYPyanPOjuxTrmi8iC2S6gLEOjtc
GGnQnc5skVL31seFR1NbOJiiN3hTLTncnoXza0cKtYxmo7a/FjXApw4jCu3Rkrav
UXpCI4O6+2AVVG+pEPbjQy3/GEImeoGvp+xr57jBSZBHoDZU9LDwag65qvZ1btIq
KOBx2gweQz0aB2heXmfee7qzxFdftHmtMWhIMnJASKNuAWGL23Scqem+d97i2T6H
7y9OJ3aOXiYxFMLYJCsLWjUJxYiaIANNBmHMjf27mZzcdDuxGFms277CMpNPr3SU
WZk6/oKw9jaRSzHzaKgVDXiULLXQE1/xZ/mvgR/zk1QAusyeXPvVnMdxoRdxFdXb
QGZHgUqvFvYi8Lufvs+ZLGS4sAp7oD/Q+lNPXn7cniSwfY4uJiw=
=b6+F
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Puyll NFS client updates from Anna Schumaker:
"Stable bugfixes:
- Fix memory leaks and corruption in readdir # v2.6.37+
- Directory page cache needs to be locked when read # v2.6.37+
New features:
- Convert NFS to use the new mount API
- Add "softreval" mount option to let clients use cache if server goes down
- Add a config option to compile without UDP support
- Limit the number of inactive delegations the client can cache at once
- Improved readdir concurrency using iterate_shared()
Other bugfixes and cleanups:
- More 64-bit time conversions
- Add additional diagnostic tracepoints
- Check for holes in swapfiles, and add dependency on CONFIG_SWAP
- Various xprtrdma cleanups to prepare for 5.7's changes
- Several fixes for NFS writeback and commit handling
- Fix acls over krb5i/krb5p mounts
- Recover from premature loss of openstateids
- Fix NFS v3 chacl and chmod bug
- Compare creds using cred_fscmp()
- Use kmemdup_nul() in more places
- Optimize readdir cache page invalidation
- Lease renewal and recovery fixes"
* tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (93 commits)
NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewals
NFSv4: try lease recovery on NFS4ERR_EXPIRED
NFS: Fix memory leaks
nfs: optimise readdir cache page invalidation
NFS: Switch readdir to using iterate_shared()
NFS: Use kmemdup_nul() in nfs_readdir_make_qstr()
NFS: Directory page cache pages need to be locked when read
NFS: Fix memory leaks and corruption in readdir
SUNRPC: Use kmemdup_nul() in rpc_parse_scope_id()
NFS: Replace various occurrences of kstrndup() with kmemdup_nul()
NFSv4: Limit the total number of cached delegations
NFSv4: Add accounting for the number of active delegations held
NFSv4: Try to return the delegation immediately when marked for return on close
NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returned
NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNING
NFS: nfs_find_open_context() should use cred_fscmp()
NFS: nfs_access_get_cached_rcu() should use cred_fscmp()
NFSv4: pnfs_roc() must use cred_fscmp() to compare creds
NFS: remove unused macros
nfs: Return EINVAL rather than ERANGE for mount parse errors
...
Don't bother with "mixed" options that would allow both the
form with and without argument (i.e. both -o foo and -o foo=bar).
Rather than trying to shove both into a single fs_parameter_spec,
allow having with-argument and no-argument specs with the same
name and teach fs_parse to handle that.
There are very few options of that sort, and they are actually
easier to handle that way - callers end up with less postprocessing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Unused now.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... turning it into struct p_log embedded into fs_context. Initialize
the prefix with fs_type->name, turning fs_parse() into a trivial
inline wrapper for __fs_parse().
This makes fs_parameter_description->name completely unused.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and now errorf() et.al. are never called with NULL fs_context,
so we can get rid of conditional in those.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs_parse() analogue taking p_log instead of fs_context.
fs_parse() turned into a wrapper, callers in ceph_common and rbd
switched to __fs_parse().
As the result, fs_parse() never gets NULL fs_context and neither
do fs_context-based logging primitives
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Its behaviour is identical to that of fs_value_is_filename.
It makes no sense, anyway - LOOKUP_EMPTY affects nothing
whatsoever once the pathname has been imported from userland.
And both fs_value_is_filename and fs_value_is_filename_empty
carry an already imported pathname.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Have the arrays of constant_table self-terminated (by NULL ->name
in the final entry). Simplifies lookup_constant() and allows to
reuse the search for enum params as well.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix sparse warning:
fs/nfsd/filecache.c:55:25: warning:
symbol 'nfsd_filecache_wq' was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
zonefs is a very simple file system exposing each zone of a zoned block
device as a file. Unlike a regular file system with zoned block device
support (e.g. f2fs), zonefs does not hide the sequential write
constraint of zoned block devices to the user. Files representing
sequential write zones of the device must be written sequentially
starting from the end of the file (append only writes).
As such, zonefs is in essence closer to a raw block device access
interface than to a full featured POSIX file system. The goal of zonefs
is to simplify the implementation of zoned block device support in
applications by replacing raw block device file accesses with a richer
file API, avoiding relying on direct block device file ioctls which may
be more obscure to developers. One example of this approach is the
implementation of LSM (log-structured merge) tree structures (such as
used in RocksDB and LevelDB) on zoned block devices by allowing SSTables
to be stored in a zone file similarly to a regular file system rather
than as a range of sectors of a zoned device. The introduction of the
higher level construct "one file is one zone" can help reducing the
amount of changes needed in the application as well as introducing
support for different application programming languages.
Zonefs on-disk metadata is reduced to an immutable super block to
persistently store a magic number and optional feature flags and
values. On mount, zonefs uses blkdev_report_zones() to obtain the device
zone configuration and populates the mount point with a static file tree
solely based on this information. E.g. file sizes come from the device
zone type and write pointer offset managed by the device itself.
The zone files created on mount have the following characteristics.
1) Files representing zones of the same type are grouped together
under a common sub-directory:
* For conventional zones, the sub-directory "cnv" is used.
* For sequential write zones, the sub-directory "seq" is used.
These two directories are the only directories that exist in zonefs.
Users cannot create other directories and cannot rename nor delete
the "cnv" and "seq" sub-directories.
2) The name of zone files is the number of the file within the zone
type sub-directory, in order of increasing zone start sector.
3) The size of conventional zone files is fixed to the device zone size.
Conventional zone files cannot be truncated.
4) The size of sequential zone files represent the file's zone write
pointer position relative to the zone start sector. Truncating these
files is allowed only down to 0, in which case, the zone is reset to
rewind the zone write pointer position to the start of the zone, or
up to the zone size, in which case the file's zone is transitioned
to the FULL state (finish zone operation).
5) All read and write operations to files are not allowed beyond the
file zone size. Any access exceeding the zone size is failed with
the -EFBIG error.
6) Creating, deleting, renaming or modifying any attribute of files and
sub-directories is not allowed.
7) There are no restrictions on the type of read and write operations
that can be issued to conventional zone files. Buffered, direct and
mmap read & write operations are accepted. For sequential zone files,
there are no restrictions on read operations, but all write
operations must be direct IO append writes. mmap write of sequential
files is not allowed.
Several optional features of zonefs can be enabled at format time.
* Conventional zone aggregation: ranges of contiguous conventional
zones can be aggregated into a single larger file instead of the
default one file per zone.
* File ownership: The owner UID and GID of zone files is by default 0
(root) but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be
changed.
The mkzonefs tool is used to format zoned block devices for use with
zonefs. This tool is available on Github at:
git@github.com:damien-lemoal/zonefs-tools.git.
zonefs-tools also includes a test suite which can be run against any
zoned block device, including null_blk block device created with zoned
mode.
Example: the following formats a 15TB host-managed SMR HDD with 256 MB
zones with the conventional zones aggregation feature enabled.
$ sudo mkzonefs -o aggr_cnv /dev/sdX
$ sudo mount -t zonefs /dev/sdX /mnt
$ ls -l /mnt/
total 0
dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv
dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
The size of the zone files sub-directories indicate the number of files
existing for each type of zones. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a
single file).
$ ls -l /mnt/cnv
total 137101312
-rw-r----- 1 root root 140391743488 Nov 25 13:23 0
This aggregated conventional zone file can be used as a regular file.
$ sudo mkfs.ext4 /mnt/cnv/0
$ sudo mount -o loop /mnt/cnv/0 /data
The "seq" sub-directory grouping files for sequential write zones has
in this example 55356 zones.
$ ls -lv /mnt/seq
total 14511243264
-rw-r----- 1 root root 0 Nov 25 13:23 0
-rw-r----- 1 root root 0 Nov 25 13:23 1
-rw-r----- 1 root root 0 Nov 25 13:23 2
...
-rw-r----- 1 root root 0 Nov 25 13:23 55354
-rw-r----- 1 root root 0 Nov 25 13:23 55355
For sequential write zone files, the file size changes as data is
appended at the end of the file, similarly to any regular file system.
$ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
The written file can be truncated to the zone size, preventing any
further write operation.
$ truncate -s 268435456 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
Truncation to 0 size allows freeing the file zone storage space and
restart append-writes to the file.
$ truncate -s 0 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
Since files are statically mapped to zones on the disk, the number of
blocks of a file as reported by stat() and fstat() indicates the size
of the file zone.
$ stat /mnt/seq/0
File: /mnt/seq/0
Size: 0 Blocks: 524288 IO Block: 4096 regular empty file
Device: 870h/2160d Inode: 50431 Links: 1
Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2019-11-25 13:23:57.048971997 +0900
Modify: 2019-11-25 13:52:25.553805765 +0900
Change: 2019-11-25 13:52:25.553805765 +0900
Birth: -
The number of blocks of the file ("Blocks") in units of 512B blocks
gives the maximum file size of 524288 * 512 B = 256 MB, corresponding
to the device zone size in this example. Of note is that the "IO block"
field always indicates the minimum IO size for writes and corresponds
to the device physical sector size.
This code contains contributions from:
* Johannes Thumshirn <jthumshirn@suse.de>,
* Darrick J. Wong <darrick.wong@oracle.com>,
* Christoph Hellwig <hch@lst.de>,
* Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> and
* Ting Yao <tingyao@hust.edu.cn>.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Don't do a single array; attach them to fsparam_enum() entry
instead. And don't bother trying to embed the names into those -
it actually loses memory, with no real speedup worth mentioning.
Simplifies validation as well.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
As it is, vfs_parse_fs_string() makes "foo" and "foo=" indistinguishable;
both get fs_value_is_string for ->type and NULL for ->string. To make
it even more unpleasant, that combination is impossible to produce with
fsconfig().
Much saner rules would be
"foo" => fs_value_is_flag, NULL
"foo=" => fs_value_is_string, ""
"foo=bar" => fs_value_is_string, "bar"
All cases are distinguishable, all results are expressable by fsconfig(),
->has_value checks are much simpler that way (to the point of the field
being useless) and quite a few regressions go away (gfs2 has no business
accepting -o nodebug=, for example).
Partially based upon patches from Miklos.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
See MS-FSCC 2.4.43. Valid to be quried from most
Windows servers (among others).
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
SMB3.1.1 POSIX Context processing is not complete yet - so print warning
(once) if server returns it on open.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
We didn't have a dynamic trace point for catching errors in
file_write_and_wait_range error cases in cifs_strict_fsync.
Since not all apps check for write behind errors, it can be
important for debugging to be able to trace these error
paths.
Suggested-and-reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When mounting with -o modefromsid, the mode bits are stored in an
ACE. Directory enumeration (e.g. ls -l /mnt) triggers an SMB Query Dir
which does not include ACEs in its response. The mode bits in this
case are silently set to a default value of 755 instead.
This patch marks the dentry created during the directory enumeration
as needing re-evaluation (i.e. additional Query Info with ACEs) so
that the mode bits can be properly extracted.
Quick repro:
$ mount.cifs //win19.test/data /mnt -o ...,modefromsid
$ touch /mnt/foo && chmod 751 /mnt/foo
$ stat /mnt/foo
# reports 751 (OK)
$ sleep 2
# dentry older than 1s by default get invalidated
$ ls -l /mnt
# since dentry invalid, ls does a Query Dir
# and reports foo as 755 (WRONG)
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
After defer, a request will be prepared, that includes allocating iovec
if needed, and then submitted through io_wq_submit_work() but not custom
handler (e.g. io_rw_async()/io_sendrecv_async()). However, it'll leak
iovec, as it's in io-wq and the code goes as follows:
io_read() {
if (!io_wq_current_is_worker())
kfree(iovec);
}
Put all deallocation logic in io_{read,write,send,recv}(), which will
leave the memory, if going async with -EAGAIN.
It also fixes a leak after failed io_alloc_async_ctx() in
io_{recv,send}_msg().
Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make bitfields of size 1 bit be unsigned (since there is no room
for the sign bit).
This clears up the sparse warnings:
CHECK ../fs/io_uring.c
../fs/io_uring.c:207:50: error: dubious one-bit signed bitfield
../fs/io_uring.c:208:55: error: dubious one-bit signed bitfield
../fs/io_uring.c:209:63: error: dubious one-bit signed bitfield
../fs/io_uring.c:210:54: error: dubious one-bit signed bitfield
../fs/io_uring.c:211:57: error: dubious one-bit signed bitfield
Found by sight and then verified with sparse.
Fixes: 69b3e54613 ("io_uring: change io_ring_ctx bool fields into bit fields")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fail fast if can't grab mm, so past that requests always have an mm
when required. This allows us to remove req->user altogether.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The server var was accidentally used as an iterator over the global
list of connections, thus overwritten the passed argument. This
resulted in the wrong signing key being returned for extra channels.
Fix this by using a separate var to iterate.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
In gfs2_file_write_iter, for direct writes, the error checking in the buffered
write fallback case is incomplete. This can cause inode write errors to go
undetected. Fix and clean up gfs2_file_write_iter along the way.
Based on a proposed fix by Christoph Hellwig <hch@lst.de>.
Fixes: 967bcc91b0 ("gfs2: iomap direct I/O support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Set current->backing_dev_info just around the buffered write calls to
prepare for the next fix.
Fixes: 967bcc91b0 ("gfs2: iomap direct I/O support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When the first log header in a journal happens to have a sequence
number of 0, a bug in gfs2_find_jhead() causes it to prematurely exit,
and return an uninitialized jhead with seq 0. This can cause failures
in the caller. For instance, a mount fails in one test case.
The correct behavior is for it to continue searching through the journal
to find the correct journal head with the highest sequence number.
Fixes: f4686c26ec ("gfs2: read journal in large chunks")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This frees "copy->nf_src" before and again after the goto.
Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Use the 'refcount_t' type instead of 'atomic_t' for improved
refcounting safety.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
perf does not know how to deal with a __builtin_bswap32() call, and
complains. All other functions just store the xid etc in host endian
form, so let's do that in the tracepoint for nfsd_file_acquire too.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fixes coccicheck warning:
fs/fuse/readdir.c:335:1-19: WARNING: Assignment of 0/1 to bool variable
fs/fuse/file.c:1398:2-19: WARNING: Assignment of 0/1 to bool variable
fs/fuse/file.c:1400:2-20: WARNING: Assignment of 0/1 to bool variable
fs/fuse/cuse.c:454:1-20: WARNING: Assignment of 0/1 to bool variable
fs/fuse/cuse.c:455:1-19: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:497:2-17: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:504:2-23: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:511:2-22: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:518:2-23: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:522:2-26: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:526:2-18: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:1000:1-20: WARNING: Assignment of 0/1 to bool variable
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Allow fuse to pass RENAME_WHITEOUT to fuse server. Overlayfs on top of
virtiofs uses RENAME_WHITEOUT.
Without this patch renaming a directory in overlayfs (dir is on lower)
fails with -EINVAL. With this patch it works.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Handle the special case of fuse_readpages() wanting to read the last page
of a hugest file possible and overflowing the end offset in the process.
This is basically to unbreak xfstests:generic/525 and prevent filesystems
from doing bad things with an overflowing offset.
Reported-by: Xiao Yang <ice_yangxiao@163.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_direct_io() can end up advancing the iterator by more than the amount
of data read or written. This case is handled by the generic code if going
through ->direct_IO(), but not in the FOPEN_DIRECT_IO case.
Fix by reverting the extra bytes from the iterator in case of error or a
short count.
To test: install lxcfs, then the following testcase
int fd = open("/var/lib/lxcfs/proc/uptime", O_RDONLY);
sendfile(1, fd, NULL, 16777216);
sendfile(1, fd, NULL, 16777216);
will spew WARN_ON() in iov_iter_pipe().
Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 3c3db095b6 ("fuse: use iov_iter based generic splice helpers")
Cc: <stable@vger.kernel.org> # v5.1
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
A commonly used SMB3 feature is change notification, allowing an
app to be notified about changes to a directory. The SMB3
Notify request blocks until the server detects a change to that
directory or its contents that matches the completion flags
that were passed in and the "watch_tree" flag (which indicates
whether subdirectories under this directory should be also
included). See MS-SMB2 2.2.35 for additional detail.
To use this simply pass in the following structure to ioctl:
struct __attribute__((__packed__)) smb3_notify {
uint32_t completion_filter;
bool watch_tree;
} __packed;
using CIFS_IOC_NOTIFY 0x4005cf09
or equivalently _IOW(CIFS_IOCTL_MAGIC, 9, struct smb3_notify)
SMB3 change notification is supported by all major servers.
The ioctl will block until the server detects a change to that
directory or its subdirectories (if watch_tree is set).
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
When no interfaces are returned by the server we cannot open multiple
channels. Make it more obvious by reporting that to the user at the
VFS log level.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
RHBZ: 1795423
This is the SMB1 version of a patch we already have for SMB2
In recent DFS updates we have a new variable controlling how many times we will
retry to reconnect the share.
If DFS is not used, then this variable is initialized to 0 in:
static inline int
dfs_cache_get_nr_tgts(const struct dfs_cache_tgt_list *tl)
{
return tl ? tl->tl_numtgts : 0;
}
This means that in the reconnect loop in smb2_reconnect() we will immediately wrap retries to -1
and never actually get to pass this conditional:
if (--retries)
continue;
The effect is that we no longer reach the point where we fail the commands with -EHOSTDOWN
and basically the kernel threads are virtually hung and unkillable.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
- a set of patches that fixes various corner cases in mount and umount
code (Xiubo Li). This has to do with choosing an MDS, distinguishing
between laggy and down MDSes and parsing the server path.
- inode initialization fixes (Jeff Layton). The one included here
mostly concerns things like open_by_handle() and there is another
one that will come through Al.
- copy_file_range() now uses the new copy-from2 op (Luis Henriques).
The existing copy-from op turned out to be infeasible for generic
filesystem use; we disable the copy offload if OSDs don't support
copy-from2.
- a patch to link "rbd" and "block" devices together in sysfs (Hannes
Reinecke)
And a smattering of cleanups from Xiubo, Jeff and Chengguang.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl47PUcTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi6LoCACmVli5N6bgnBE4sTixi/jz6aCCbk32
ZPlKiSesHnOGkY6KXHJT58JYy0paITBRik5ypdz06J8aCOtWyPLbn3uCemF9CYn2
g6dId2Lf5vGFrgSm4YSiqp9a86IZmYSDG41LbJD/IJWFDWdMWqNPMDqji6yaIO5O
NJI5N0tk+VFXdV+JyjV9X/FnP1r1D2ReZzz21ZiqTJXSmE8YIkioLjkq36QTMMG7
Gm5qdlc1x2r4qfzA1g+OiWgRQCUMgkuYerFzus4mVbW4hrphsavH2DArbOwFmsXF
46hOq+1uGVVyZILLJfKNiktf1GExBF0icbSREJtmjUHbQvNR8BH0C+fV
=vvIc
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
- a set of patches that fixes various corner cases in mount and umount
code (Xiubo Li). This has to do with choosing an MDS, distinguishing
between laggy and down MDSes and parsing the server path.
- inode initialization fixes (Jeff Layton). The one included here
mostly concerns things like open_by_handle() and there is another one
that will come through Al.
- copy_file_range() now uses the new copy-from2 op (Luis Henriques).
The existing copy-from op turned out to be infeasible for generic
filesystem use; we disable the copy offload if OSDs don't support
copy-from2.
- a patch to link "rbd" and "block" devices together in sysfs (Hannes
Reinecke)
... and a smattering of cleanups from Xiubo, Jeff and Chengguang.
* tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client: (25 commits)
rbd: set the 'device' link in sysfs
ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
ceph: print name of xattr in __ceph_{get,set}xattr() douts
ceph: print r_direct_hash in hex in __choose_mds() dout
ceph: use copy-from2 op in copy_file_range
ceph: close holes in structs ceph_mds_session and ceph_mds_request
rbd: work around -Wuninitialized warning
ceph: allocate the correct amount of extra bytes for the session features
ceph: rename get_session and switch to use ceph_get_mds_session
ceph: remove the extra slashes in the server path
ceph: add possible_max_rank and make the code more readable
ceph: print dentry offset in hex and fix xattr_version type
ceph: only touch the caps which have the subset mask requested
ceph: don't clear I_NEW until inode metadata is fully populated
ceph: retry the same mds later after the new session is opened
ceph: check availability of mds cluster on mount after wait timeout
ceph: keep the session state until it is released
ceph: add __send_request helper
ceph: ensure we have a new cap before continuing in fill_inode
ceph: drop unused ttl_from parameter from fill_inode
...
- Refactor the metadata buffer functions to return the usual int error
value instead of the open coded error checking mess we have now.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl47QScACgkQ+H93GTRK
tOssRw//SysStMKUk0nsQOIB+Y0BqzmMyjuY7CLEOWpQeFh5MRMYH288KSCiI5k2
ljHEXeBUU7AoLQEegL5ivIMa7p4gfzkorRAYrcB7dcPwo4tYqwhfC97yU/5tYuxk
fOfm0ZaxJ0E+KNDBRd6vqe/lbWE24ySyZWxv7kzJs2ndc3RW4kEFzFFDGIVfi256
rPMzTxn7B4D0c359o4P0LGP5e5OeUeLH8FrvkITZCml7zMApdpo+eQzn1YxFRcGo
62daaO2uxtHBVnd30c1BhMPWfXGr+Pqls6QxZKr7YLvGSP5Jb6lRKnB9v3ImmjgH
OmOq+sXsVgKpNKo4lItnNJditAb0kR0UQHjmEccaUKbkAgEnGkSYqOtPbk2nkHw5
Eb05y+36DH20GRCp6lKbmdnFOxwL53pfWm8m3xieU/dE/gYH2bphFJNQokm50yaS
Onoz7zhdvqwLHQafnCLrwZWHVcsEQ1bjKC4nkWZdrcv6UlPYbuTKzJe3OaN79nE2
IFu9ilhX50M6dS2qsF0NDTEXrPAie6YOlikCvZotJIWaqpzEtWj/+t02jCtPhquC
M5yBYo0ljA3kYDUMZdng44FaO1h3E8MQIA/+dycyIQWTYIXYPB8mZni8D+YTVTbE
1jZT8qwBc83mTewYoOV5s+e9ja5hoHZtsZ/KnNssgSbQ66dq/7g=
=APXi
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.6-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull moar xfs updates from Darrick Wong:
"This contains the buffer error code refactoring I mentioned last week,
now that it has had extra time to complete the full xfs fuzz testing
suite to make sure there aren't any obvious new bugs"
* tag 'xfs-5.6-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix xfs_buf_ioerror_alert location reporting
xfs: remove unnecessary null pointer checks from _read_agf callers
xfs: make xfs_*read_agf return EAGAIN to ALLOC_FLAG_TRYLOCK callers
xfs: remove the xfs_btree_get_buf[ls] functions
xfs: make xfs_trans_get_buf return an error code
xfs: make xfs_trans_get_buf_map return an error code
xfs: make xfs_buf_read return an error code
xfs: make xfs_buf_get_uncached return an error code
xfs: make xfs_buf_get return an error code
xfs: make xfs_buf_read_map return an error code
xfs: make xfs_buf_get_map return an error code
xfs: make xfs_buf_alloc return an error code
- Added new "bootconfig".
Looks for a file appended to initrd to add boot config options.
This has been discussed thoroughly at Linux Plumbers.
Very useful for adding kprobes at bootup.
Only enabled if "bootconfig" is on the real kernel command line.
- Created dynamic event creation.
Merges common code between creating synthetic events and
kprobe events.
- Rename perf "ring_buffer" structure to "perf_buffer"
- Rename ftrace "ring_buffer" structure to "trace_buffer"
Had to rename existing "trace_buffer" to "array_buffer"
- Allow trace_printk() to work withing (some) tracing code.
- Sort of tracing configs to be a little better organized
- Fixed bug where ftrace_graph hash was not being protected properly
- Various other small fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXjtAURQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qshOAQDzopQmvAVrrI6oogghr8JQA30Z2yqT
i+Ld7vPWL2MV9wEA1S+zLGDSYrj8f/vsCq6BxRYT1ApO+YtmY6LTXiUejwg=
=WNds
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Added new "bootconfig".
This looks for a file appended to initrd to add boot config options,
and has been discussed thoroughly at Linux Plumbers.
Very useful for adding kprobes at bootup.
Only enabled if "bootconfig" is on the real kernel command line.
- Created dynamic event creation.
Merges common code between creating synthetic events and kprobe
events.
- Rename perf "ring_buffer" structure to "perf_buffer"
- Rename ftrace "ring_buffer" structure to "trace_buffer"
Had to rename existing "trace_buffer" to "array_buffer"
- Allow trace_printk() to work withing (some) tracing code.
- Sort of tracing configs to be a little better organized
- Fixed bug where ftrace_graph hash was not being protected properly
- Various other small fixes and clean ups
* tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (88 commits)
bootconfig: Show the number of nodes on boot message
tools/bootconfig: Show the number of bootconfig nodes
bootconfig: Add more parse error messages
bootconfig: Use bootconfig instead of boot config
ftrace: Protect ftrace_graph_hash with ftrace_sync
ftrace: Add comment to why rcu_dereference_sched() is open coded
tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu
tracing: Annotate ftrace_graph_hash pointer with __rcu
bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline
tracing: Use seq_buf for building dynevent_cmd string
tracing: Remove useless code in dynevent_arg_pair_add()
tracing: Remove check_arg() callbacks from dynevent args
tracing: Consolidate some synth_event_trace code
tracing: Fix now invalid var_ref_vals assumption in trace action
tracing: Change trace_boot to use synth_event interface
tracing: Move tracing selftests to bottom of menu
tracing: Move mmio tracer config up with the other tracers
tracing: Move tracing test module configs together
tracing: Move all function tracing configs together
tracing: Documentation for in-kernel synthetic event API
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl47MicQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpplgD/4wyOfyMQ601AaiXyBmG6lx7UV7kBBaDWAb
tlDEh/EWIioejMlYJC8UslLtrlxJS8jKCVJNOAz5zB9V6McLtHNxXNY5pRr4MRrc
2ztxFHuvy8s+LyztGxBh3DA+bT5UrMR/r6uu6Guh2TatFUZr4IOvBUBb6VeP9O1Z
sECCkzWZcmIq2gNSh7Dpxr31KdMQo7xngyMhFMh3CHBnDVZN6WX4ugNBJNb71MpY
ELH3SRY2uX15dlhatO5UYuAknJOA1VvlulYVWCuBj4UPyH0AAUJQiZJVEPwldCNL
qE4cS80Q5EMAFw32cOW/oyl8Z6oFQO5nwFQ+YPPhaZscjMsRteuqnt6qYSgXHJal
ze4mUBO9Z1byc9Gex1V5SHZSLzVw3HfgznSUfZrm+Tj2UkocJSaYtS9CzXR8x7tE
tD8ev4P3EH+axm4oUSWoA4Bro9eGgkV07ok2mCnxb9rJoV0JNHzUmVSzjF4G9HGK
GosVRRS4I4/nHIZQ3KTKp6apLOAn7SPTUkxqb0/M8qbRXZqQYylWhPsL2Q8aBgvT
8pQ2sIQ5AgOmzGKqKRofxbhIh8G+6Ddz97A+Omt47zLb8ccsoatXfEli7mMjtH4P
W/aUE0O8Kstma8gZN4LUxrnqKGncDVJMolozFyt5dWc9bIpxX0SmpDdiRzqyN1fw
k9L4Ox6hxg==
=RzPL
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.6-2020-02-05' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"Some later fixes for io_uring:
- Small cleanup series from Pavel
- Belt and suspenders build time check of sqe size and layout
(Stefan)
- Addition of ->show_fdinfo() on request of Jann Horn, to aid in
understanding mapped personalities
- eventfd recursion/deadlock fix, for both io_uring and aio
- Fixup for send/recv handling
- Fixup for double deferral of read/write request
- Fix for potential double completion event for close request
- Adjust fadvise advice async/inline behavior
- Fix for shutdown hang with SQPOLL thread
- Fix for potential use-after-free of fixed file table"
* tag 'io_uring-5.6-2020-02-05' of git://git.kernel.dk/linux-block:
io_uring: cleanup fixed file data table references
io_uring: spin for sq thread to idle on shutdown
aio: prevent potential eventfd recursion on poll
io_uring: put the flag changing code in the same spot
io_uring: iterate req cache backwards
io_uring: punt even fadvise() WILLNEED to async context
io_uring: fix sporadic double CQE entry for close
io_uring: remove extra ->file check
io_uring: don't map read/write iovec potentially twice
io_uring: use the proper helpers for io_send/recv
io_uring: prevent potential eventfd recursion on poll
eventfd: track eventfd_signal() recursion depth
io_uring: add BUILD_BUG_ON() to assert the layout of struct io_uring_sqe
io_uring: add ->show_fdinfo() for the io_uring file descriptor
fstests generic/471 reports a failure when run with MOUNT_OPTIONS="-o
dax". The reason is that the initial pwrite to an empty file with the
RWF_NOWAIT flag set does not return -EAGAIN. It turns out that
dax_iomap_rw doesn't pass that flag through to iomap_apply.
With this patch applied, generic/471 passes for me.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/x49r1z86e1d.fsf@segfault.boston.devel.redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Makes it easier to debug errors on writeback that happen later,
and are being returned on flush or fsync
For example:
writetest-17829 [002] .... 13583.407859: cifs_flush_err: ino=90 rc=-28
Signed-off-by: Steve French <stfrench@microsoft.com>
We ran into a confusing problem where an application wasn't checking
return code on close and so user didn't realize that the application
ran out of disk space. log a warning message (once) in these
cases. For example:
[ 8407.391909] Out of space writing to \\oleg-server\small-share
Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Oleg Kravtsov <oleg@tuxera.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
RHBZ: 1579050
If we have a soft mount we should fail commands for session-setup
failures (such as the password having changed/ account being deleted/ ...)
and return an error back to the application.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Add check for null cifs_sb to create_options helper
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Pull vfs recursive removal updates from Al Viro:
"We have quite a few places where synthetic filesystems do an
equivalent of 'rm -rf', with varying amounts of code duplication,
wrong locking, etc. That really ought to be a library helper.
Only debugfs (and very similar tracefs) are converted here - I have
more conversions, but they'd never been in -next, so they'll have to
wait"
* 'work.recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
simple_recursive_removal(): kernel-side rm -rf for ramfs-style filesystems
syzbot reports a use-after-free in io_ring_file_ref_switch() when it
tries to switch back to percpu mode. When we put the final reference to
the table by calling percpu_ref_kill_and_confirm(), we don't want the
zero reference to queue async work for flushing the potentially queued
up items. We currently do a few flush_work(), but they merely paper
around the issue, since the work item may not have been queued yet
depending on the when the percpu-ref callback gets run.
Coming into the file unregister, we know we have the ring quiesced.
io_ring_file_ref_switch() can check for whether or not the ref is dying
or not, and not queue anything async at that point. Once the ref has
been confirmed killed, flush any potential items manually.
Reported-by: syzbot+7caeaea49c2c8a591e3d@syzkaller.appspotmail.com
Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As part of io_uring shutdown, we cancel work that is pending and won't
necessarily complete on its own. That includes requests like poll
commands and timeouts.
If we're using SQPOLL for kernel side submission and we shutdown the
ring immediately after queueing such work, we can race with the sqthread
doing the submission. This means we may miss cancelling some work, which
results in the io_uring shutdown hanging forever.
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Currently, each time nfs4_do_fsinfo() is called it will do an implicit
NFS4 lease renewal, which is not compliant with the NFS4 specification.
This can result in a lease being expired by an NFS server.
Commit 83ca7f5ab3 ("NFS: Avoid PUTROOTFH when managing leases")
introduced implicit client lease renewal in nfs4_do_fsinfo(),
which can result in the NFSv4.0 lease to expire on a server side,
and servers returning NFS4ERR_EXPIRED or NFS4ERR_STALE_CLIENTID.
This can easily be reproduced by frequently unmounting a sub-mount,
then stat'ing it to get it mounted again, which will delay or even
completely prevent client from sending RENEW operations if no other
NFS operations are issued. Eventually nfs server will expire client's
lease and return an error on file access or next RENEW.
This can also happen when a sub-mount is automatically unmounted
due to inactivity (after nfs_mountpoint_expiry_timeout), then it is
mounted again via stat(). This can result in a short window during
which client's lease will expire on a server but not on a client.
This specific case was observed on production systems.
This patch removes the implicit lease renewal from nfs4_do_fsinfo().
Fixes: 83ca7f5ab3 ("NFS: Avoid PUTROOTFH when managing leases")
Signed-off-by: Robert Milkowski <rmilkowski@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently, if an nfs server returns NFS4ERR_EXPIRED to open(),
we return EIO to applications without even trying to recover.
Fixes: 272289a3df ("NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid")
Signed-off-by: Robert Milkowski <rmilkowski@gmail.com>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In _nfs42_proc_copy(), 'res->commit_res.verf' is allocated through
kzalloc() if 'args->sync' is true. In the following code, if
'res->synchronous' is false, handle_async_copy() will be invoked. If an
error occurs during the invocation, the following code will not be executed
and the error will be returned . However, the allocated
'res->commit_res.verf' is not deallocated, leading to a memory leak. This
is also true if the invocation of process_copy_commit() returns an error.
To fix the above leaks, redirect the execution to the 'out' label if an
error is encountered.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When the directory is large and it's being modified by one client
while another client is doing the 'ls -l' on the same directory then
the cache page invalidation from nfs_force_use_readdirplus causes
the reading client to keep restarting READDIRPLUS from cookie 0
which causes the 'ls -l' to take a very long time to complete,
possibly never completing.
Currently when nfs_force_use_readdirplus is called to switch from
READDIR to READDIRPLUS, it invalidates all the cached pages of the
directory. This cache page invalidation causes the next nfs_readdir
to re-read the directory content from cookie 0.
This patch is to optimise the cache invalidation in
nfs_force_use_readdirplus by only truncating the cached pages from
last page index accessed to the end the file. It also marks the
inode to delay invalidating all the cached page of the directory
until the next initial nfs_readdir of the next 'ls' instance.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
[Anna - Fix conflicts with Trond's readdir patches]
[Anna - Remove redundant call to nfs_zap_mapping()]
[Anna - Replace d_inode(file_dentry(desc->file)) with file_inode(desc->file)]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXjkxvQAKCRDh3BK/laaZ
PAgkAQDEW7kRzagMOd6cwX6uxfR9AIfpy56yjLySnuuVjwAnFAEAyebtop9j5hHk
LGLnG3wA+eOr2ljxlxIOuO49s9cMzQg=
=U1io
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs update from Miklos Szeredi:
- Try to preserve holes in sparse files when copying up, thus saving
disk space and improving performance.
- Fix a performance regression introduced in v4.19 by preserving
asynchronicity of IO when fowarding to underlying layers. Add VFS
helpers to submit async iocbs.
- Fix a regression in lseek(2) introduced in v4.19 that breaks >2G
seeks on 32bit kernels.
- Fix a corner case where st_ino/st_dev was not preserved across copy
up.
- Miscellaneous fixes and cleanups.
* tag 'ovl-update-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix lseek overflow on 32bit
ovl: add splice file read write helper
ovl: implement async IO routines
vfs: add vfs_iocb_iter_[read|write] helper functions
ovl: layer is const
ovl: fix corner case of non-constant st_dev;st_ino
ovl: fix corner case of conflicting lower layer uuid
ovl: generalize the lower_fs[] array
ovl: simplify ovl_same_sb() helper
ovl: generalize the lower_layers[] array
ovl: improving copy-up efficiency for big sparse file
ovl: use ovl_inode_lock in ovl_llseek()
ovl: use pr_fmt auto generate prefix
ovl: fix wrong WARN_ON() in ovl_cache_update_ino()
The pte_hole() callback is called at multiple levels of the page tables.
Code dumping the kernel page tables needs to know what at what depth the
missing entry is. Add this is an extra parameter to pte_hole(). When the
depth isn't know (e.g. processing a vma) then -1 is passed.
The depth that is reported is the actual level where the entry is missing
(ignoring any folding that is in place), i.e. any levels where
PTRS_PER_P?D is set to 1 are ignored.
Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
natural numbers as levels 2/3/4.
Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Tested-by: Zong Li <zong.li@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If max_pfn does not fall onto a section boundary, it is possible to
inspect PFNs up to max_pfn, and PFNs above max_pfn, however, max_pfn
itself can't be inspected. We can have a valid (and online) memmap at and
above max_pfn if max_pfn is not aligned to a section boundary. The whole
early section has a memmap and is marked online. Being able to inspect
the state of these PFNs is valuable for debugging, especially because
max_pfn can change on memory hotplug and expose these memmaps.
Also, querying page flags via "./page-types -r -a 0x144001,"
(tools/vm/page-types.c) inside a x86-64 guest with 4160MB under QEMU
results in an (almost) endless loop in user space, because the end is not
detected properly when starting after max_pfn.
Instead, let's allow to inspect all pages in the highest section and
return 0 directly if we try to access pages above that section.
While at it, check the count before adjusting it, to avoid masking user
errors.
Link: http://lkml.kernel.org/r/20191211163201.17179-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we have nested or circular eventfd wakeups, then we can deadlock if
we run them inline from our poll waitqueue wakeup handler. It's also
possible to have very long chains of notifications, to the extent where
we could risk blowing the stack.
Check the eventfd recursion count before calling eventfd_signal(). If
it's non-zero, then punt the signaling to async context. This is always
safe, as it takes us out-of-line in terms of stack and locking context.
Cc: stable@vger.kernel.org # 4.19+
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both iocb_flags() and kiocb_set_rw_flags() are inline and modify
kiocb->ki_flags. Place them close, so they can be potentially better
optimised.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Grab requests from cache-array from the end, so can get by only
free_reqs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Andres correctly points out that read-ahead can block, if it needs to
read in meta data (or even just through the page cache page allocations).
Play it safe for now and just ensure WILLNEED is also punted to async
context.
While in there, allow the file settings hints from non-blocking
context. They don't need to start/do IO, and we can safely do them
inline.
Fixes: 4840e418c2 ("io_uring: add IORING_OP_FADVISE")
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We punt close to async for the final fput(), but we log the completion
even before that even in that case. We rely on the request not having
a files table assigned to detect what the final async close should do.
However, if we punt the async queue to __io_queue_sqe(), we'll get
->files assigned and this makes io_close_finish() think it should both
close the filp again (which does no harm) AND log a new CQE event for
this request. This causes duplicate CQEs.
Queue the request up for async manually so we don't grab files
needlessly and trigger this condition.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It won't ever get into io_prep_rw() when req->file haven't been set in
io_req_set_file(), hence remove the check.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we have a read/write that is deferred, we already setup the async IO
context for that request, and mapped it. When we later try and execute
the request and we get -EAGAIN, we don't want to attempt to re-map it.
If we do, we end up with garbage in the iovec, which typically leads
to an -EFAULT or -EINVAL completion.
Cc: stable@vger.kernel.org # 5.5
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't use the recvmsg/sendmsg helpers, use the same helpers that the
recv(2) and send(2) system calls use.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we have nested or circular eventfd wakeups, then we can deadlock if
we run them inline from our poll waitqueue wakeup handler. It's also
possible to have very long chains of notifications, to the extent where
we could risk blowing the stack.
Check the eventfd recursion count before calling eventfd_signal(). If
it's non-zero, then punt the signaling to async context. This is always
safe, as it takes us out-of-line in terms of stack and locking context.
Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
eventfd use cases from aio and io_uring can deadlock due to circular
or resursive calling, when eventfd_signal() tries to grab the waitqueue
lock. On top of that, it's also possible to construct notification
chains that are deep enough that we could blow the stack.
Add a percpu counter that tracks the percpu recursion depth, warn if we
exceed it. The counter is also exposed so that users of eventfd_signal()
can do the right thing if it's non-zero in the context where it is
called.
Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When "backup intent" is requested on the mount (e.g. backupuid or
backupgid mount options), the corresponding flag was missing from
some of the operations.
Change all operations to use the macro cifs_create_options() to
set the backup intent flag if needed.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Now that the page cache locking is repaired, we should be able to
switch to using iterate_shared() for improved concurrency when
doing readdir().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The directory strings stored in the readdir cache may be used with
printk(), so it is better to ensure they are nul-terminated.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When a NFS directory page cache page is removed from the page cache,
its contents are freed through a call to nfs_readdir_clear_array().
To prevent the removal of the page cache entry until after we've
finished reading it, we must take the page lock.
Fixes: 11de3b11e0 ("NFS: Fix a memory leak in nfs_readdir")
Cc: stable@vger.kernel.org # v2.6.37+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
nfs_readdir_xdr_to_array() must not exit without having initialised
the array, so that the page cache deletion routines can safely
call nfs_readdir_clear_array().
Furthermore, we should ensure that if we exit nfs_readdir_filler()
with an error, we free up any page contents to prevent a leak
if we try to fill the page again.
Fixes: 11de3b11e0 ("NFS: Fix a memory leak in nfs_readdir")
Cc: stable@vger.kernel.org # v2.6.37+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When we already know the string length, it is more efficient to
use kmemdup_nul().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
[Anna - Changes to super.c were already made during fscontext conversion]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Delegations can be expensive to return, and can cause scalability issues
for the server. Let's therefore try to limit the number of inactive
delegations we hold.
Once the number of delegations is above a certain threshold, start
to return them on close.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In order to better manage our delegation caching, add a counter
to track the number of active delegations.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Add a routine to return the delegation immediately upon close of the
file if it was marked for return-on-close.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If a delegation is marked as needing to be returned when the file is
closed, then don't clear that marking until we're ready to return
it.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In particular, the pnfs return-on-close code will check for that flag,
so ensure we set it appropriately.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We want to find open contexts that match our filesystem access
properties. They don't have to exactly match the cred.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We do not need to have the rcu lookup method fail in the case where
the fsuid/fsgid and supplemental groups match.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When comparing two 'struct cred' for equality w.r.t. behaviour under
filesystem access, we need to use cred_fscmp().
Fixes: a52458b48a ("NFS/NFSD/SUNRPC: replace generic creds with 'struct cred'.")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl44NzMACgkQxWXV+ddt
WDvSMA/+MYGBB/+HIET55a70hoLZR1n21joLPYat+RZnY//vBpVQtHewIHNoqBi3
EWY7rB+OWomIMiHRw4gS4nKWpds3Ou8minUnAVmbP86irfu2e1mip9rQWT6EO7ne
KWd52hb41M4ZG2Oq/2XZdMu49IGUIgUBShioJQAN7VimTI8XSX1mQn0N9pvkRk3s
IX/77kmf5jolO3/hZOJDCN6+LsI1inN6TkEH8ODKC+0ounGN+TcQQydJlfjZ+4n/
BH5G9mpm5FmFQWKp14vfyc2jknwoO9ryd7Mez5Vf70xuFCMQw+Z0ZKyLDeMLGhur
dCV3j57/+XUwsSflT/Q/cmIQZyXIdmShOxcHi9QMnax5lf6XrMgkvEjjzfGGzlzU
ey8f8hTkqkH7KM89G8pvla+DN1It4xzs9kLYczS49pWT5qn/15l7FcRMdwOR37mU
QFcDTfXEhQ5wPbpaNYDWGVycFugyyxgxBgpnSbOpgmvVS/i3qJoChUXGJrM3HMyx
Xsej/oLUnYsBEBe20mEaVp/j288NnQdMo58C+BRGtUEMC4QZM/tg3HUmGJCXqGI1
PXJx8qPPs1ZR4U4ciuWwDAim27LlD04NMGlf/r3ABFaPLMAiPKaVR93Ny0nIEkBA
iPHJD5xVKKmhJnRAWVxz3ZyyRGbjG9IB6syawYsSWMu1qH8z150=
=BuZc
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs updates from David Sterba:
"Fixes that arrived after the merge window freeze, mostly stable
material.
- fix race in tree-mod-log element tracking
- fix bio flushing inside extent writepages
- fix assertion when in-memory tracking of discarded extents finds an
empty tree (eg. after adding a new device)
- update logic of temporary read-only block groups to take into
account overcommit
- fix some fixup worker corner cases:
- page could not go through proper COW cycle and the dirty status
is lost due to page migration
- deadlock if delayed allocation is performed under page lock
- fix send emitting invalid clones within the same file
- fix statfs reporting 0 free space when global block reserve size is
larger than remaining free space but there is still space for new
chunks"
* tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: do not zero f_bavail if we have available space
Btrfs: send, fix emission of invalid clone operations within the same file
btrfs: do not do delalloc reservation under page lock
btrfs: drop the -EBUSY case in __extent_writepage_io
Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker
btrfs: take overcommit into account in inc_block_group_ro
btrfs: fix force usage in inc_block_group_ro
btrfs: Correctly handle empty trees in find_first_clear_extent_bit
btrfs: flush write bio if we loop in extent_write_cache_pages
Btrfs: fix race between adding and putting tree mod seq elements and nodes
In old days, the "host-progs" syntax was used for specifying host
programs. It was renamed to the current "hostprogs-y" in 2004.
It is typically useful in scripts/Makefile because it allows Kbuild to
selectively compile host programs based on the kernel configuration.
This commit renames like follows:
always -> always-y
hostprogs-y -> hostprogs
So, scripts/Makefile will look like this:
always-$(CONFIG_BUILD_BIN2C) += ...
always-$(CONFIG_KALLSYMS) += ...
...
hostprogs := $(always-y) $(always-m)
I think this makes more sense because a host program is always a host
program, irrespective of the kernel configuration. We want to specify
which ones to compile by CONFIG options, so always-y will be handier.
The "always", "hostprogs-y", "hostprogs-m" will be kept for backward
compatibility for a while.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
MNT_fhs_status_sz/MNT_fhandle3_sz are never used after they were
introduced. So better to remove them.
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
FIBMAP receives an integer from userspace which is then implicitly converted
into sector_t to be passed to bmap(). No check is made to ensure userspace
didn't send a negative block number, which can end up in an underflow, and
returning to userspace a corrupted block address.
As a side-effect, the underflow caused by a negative block here, will
trigger the WARN() in iomap_bmap_actor(), which is how this issue was
first discovered.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now we have the possibility of proper error return in bmap, use bmap()
function in ioctl_fibmap() instead of calling ->bmap method directly.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Replace direct ->bmap calls by bmap() method.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Replace the direct usage of ->bmap method by a bmap() call.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
By now, bmap() will either return the physical block number related to
the requested file offset or 0 in case of error or the requested offset
maps into a hole.
This patch makes the needed changes to enable bmap() to proper return
errors, using the return value as an error return, and now, a pointer
must be passed to bmap() to be filled with the mapped physical block.
It will change the behavior of bmap() on return:
- negative value in case of error
- zero on success or map fell into a hole
In case of a hole, the *block will be zero too
Since this is a prep patch, by now, the only error return is -EINVAL if
->bmap doesn't exist.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ovl_lseek() is using ssize_t to return the value from vfs_llseek(). On a
32-bit kernel ssize_t is a 32-bit signed int, which overflows above 2 GB.
Assign the return value of vfs_llseek() to loff_t to fix this.
Reported-by: Boris Gjenero <boris.gjenero@gmail.com>
Fixes: 9e46b840c7 ("ovl: support stacked SEEK_HOLE/SEEK_DATA")
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
There was some logic added a while ago to clear out f_bavail in statfs()
if we did not have enough free metadata space to satisfy our global
reserve. This was incorrect at the time, however didn't really pose a
problem for normal file systems because we would often allocate chunks
if we got this low on free metadata space, and thus wouldn't really hit
this case unless we were actually full.
Fast forward to today and now we are much better about not allocating
metadata chunks all of the time. Couple this with d792b0f197 ("btrfs:
always reserve our entire size for the global reserve") which now means
we'll easily have a larger global reserve than our free space, we are
now more likely to trip over this while still having plenty of space.
Fix this by skipping this logic if the global rsv's space_info is not
full. space_info->full is 0 unless we've attempted to allocate a chunk
for that space_info and that has failed. If this happens then the space
for the global reserve is definitely sacred and we need to report
b_avail == 0, but before then we can just use our calculated b_avail.
Reported-by: Martin Steigerwald <martin@lichtvoll.de>
Fixes: ca8a51b3a9 ("btrfs: statfs: report zero available if metadata are exhausted")
CC: stable@vger.kernel.org # 4.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl41xEQACgkQiiy9cAdy
T1E4XQv/UzE0s7redt544+JtirghSwKxdawcz065dnHheWLnQeigO8zgL0iJck7t
+mwcYrDHuATk0GoAE3bGnh5mLH30sKJ3xdF6w76TWDykZVknLFU2QJoz5oWAycLr
5Y1gy647E6774jbOEvypUmm7yIjGXozDcmo3LabkvkVlVb3iL6S5bC/wdiLHneUB
PltwbpHHqrhai0YfiJyLiM220x1mhvFIWGfWICWV75BzVhBDt9hhQ0kE97rXO9DB
Uu1DcEBqe8QuwL5AH6DXnjq3v5M2n9AFP/qdXZFWjloOIwTrgBD9E0nE7BYEooM9
dXCHe3mxlIb/6CfGaEwAxXhNuN6+C/bn87MRhphC95Owwv5g4uczZroX1ChZio5v
t/yWwKoYnydaKQhtBKOGfrFAMzJsAYir0gOHhr+CKiCpYUHRShf7DduhKLFn3txS
1yZbqiRJ6XouJWt1L/wvHvQAYlhGac4Q9ZRmB9PGRRxjq5x79Du7JxTUrkKB/Y8C
GrI19E38
=eYF5
-----END PGP SIGNATURE-----
Merge tag '5.6-rc-small-smb3-fix-for-stable' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fix from Steve French:
"Small SMB3 fix for stable (fixes problem with soft mounts)"
* tag '5.6-rc-small-smb3-fix-for-stable' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number
cifs: fix soft mounts hanging in the reconnect code
Brown paperbag time: fetching ->i_uid/->i_mode really should've been
done from nd->inode. I even suggested that, but the reason for that has
slipped through the cracks and I went for dir->d_inode instead - made
for more "obvious" patch.
Analysis:
- at the entry into do_last() and all the way to step_into(): dir (aka
nd->path.dentry) is known not to have been freed; so's nd->inode and
it's equal to dir->d_inode unless we are already doomed to -ECHILD.
inode of the file to get opened is not known.
- after step_into(): inode of the file to get opened is known; dir
might be pointing to freed memory/be negative/etc.
- at the call of may_create_in_sticky(): guaranteed to be out of RCU
mode; inode of the file to get opened is known and pinned; dir might
be garbage.
The last was the reason for the original patch. Except that at the
do_last() entry we can be in RCU mode and it is possible that
nd->path.dentry->d_inode has already changed under us.
In that case we are going to fail with -ECHILD, but we need to be
careful; nd->inode is pointing to valid struct inode and it's the same
as nd->path.dentry->d_inode in "won't fail with -ECHILD" case, so we
should use that.
Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com>
Reported-by: syzbot+190005201ced78a74ad6@syzkaller.appspotmail.com
Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Fixes: d0cb50185a ("do_last(): fetch directory ->i_mode and ->i_uid before it's too late")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix some corner cases on filesystems with a block size < page size.
- Fix a corner case that could expose incorrect access times over nfs.
- Revert an otherwise sensible revoke accounting cleanup that causes
assertion failures. The revoke accounting is whacky and needs to be
fixed properly before we can add back this cleanup.
- Various other minor cleanups.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJeNEWRAAoJENW/n+sDE2U6EBIP/1OMCdASk1ykAzhoO3UsI8Fa
9ktBZ7/tnmixHvPbAkPi+FLXFqfAOXRnY8tcWKdpJ5Mdesfhdm9HbSr8BtRf8l9r
LqdQvd4F2lVemtpPIiU8MojSpmoJXs6shZvIxrLzUS5JaDVUxmoLc066otdTFya8
FASuOzPxpCE6Rfdk+f+tBYF+UsBXY8w3w/hOiITKcLkqbCO8xnMuJzspPSs64qDN
LtYJqJLe7THan2wEW20gtqpkZDX+WWYuBhdTWqZYs1Rfg17/ohcBYSB2kuvonLlq
C4P/aS56U4stD+BefvI11iMnKlv6QQ+4KD1A7QWayvIPv9Lu1kvb/L7F/gamlUMM
5LPI/3J0GSgK1fh3nDULYUXGvJqqBp2klabNAOCRA7lhZUBjU/wGTTDll6OM+K2O
0K6HgyVp92xAyWhi/0LVRb3cA7xru5kZK+vzzdloTweCuPFMiF3IlYI4MMr3I1V1
+/4DvpzqMjHx6t9sXb2oDjnvljedB4n3fH9YCVpl6wD/PnJ60gXm4tonrfLzkSII
axfc5bEXaKXp43PuH9wbHsuNeOEGvEY7PG+FYbH5cV2BNGWviXYEGrNHop4gOQp9
2bDZQJGEYvrvnjGe82ne7gmtDwRjp3+ovx/EBNp72G4mMb4igDnAAuSd0uzumatM
9zGAa5+le4ZlpjNrJbWO
=oo06
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Fix some corner cases on filesystems with a block size < page size.
- Fix a corner case that could expose incorrect access times over nfs.
- Revert an otherwise sensible revoke accounting cleanup that causes
assertion failures. The revoke accounting is whacky and needs to be
fixed properly before we can add back this cleanup.
- Various other minor cleanups.
In addition, please expect to see another pull request from Bob Peterson
about his gfs2 recovery patch queue shortly.
* tag 'gfs2-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
Revert "gfs2: eliminate tr_num_revoke_rm"
gfs2: remove unused LBIT macros
fs/gfs2: remove unused IS_DINODE and IS_LEAF macros
gfs2: Remove GFS2_MIN_LVB_SIZE define
gfs2: Fix incorrect variable name
gfs2: Avoid access time thrashing in gfs2_inode_lookup
gfs2: minor cleanup: remove unneeded variable ret in gfs2_jdata_writepage
gfs2: eliminate ssize parameter from gfs2_struct2blk
gfs2: Another gfs2_find_jhead fix
- Fix an off-by-one error when checking if offset is within inode size
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl4vLo4ACgkQ+H93GTRK
tOtFNg/+I0IVdJgzMPe6PoEhbMc80c1CK7yxjprtq6I1WV5/OVbpNSnXqym8vYPs
atp+8loh9LMMYhi8ln0DyU0clu8p97+7D/aUqkN4DzKJ3t0T/LFR8NPHh7P7H3i/
rI0Ym8pTorueC3qgcVzgP2Xx8+yoURvlRV0VSd8C9Vza9E3g7U+8cAAzZz/iMbpy
QbW32mrEWWtACLqVEj94+sNj9O4wHy1kefkBITWBkdmaWvqP6KBWJsdQrySHNHN4
BXEwOrfxM5w8isd9hOd3U6NPkY0I7EGq7K4aQhRzyoIw9LJ+APjpCvY1KpmFs1Ar
7yOM026MdZKGguL1cZcgXzFpIKbkv4xj4231RD6yKXUk1XhMIs+7JO+ontRekPg3
2oZgPJoz+5QXtcpJifKzwzwr32WFssX/HhYFlzbbq1SRlJBWEyKfPBXY8yoeAJJb
lHK/fjsPwGeTgMG0/n3+r/GhO+3FDzDosFYHpD0IZIX08770tam7nB4hlGrJEel1
74/cxxphNeU1PkjV43V4vxTNucEkFNn+CDD9/9MvEx8ZXtz01HkIbEsANdyZM7yj
SO0wjkHQjkm9YlBP+lh03ign4YQYYVQLN2zdw3VzK2QYo0GKln+o/wWrXhJksZ8t
AT70yqYx2lntnkasvMPO02pgZulorM605mgUGkbwslcmuifvpo0=
=iPhi
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fix from Darrick Wong:
"A single patch fixing an off-by-one error when we're checking to see
how far we're gotten into an EOF page"
* tag 'iomap-5.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fs: Fix page_mkwrite off-by-one errors
Pull updates from Andrew Morton:
"Most of -mm and quite a number of other subsystems: hotfixes, scripts,
ocfs2, misc, lib, binfmt, init, reiserfs, exec, dma-mapping, kcov.
MM is fairly quiet this time. Holidays, I assume"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
kcov: ignore fault-inject and stacktrace
include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()
execve: warn if process starts with executable stack
reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
init/main.c: fix misleading "This architecture does not have kernel memory protection" message
init/main.c: fix quoted value handling in unknown_bootoption
init/main.c: remove unnecessary repair_env_string in do_initcall_level
init/main.c: log arguments and environment passed to init
fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
fs/binfmt_elf.c: coredump: delete duplicated overflow check
fs/binfmt_elf.c: coredump: allocate core ELF header on stack
fs/binfmt_elf.c: make BAD_ADDR() unlikely
fs/binfmt_elf.c: better codegen around current->mm
fs/binfmt_elf.c: don't copy ELF header around
fs/binfmt_elf.c: fix ->start_code calculation
fs/binfmt_elf.c: smaller code generation around auxv vector fill
lib/find_bit.c: uninline helper _find_next_bit()
lib/find_bit.c: join _find_next_bit{_le}
uapi: rename ext2_swab() to swab() and share globally in swab.h
lib/scatterlist.c: adjust indentation in __sg_alloc_table
...
There were few episodes of silent downgrade to an executable stack over
years:
1) linking innocent looking assembly file will silently add executable
stack if proper linker options is not given as well:
$ cat f.S
.intel_syntax noprefix
.text
.globl f
f:
ret
$ cat main.c
void f(void);
int main(void)
{
f();
return 0;
}
$ gcc main.c f.S
$ readelf -l ./a.out
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RWE 0x10
^^^
2) converting C99 nested function into a closure
https://nullprogram.com/blog/2019/11/15/
void intsort2(int *base, size_t nmemb, _Bool invert)
{
int cmp(const void *a, const void *b)
{
int r = *(int *)a - *(int *)b;
return invert ? -r : r;
}
qsort(base, nmemb, sizeof(*base), cmp);
}
will silently require stack trampolines while non-closure version will
not.
Without doubt this behaviour is documented somewhere, add a warning so
that developers and users can at least notice. After so many years of
x86_64 having proper executable stack support it should not cause too
many problems.
Link: http://lkml.kernel.org/r/20191208171918.GC19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Will Deacon <will@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable inode may be NULL in reiserfs_insert_item(), but there is
no check before accessing the member of inode.
Fix this by adding NULL pointer check before calling reiserfs_debug().
Link: http://lkml.kernel.org/r/79c5135d-ff25-1cc9-4e99-9f572b88cc00@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Cc: zhengbin <zhengbin13@huawei.com>
Cc: Hu Shiyuan <hushiyuan@huawei.com>
Cc: Feilong Lin <linfeilong@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Unmapping whole address space at once with
munmap(0, (1ULL<<47) - 4096)
or equivalent will create empty coredump.
It is silly way to exit, however registers content may still be useful.
The right to coredump is fundamental right of a process!
Link: http://lkml.kernel.org/r/20191222150137.GA1277@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Comment says ELF header is "too large to be on stack". 64 bytes on
64-bit is not large by any means.
Link: http://lkml.kernel.org/r/20191222143850.GA24341@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If some mapping goes past TASK_SIZE it will be rejected by kernel which
means no such userspace binaries exist.
Mark every such check as unlikely.
Link: http://lkml.kernel.org/r/20191215124355.GA21124@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"current->mm" pointer is stable in general except few cases one of which
execve(2). Compiler can't treat is as stable but it _is_ stable most of
the time. During ELF loading process ->mm becomes stable right after
flush_old_exec().
Help compiler by caching current->mm, otherwise it continues to refetch
it.
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-141 (-141)
Function old new delta
elf_core_dump 5062 5039 -23
load_elf_binary 5426 5308 -118
Note: other cases are left as is because it is either pessimisation or
no change in binary size.
Link: http://lkml.kernel.org/r/20191215124755.GB21124@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ELF header is read into bprm->buf[] by generic execve code.
Save a memcpy and allocate just one header for the interpreter instead
of two headers (64 bytes instead of 128 on 64-bit).
Link: http://lkml.kernel.org/r/20191208171242.GA19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only executable segments should be accounted to ->start_code just like
they do to ->end_code (correctly).
Link: http://lkml.kernel.org/r/20191208171410.GB19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Filling auxv vector as array with index (auxv[i++] = ...) generates
terrible code. "saved_auxv" should be reworked because it is the worst
member of mm_struct by size/usefullness ratio but do it later.
Meanwhile help gcc a little with *auxv++ idiom.
Space savings on x86_64:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-127 (-127)
Function old new delta
load_elf_binary 5470 5343 -127
Link: http://lkml.kernel.org/r/20191208172301.GD19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to benefit from s390 zlib hardware compression support,
increase the btrfs zlib workspace buffer size from 1 to 4 pages (if s390
zlib hardware support is enabled on the machine).
This brings up to 60% better performance in hardware on s390 compared to
the PAGE_SIZE buffer and much more compared to the software zlib
processing in btrfs. In case of memory pressure, fall back to a single
page buffer during workspace allocation.
The data compressed with larger input buffers will still conform to zlib
standard and thus can be decompressed also on a systems that uses only
PAGE_SIZE buffer for btrfs zlib.
Link: http://lkml.kernel.org/r/20200108105103.29028-1-zaslonko@linux.ibm.com
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Eduard Shishkin <edward6@linux.ibm.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to provide a clearer, more symmetric API for pinning and
unpinning DMA pages. This way, pin_user_pages*() calls match up with
unpin_user_pages*() calls, and the API is a lot closer to being
self-explanatory.
Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().
In partial anticipation of this work, the io_uring code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required here is
to change get_user_pages() to pin_user_pages().
Link: http://lkml.kernel.org/r/20200107224558.2362728-17-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For the uniform format, we use ocfs2_update_inode_fsync_trans() to
access t_tid in handle->h_transaction
Link: http://lkml.kernel.org/r/6ff9a312-5f7d-0e27-fb51-bc4e062fcd97@huawei.com
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are users already and will be more of BITS_TO_BYTES() macro. Move
it to bitops.h for wider use.
In the case of ocfs2 the replacement is identical.
As for bnx2x, there are two places where floor version is used. In the
first case to calculate the amount of structures that can fit one memory
page. In this case obviously the ceiling variant is correct and
original code might have a potential bug, if amount of bits % 8 is not
0. In the second case the macro is used to calculate bytes transmitted
in one microsecond. This will work for all speeds which is multiply of
1Gbps without any change, for the rest new code will give ceiling value,
for instance 100Mbps will give 13 bytes, while old code gives 12 bytes
and the arithmetically correct one is 12.5 bytes. Further the value is
used to setup timer threshold which in any case has its own margins due
to certain resolution. I don't see here an issue with slightly shifting
thresholds for low speed connections, the card is supposed to utilize
highest available rate, which is usually 10Gbps.
Link: http://lkml.kernel.org/r/20200108121316.22411-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable ret is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses Coverity ("Unused value")
Link: http://lkml.kernel.org/r/20191202164833.62865-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gang He reports the failure of building fs/ocfs2/ as an external module
of the kernel installed on the system:
$ cd fs/ocfs2
$ make -C /lib/modules/`uname -r`/build M=`pwd` modules
If you want to make it work reliably, I'd recommend to remove ccflags-y
from the Makefiles, and to make header paths relative to the C files. I
think this is the correct usage of the #include "..." directive.
Link: http://lkml.kernel.org/r/20191227022950.14804-1-ghe@suse.com
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Gang He <ghe@suse.com>
Reported-by: Gang He <ghe@suse.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the only caller of dlm_migrate_lockres() - dlm_empty_lockres(),
target is checked for O2NM_MAX_NODES. Thus, the assertion in
dlm_migrate_lockres() is unnecessary and can be removed. The patch
eliminates such a check.
Link: http://lkml.kernel.org/r/20191218194111.26041-1-pakki001@umn.edu
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures. In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shutdown
the bdi_writeback structure (or wb), and part of that writeback ensures
that no other work queued against the wb, and that the wb is fully
drained.
With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi. There is a refcount which prevents
the bdi object from being released (and hence, unregistered). So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).
Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly. It does
this without informing the file system, or the memcg code, or anything
else. This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shutdown. So when
one of these wb's are woken up to do delayed work, they try to
dereference their wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk(). As a result, *boom*.
Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.
The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on. It was triggering
several times a day in a heavily loaded production environment.
Google Bug Id: 145475544
Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When doing an incremental send and a file has extents shared with itself
at different file offsets, it's possible for send to emit clone operations
that will fail at the destination because the source range goes beyond the
file's current size. This happens when the file size has increased in the
send snapshot, there is a hole between the shared extents and both shared
extents are at file offsets which are greater the file's size in the
parent snapshot.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xf1 0 64K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/1.snap /mnt/sdb/base
# Create a 320K extent at file offset 512K.
$ xfs_io -c "pwrite -S 0xab 512K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0xcd 576K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0xef 640K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0x64 704K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0x73 768K 64K" /mnt/sdb/foobar
# Clone part of that 320K extent into a lower file offset (192K).
# This file offset is greater than the file's size in the parent
# snapshot (64K). Also the clone range is a bit behind the offset of
# the 320K extent so that we leave a hole between the shared extents.
$ xfs_io -c "reflink /mnt/sdb/foobar 448K 192K 192K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -p /mnt/sdb/base -f /tmp/2.snap /mnt/sdb/incr
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/1.snap /mnt/sdc
$ btrfs receive -f /tmp/2.snap /mnt/sdc
ERROR: failed to clone extents to foobar: Invalid argument
The problem is that after processing the extent at file offset 256K, which
refers to the first 128K of the 320K extent created by the buffered write
operations, we have 'cur_inode_next_write_offset' set to 384K, which
corresponds to the end offset of the partially shared extent (256K + 128K)
and to the current file size in the receiver. Then when we process the
extent at offset 512K, we do extent backreference iteration to figure out
if we can clone the extent from some other inode or from the same inode,
and we consider the extent at offset 256K of the same inode as a valid
source for a clone operation, which is not correct because at that point
the current file size in the receiver is 384K, which corresponds to the
end of last processed extent (at file offset 256K), so using a clone
source range from 256K to 256K + 320K is invalid because that goes past
the current size of the file (384K) - this makes the receiver get an
-EINVAL error when attempting the clone operation.
So fix this by excluding clone sources that have a range that goes beyond
the current file size in the receiver when iterating extent backreferences.
A test case for fstests follows soon.
Fixes: 11f2069c11 ("Btrfs: send, allow clone operations within the same file")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We ran into a deadlock in production with the fixup worker. The stack
traces were as follows:
Thread responsible for the writeout, waiting on the page lock
[<0>] io_schedule+0x12/0x40
[<0>] __lock_page+0x109/0x1e0
[<0>] extent_write_cache_pages+0x206/0x360
[<0>] extent_writepages+0x40/0x60
[<0>] do_writepages+0x31/0xb0
[<0>] __writeback_single_inode+0x3d/0x350
[<0>] writeback_sb_inodes+0x19d/0x3c0
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x231/0x2c0
[<0>] wb_workfn+0x308/0x3c0
[<0>] process_one_work+0x1e0/0x390
[<0>] worker_thread+0x2b/0x3c0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
Thread of the fixup worker who is holding the page lock
[<0>] start_delalloc_inodes+0x241/0x2d0
[<0>] btrfs_start_delalloc_roots+0x179/0x230
[<0>] btrfs_alloc_data_chunk_ondemand+0x11b/0x2e0
[<0>] btrfs_check_data_free_space+0x53/0xa0
[<0>] btrfs_delalloc_reserve_space+0x20/0x70
[<0>] btrfs_writepage_fixup_worker+0x1fc/0x2a0
[<0>] normal_work_helper+0x11c/0x360
[<0>] process_one_work+0x1e0/0x390
[<0>] worker_thread+0x2b/0x3c0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
Thankfully the stars have to align just right to hit this. First you
have to end up in the fixup worker, which is tricky by itself (my
reproducer does DIO reads into a MMAP'ed region, so not a common
operation). Then you have to have less than a page size of free data
space and 0 unallocated space so you go down the "commit the transaction
to free up pinned space" path. This was accomplished by a random
balance that was running on the host. Then you get this deadlock.
I'm still in the process of trying to force the deadlock to happen on
demand, but I've hit other issues. I can still trigger the fixup worker
path itself so this patch has been tested in that regard, so the normal
case is fine.
Fixes: 87826df0ec ("btrfs: delalloc for page dirtied out-of-band in fixup worker")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we only return 0 or -EAGAIN from btrfs_writepage_cow_fixup, we
do not need this -EBUSY case.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For COW, btrfs expects pages dirty pages to have been through a few setup
steps. This includes reserving space for the new block allocations and marking
the range in the state tree for delayed allocation.
A few places outside btrfs will dirty pages directly, especially when unmapping
mmap'd pages. In order for these to properly go through COW, we run them
through a fixup worker to wait for stable pages, and do the delalloc prep.
87826df0ec added a window where the dirty pages were cleaned, but pending
more action from the fixup worker. We clear_page_dirty_for_io() before
we call into writepage, so the page is no longer dirty. The commit
changed it so now we leave the page clean between unlocking it here and
the fixup worker starting at some point in the future.
During this window, page migration can jump in and relocate the page. Once our
fixup work actually starts, it finds page->mapping is NULL and we end up
freeing the page without ever writing it.
This leads to crc errors and other exciting problems, since it screws up the
whole statemachine for waiting for ordered extents. The fix here is to keep
the page dirty while we're waiting for the fixup worker to get to work.
This is accomplished by returning -EAGAIN from btrfs_writepage_cow_fixup
if we queued the page up for fixup, which will cause the writepage
function to redirty the page.
Because we now expect the page to be dirty once it gets to the fixup
worker we must adjust the error cases to call clear_page_dirty_for_io()
on the page. That is the bulk of the patch, but it is not the fix, the
fix is the -EAGAIN from btrfs_writepage_cow_fixup. We cannot separate
these two changes out because the error conditions change with the new
expectations.
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
inc_block_group_ro does a calculation to see if we have enough room left
over if we mark this block group as read only in order to see if it's ok
to mark the block group as read only.
The problem is this calculation _only_ works for data, where our used is
always less than our total. For metadata we will overcommit, so this
will almost always fail for metadata.
Fix this by exporting btrfs_can_overcommit, and then see if we have
enough space to remove the remaining free space in the block group we
are trying to mark read only. If we do then we can mark this block
group as read only.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For some reason we've translated the do_chunk_alloc that goes into
btrfs_inc_block_group_ro to force in inc_block_group_ro, but these are
two different things.
force for inc_block_group_ro is used when we are forcing the block group
read only no matter what, for example when the underlying chunk is
marked read only. We need to not do the space check here as this block
group needs to be read only.
btrfs_inc_block_group_ro() has a do_chunk_alloc flag that indicates that
we need to pre-allocate a chunk before marking the block group read
only. This has nothing to do with forcing, and in fact we _always_ want
to do the space check in this case, so unconditionally pass false for
force in this case.
Then fixup inc_block_group_ro to honor force as it's expected and
documented to do.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Raviu reported that running his regular fs_trim segfaulted with the
following backtrace:
[ 237.525947] assertion failed: prev, in ../fs/btrfs/extent_io.c:1595
[ 237.525984] ------------[ cut here ]------------
[ 237.525985] kernel BUG at ../fs/btrfs/ctree.h:3117!
[ 237.525992] invalid opcode: 0000 [#1] SMP PTI
[ 237.525998] CPU: 4 PID: 4423 Comm: fstrim Tainted: G U OE 5.4.14-8-vanilla #1
[ 237.526001] Hardware name: ASUSTeK COMPUTER INC.
[ 237.526044] RIP: 0010:assfail.constprop.58+0x18/0x1a [btrfs]
[ 237.526079] Call Trace:
[ 237.526120] find_first_clear_extent_bit+0x13d/0x150 [btrfs]
[ 237.526148] btrfs_trim_fs+0x211/0x3f0 [btrfs]
[ 237.526184] btrfs_ioctl_fitrim+0x103/0x170 [btrfs]
[ 237.526219] btrfs_ioctl+0x129a/0x2ed0 [btrfs]
[ 237.526227] ? filemap_map_pages+0x190/0x3d0
[ 237.526232] ? do_filp_open+0xaf/0x110
[ 237.526238] ? _copy_to_user+0x22/0x30
[ 237.526242] ? cp_new_stat+0x150/0x180
[ 237.526247] ? do_vfs_ioctl+0xa4/0x640
[ 237.526278] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 237.526283] do_vfs_ioctl+0xa4/0x640
[ 237.526288] ? __do_sys_newfstat+0x3c/0x60
[ 237.526292] ksys_ioctl+0x70/0x80
[ 237.526297] __x64_sys_ioctl+0x16/0x20
[ 237.526303] do_syscall_64+0x5a/0x1c0
[ 237.526310] entry_SYSCALL_64_after_hwframe+0x49/0xbe
That was due to btrfs_fs_device::aloc_tree being empty. Initially I
thought this wasn't possible and as a percaution have put the assert in
find_first_clear_extent_bit. Turns out this is indeed possible and could
happen when a file system with SINGLE data/metadata profile has a 2nd
device added. Until balance is run or a new chunk is allocated on this
device it will be completely empty.
In this case find_first_clear_extent_bit should return the full range
[0, -1ULL] and let the caller handle this i.e for trim the end will be
capped at the size of actual device.
Link: https://lore.kernel.org/linux-btrfs/izW2WNyvy1dEDweBICizKnd2KDwDiDyY2EYQr4YCwk7pkuIpthx-JRn65MPBde00ND6V0_Lh8mW0kZwzDiLDv25pUYWxkskWNJnVP0kgdMA=@protonmail.com/
Fixes: 45bfcfc168 ("btrfs: Implement find_first_clear_extent_bit")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There exists a deadlock with range_cyclic that has existed forever. If
we loop around with a bio already built we could deadlock with a writer
who has the page locked that we're attempting to write but is waiting on
a page in our bio to be written out. The task traces are as follows
PID: 1329874 TASK: ffff889ebcdf3800 CPU: 33 COMMAND: "kworker/u113:5"
#0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
#1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
#2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
#3 [ffffc900297bb708] __lock_page at ffffffff811f145b
#4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
#5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
#6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
#7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
#8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
#9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd
PID: 2167901 TASK: ffff889dc6a59c00 CPU: 14 COMMAND:
"aio-dio-invalid"
#0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
#1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
#2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
#3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
#4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
#5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
#6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
#7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
#8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
#9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032
I used drgn to find the respective pages we were stuck on
page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874
As you can see the kworker is waiting for bit 0 (PG_locked) on index
7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
8148. aio-dio-invalid has 7680, and the kworker epd looks like the
following
crash> struct extent_page_data ffffc900297bbbb0
struct extent_page_data {
bio = 0xffff889f747ed830,
tree = 0xffff889eed6ba448,
extent_locked = 0,
sync_io = 0
}
Probably worth mentioning as well that it waits for writeback of the
page to complete while holding a lock on it (at prepare_pages()).
Using drgn I walked the bio pages looking for page
0xffffea00fbfc7500 which is the one we're waiting for writeback on
bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
for i in range(0, bio.bi_vcnt.value_()):
bv = bio.bi_io_vec[i]
if bv.bv_page.value_() == 0xffffea00fbfc7500:
print("FOUND IT")
which validated what I suspected.
The fix for this is simple, flush the epd before we loop back around to
the beginning of the file during writeout.
Fixes: b293f02e14 ("Btrfs: Add writepages support")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a race between adding and removing elements to the tree mod log
list and rbtree that can lead to use-after-free problems.
Consider the following example that explains how/why the problems happens:
1) Task A has mod log element with sequence number 200. It currently is
the only element in the mod log list;
2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
access the tree mod log. When it enters the function, it initializes
'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
before checking if there are other elements in the mod seq list.
Since the list it empty, 'min_seq' remains set to (u64)-1. Then it
unlocks the lock 'tree_mod_seq_lock';
3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
sequence number of 201;
4) Some other task, name it task C, modifies a btree and because there
elements in the mod seq list, it adds a tree mod elem to the tree
mod log rbtree. That node added to the mod log rbtree is assigned
a sequence number of 202;
5) Task B, which is doing fiemap and resolving indirect back references,
calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
calls tree_mod_log_search() - the search returns the mod log node
from the rbtree with sequence number 202, created by task C;
6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
the mod log rbtree and finds the node with sequence number 202. Since
202 is less than the previously computed 'min_seq', (u64)-1, it
removes the node and frees it;
7) Task B still has a pointer to the node with sequence number 202, and
it dereferences the pointer itself and through the call to
__tree_mod_log_rewind(), resulting in a use-after-free problem.
This issue can be triggered sporadically with the test case generic/561
from fstests, and it happens more frequently with a higher number of
duperemove processes. When it happens to me, it either freezes the VM or
it produces a trace like the following before crashing:
[ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
[ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 1245.321287] RIP: 0010:rb_next+0x16/0x50
[ 1245.321307] Code: ....
[ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
[ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
[ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
[ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
[ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
[ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
[ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
[ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
[ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1245.321706] Call Trace:
[ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs]
[ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs]
[ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs]
[ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322066] do_vfs_ioctl+0x45a/0x750
[ 1245.322081] ksys_ioctl+0x70/0x80
[ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1245.322113] __x64_sys_ioctl+0x16/0x20
[ 1245.322126] do_syscall_64+0x5c/0x280
[ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1245.322155] RIP: 0033:0x7fdee3942dd7
[ 1245.322177] Code: ....
[ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
[ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
[ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
[ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
[ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
[ 1245.322423] Modules linked in: ....
[ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---
Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
sequence number and iterates the rbtree while holding the lock
'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
lock, since it is now redundant.
Fixes: bd989ba359 ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unfortunately, GCC 9.1 is expected to be be released without support for
MPX. This means that there was only a relatively small window where
folks could have ever used MPX. It failed to gain wide adoption in the
industry, and Linux was the only mainstream OS to ever support it widely.
Support for the feature may also disappear on future processors.
This set completes the process that we started during the 5.4 merge window.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJeK1/pAAoJEGg1lTBwyZKwgC8QAIiVn1d7A9Uj/WpnpgfCChCZ
9XiV6Ak999qD9fbAcrgNfPjieaD4mtokocSRVJuRgJu5iLnIJCINlozLPe4yVl7P
7zebnxkLq0CIA8d56bEUoFlC0J+oWYlDVQePZzNQsSk5KHVGXVLpF6U4vDVzZeQy
cprgvdeY+ehB7G6IIo0MWTg5ylKYAsOAyVvK8NIGpKY2k6/YqCnsptnsVE7bvlHy
TrEOiUWLv+hh0bMkZdP1PwKQKEuMO/IZly0HtviFbMN7T4TB1spfg7ELoBucEq3T
s4EVbYRe+nIE4tuEAveaX3CgxJek8cY5MlticskdaKSEACBwabdOF55qsZy0u+WA
PYC4iUIXfbOH8OgieKWtGX4IuSkRYdQ2nP4BOpe4ZX4+zvU7zOCIyVSKRrwkX8cc
ADtWI5FAtB36KCgUuWnHGHNZpOxPTbTLBuBataFY4Q2uBNJEBJpscZ5H9ObtyGFU
ZjlzqFnM0nFNDKEI1EEtv9jLzgZTU1RQ46s7EFeSeEQ2/s9wJ3+s5sBlVbljsmus
o658bLOEaRWC/aF15dgmEXW9GAO6uifNdmbzGnRn7oEMYyFQPTWbZvi1zGz58QaG
Y6WTtigVtsSrHS4wpYd+p+n1W06VnB6J3BpBM4G1VQv1Vm0dNd1tUOfkqOzPjg7c
33Itmsz2LaW1mb67GlgZ
=g4cC
-----END PGP SIGNATURE-----
Merge tag 'mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-mpx
Pull x86 MPX removal from Dave Hansen:
"MPX requires recompiling applications, which requires compiler
support. Unfortunately, GCC 9.1 is expected to be be released without
support for MPX. This means that there was only a relatively small
window where folks could have ever used MPX. It failed to gain wide
adoption in the industry, and Linux was the only mainstream OS to ever
support it widely.
Support for the feature may also disappear on future processors.
This set completes the process that we started during the 5.4 merge
window when the MPX prctl()s were removed. XSAVE support is left in
place, which allows MPX-using KVM guests to continue to function"
* tag 'mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-mpx:
x86/mpx: remove MPX from arch/x86
mm: remove arch_bprm_mm_init() hook
x86/mpx: remove bounds exception code
x86/mpx: remove build infrastructure
x86/alternatives: add missing insn.h include
UBI:
- Fixes for memory leaks in error paths
- Fix for an logic error in a fastmap selfcheck
UBIFS:
- Fix for FS_IOC_SETFLAGS related to fscrypt flag
- Support for FS_ENCRYPT_FL
- Fix for a dead lock in bulk-read mode
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl4k1v8WHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wTGuD/9PU3qZJq1w5F499YAJb2qx2hLD
rseg7SZ4rzKwXI5m2g2EP63lgMQkJZC7u5YUv9c1m0gQnNvVXbTdhba3M1V437Kd
F0Ce2SeqbVRi4faGWyEH0TEEmFo2Nk7Uz3iJeaXxUY8BqVrvQZaBYk6GWtj+wWIl
Yc2ONKwzIF2BDTU5pFvP2yubHnTm00M4uP46MqAcaWoehd9L+xZhC0xiXyWFiAQE
g/ITk4vgD3bJRkI0nYNuHxFIgafIweGlyuNfMMpfh2Yqo5/tGnppPE+H+Maokb8V
6Gqmt9XR34ZGH8mOZsMFWxeK6e68DP2AkzL1EsiT2FlUc6hhCr+pOVEN17Y4eb//
IRpy7l8f9BkHvR72roaQusE1UjANC2sw2VtDi4TJO6WpFRx4n94//bf+IxO32os8
0AbIyzYCEo1Kql0wTxhqTZnHJr+zHjcFWOuzZ/95iH5wVQmb3hvlmmozL6ZPV9sG
cqyV1sEcFhkUKgCSTmbtoYBKfEJLj4j3WYvLoI7apLYN014ExNJY7PIVfIUtMfZQ
Sn0sN8+/gpQOOben67IQK9EdcvEhEkY4JdTHpuZEQmh3cS4HNnhzjM2A+n2sWTU8
lxkakeemcO2sV6ue/Vg6Fq4fPEnQtyOVVsSUHjA7hIy5JXwprDvft4gHkI2r40FI
fC9PuCjoUIIJtMyA7g==
=9tMk
-----END PGP SIGNATURE-----
Merge tag 'upstream-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI/UBIFS updates from Miquel Raynal:
"This pull request contains mostly fixes for UBI and UBIFS:
UBI:
- Fixes for memory leaks in error paths
- Fix for an logic error in a fastmap selfcheck
UBIFS:
- Fix for FS_IOC_SETFLAGS related to fscrypt flag
- Support for FS_ENCRYPT_FL
- Fix for a dead lock in bulk-read mode"
Sent on behalf of Richard Weinberger who is traveling.
* tag 'upstream-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubi: Fix an error pointer dereference in error handling code
ubifs: Fix memory leak from c->sup_node
ubifs: Fix ino_t format warnings in orphan_delete()
ubifs: Fix deadlock in concurrent bulk-read and writepage
ubifs: Fix wrong memory allocation
ubi: Free the normal volumes in error paths of ubi_attach_mtd_dev()
ubi: Check the presence of volume before call ubi_fastmap_destroy_checkmap()
ubifs: Add support for FS_ENCRYPT_FL
ubifs: Fix FS_IOC_SETFLAGS unexpectedly clearing encrypt flag
ubi: wl: Remove set but not used variable 'prev_e'
ubi: fastmap: Fix inverted logic in seen selfcheck
In this series, we've implemented transparent compression experimentally. It
supports LZO and LZ4, but will add more later as we investigate in the field
more. At this point, the feature doesn't expose compressed space to user
directly in order to guarantee potential data updates later to the space.
Instead, the main goal is to reduce data writes to flash disk as much as
possible, resulting in extending disk life time as well as relaxing IO
congestion. Alternatively, we're also considering to add ioctl() to reclaim
compressed space and show it to user after putting the immutable bit.
Enhancement:
- add compression support
- avoid unnecessary locks in quota ops
- harden power-cut scenario for zoned block devices
- use private bio_set to avoid IO congestion
- replace GC mutex with rwsem to serialize callers
Bug fix:
- fix dentry consistency and memory corruption in rename()'s error case
- fix wrong swap extent reports
- fix casefolding bugs
- change lock coverage to avoid deadlock
- avoid GFP_KERNEL under f2fs_lock_op
And, we've cleaned up sysfs entries to prepare no debugfs.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl4zInwACgkQQBSofoJI
UNL4Tg/+JBbVEFa3IUBGMdbjfgd/g0Jye++iMAYYGRWT6Ll/IGcHRV9NunITjgWU
mBZqdhI28kXeiGCcewB1ZvivjLx22X4n6yevHk2B5A6PNe9IDCHi0HOAhJJHkjPH
ecv2L+vX3Oj4y0+H7JNz9Fo3OIPJvMPtCQWlg1z+VQyhB85zNP7fZlvvIY4tG8yw
ERo0YNotLqwcF1BxCwNbAhV3aJGDxar+MI//yNzpiwDX7IptVpqestfcoIYc9kKL
4kSWRyEIGwcuIeyoM6aofGS9t4Z/Oe/gdqcxNr6l5n0Q/tMTpb4b/fJFGNr6RRx9
X9NQo8flkQb2DEIOP0DVpO2aPebzsVtzg3LZUOLA83+wCHfwINtHai2Dy2zDJ2my
BrVdou8fe2oxoaYihJg/Tz9cd0nA/6mZArtpYvDImAmX/xuGOvVk9zZkXNwc9nVX
EyVzy0vW4lA6gAIJ95aG6DDhJcAtVoy0MhBRWG92Pufxhn9aW24AV63ChWUf9DRx
/3RqpMAuQ3UC2gOxXKKnr54lsdhUIMn/y9sjROkVvQ1BvgRVxO8I4GFvMHMKv9pR
9KXiVRdzyYERyoL4+MF7A2zTnw+RHL4RVILa85p2ALGy2jQ1UuNUQi0BN9x2u1v8
S1ifNNX8SwOP+83ImFJhhn3HybpFQ45aLO3F7ZjKBQAnufJu+xw=
=zeoY
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this series, we've implemented transparent compression
experimentally. It supports LZO and LZ4, but will add more later as we
investigate in the field more.
At this point, the feature doesn't expose compressed space to user
directly in order to guarantee potential data updates later to the
space. Instead, the main goal is to reduce data writes to flash disk
as much as possible, resulting in extending disk life time as well as
relaxing IO congestion.
Alternatively, we're also considering to add ioctl() to reclaim
compressed space and show it to user after putting the immutable bit.
Enhancements:
- add compression support
- avoid unnecessary locks in quota ops
- harden power-cut scenario for zoned block devices
- use private bio_set to avoid IO congestion
- replace GC mutex with rwsem to serialize callers
Bug fixes:
- fix dentry consistency and memory corruption in rename()'s error case
- fix wrong swap extent reports
- fix casefolding bugs
- change lock coverage to avoid deadlock
- avoid GFP_KERNEL under f2fs_lock_op
And, we've cleaned up sysfs entries to prepare no debugfs"
* tag 'f2fs-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (31 commits)
f2fs: fix race conditions in ->d_compare() and ->d_hash()
f2fs: fix dcache lookup of !casefolded directories
f2fs: Add f2fs stats to sysfs
f2fs: delete duplicate information on sysfs nodes
f2fs: change to use rwsem for gc_mutex
f2fs: update f2fs document regarding to fsync_mode
f2fs: add a way to turn off ipu bio cache
f2fs: code cleanup for f2fs_statfs_project()
f2fs: fix miscounted block limit in f2fs_statfs_project()
f2fs: show the CP_PAUSE reason in checkpoint traces
f2fs: fix deadlock allocating bio_post_read_ctx from mempool
f2fs: remove unneeded check for error allocating bio_post_read_ctx
f2fs: convert inline_dir early before starting rename
f2fs: fix memleak of kobject
f2fs: fix to add swap extent correctly
f2fs: run fsck when getting bad inode during GC
f2fs: support data compression
f2fs: free sysfs kobject
f2fs: declare nested quota_sem and remove unnecessary sems
f2fs: don't put new_page twice in f2fs_rename
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl4zANcACgkQnJ2qBz9k
QNkyBQgA5/ppAhSC7Snc6BDm5PMiOJjN+FhYB1W9bHbkRlKfTetJxQTxbPpokZPq
A+99KuuNb3Uay2XWqan2pwZ90/9SIUZT8HnwNYwEHh33Nt76A1ybqqM0IAk+RWus
KjW7Jg/xCbbFKQX/estngjIlniUQ0WP7VTTwS/NPnvsIYNEpWJQvyIecm2DZhWGS
fmbn5x7PYnyveADd2Tf9z0iOKKI0ysLYksUlx+Ndg3fwPaWsI57tgUZL0Tzf552S
cCsRjQrcnhjuHTDEhH9HOGQlu45U4bBNkXKKoc1HUrp58UyTY2Rnn/QCM8jkTpzB
7NwoFyqPtWguJTFDsUH1rmqQisYoMQ==
=1v6t
-----END PGP SIGNATURE-----
Merge tag 'for_v5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF, quota, reiserfs, ext2 fixes and cleanups from Jan Kara:
"A few assorted fixes and cleanups for udf, quota, reiserfs, and ext2"
* tag 'for_v5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs/reiserfs: remove unused macros
fs/quota: remove unused macro
udf: Clarify meaning of f_files in udf_statfs
udf: Allow writing to 'Rewritable' partitions
udf: Disallow R/W mode for disk with Metadata partition
udf: Fix meaning of ENTITYID_FLAGS_* macros to be really bitwise-or flags
udf: Fix free space reporting for metadata and virtual partitions
udf: Update header files to UDF 2.60
udf: Move OSTA Identifier Suffix macros from ecma_167.h to osta_udf.h
udf: Fix spelling in EXT_NEXT_EXTENT_ALLOCDESCS
ext2: Adjust indentation in ext2_fill_super
quota: avoid time_t in v1_disk_dqblk definition
reiserfs: Fix spurious unlock in reiserfs_fill_super() error handling
reiserfs: Fix memory leak of journal device string
ext2: set proper errno in error case of ext2_fill_super()
- Get rid of compat_time_t
- Convert time_t to time64_t in quota code
- Remove shadow variables
- Prevent ATTR_ flag misuse in the attrmulti ioctls
- Clean out strlen in the attr code
- Remove some bogus asserts
- Fix various file size limit calculation errors with 32-bit kernels
- Pack xfs_dir2_sf_entry_t to fix build errors on arm oabi
- Fix nowait inode locking calls for directio aio reads.
- Fix memory corruption bugs when invalidating remote xattr value
buffers.
- Streamline remote attr value removal.
- Make the buffer log format size consistent across platforms.
- Strengthen buffer log format size checking.
- Fix messed up return types of xfs_inode_need_cow.
- Fix some unused variable warnings.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl4rGQcACgkQ+H93GTRK
tOv1uhAAo5F38h9ZpdoMK4TPJNaREmLtM+MAA03l3sk+BY5zGDo690WDaxa5cd8h
cJi8pqHpcnroJVptCkbdagx0BIPnHMVpw7CKFpBlCjNHpFeO1ULLpk/RcvwISRZy
OvEOmpn6Z/FQ2jTBjr45KXIiS5DAK3KRMrmGdvh6j73YVhQ1lgadY/6Jj2TRjzT1
FFjnYMGhufBsP0yV9Rg10AcIhNCIAr3RYmfjqFXMA8raEX33gdpsJ6GU6r5iyXRJ
B9dByoBlGCL15ZlxZOqEiN4omqqBLux5jrJr1tg0L0hbfu7UIxMcwXiSTnoaO2SZ
G7GjlEO3wszf3wGEeaaJd/tN58SQDvfz3yY9vZObrlTelN5iDrcziLHbWc+xkwPh
mykly36x8+dZ8kqgBxiF1WFgRheYNQIWnZ9wtCfWvsPtjrklIZNZqKDEAxXnSg0w
8lRXgeNfy6Jh78A817aPQibqkUAtxSk8RJhimkPWyKwUA3jSzgv1LsjgzoRCM8ik
/vmmRKmXviFrQwIgDKYQc/wHx58khKBi2Kna+rZUJrozrBU+P0EfG+AU6Rxjw1ob
cOOSNooI5kp+uTrmlaKKr9mua80m0hACgrFT9Fj+SJWgi313343VVNXubRgK/yV7
WxTcTOZadkQgqgeQSMzwFQuQPjQob9bNoQf5PK01sJ/4v96Zsmg=
=J5Em
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.6-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"In this release we clean out the last of the old 32-bit timestamp
code, fix a number of bugs and memory corruptions on 32-bit platforms,
and a refactoring of some of the extended attribute code.
I think I'll be back next week with some refactoring of how the XFS
buffer code returns error codes, however I prefer to hold onto that
for another week to let it soak a while longer
Summary:
- Get rid of compat_time_t
- Convert time_t to time64_t in quota code
- Remove shadow variables
- Prevent ATTR_ flag misuse in the attrmulti ioctls
- Clean out strlen in the attr code
- Remove some bogus asserts
- Fix various file size limit calculation errors with 32-bit kernels
- Pack xfs_dir2_sf_entry_t to fix build errors on arm oabi
- Fix nowait inode locking calls for directio aio reads
- Fix memory corruption bugs when invalidating remote xattr value
buffers
- Streamline remote attr value removal
- Make the buffer log format size consistent across platforms
- Strengthen buffer log format size checking
- Fix messed up return types of xfs_inode_need_cow
- Fix some unused variable warnings"
* tag 'xfs-5.6-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (24 commits)
xfs: remove unused variable 'done'
xfs: fix uninitialized variable in xfs_attr3_leaf_inactive
xfs: change return value of xfs_inode_need_cow to int
xfs: check log iovec size to make sure it's plausibly a buffer log format
xfs: make struct xfs_buf_log_format have a consistent size
xfs: complain if anyone tries to create a too-large buffer log item
xfs: clean up xfs_buf_item_get_format return value
xfs: streamline xfs_attr3_leaf_inactive
xfs: fix memory corruption during remote attr value buffer invalidation
xfs: refactor remote attr value buffer invalidation
xfs: fix IOCB_NOWAIT handling in xfs_file_dio_aio_read
xfs: Add __packed to xfs_dir2_sf_entry_t definition
xfs: fix s_maxbytes computation on 32-bit kernels
xfs: truncate should remove all blocks, not just to the end of the page cache
xfs: introduce XFS_MAX_FILEOFF
xfs: remove bogus assertion when online repair isn't enabled
xfs: Remove all strlen in all xfs_attr_* functions for attr names.
xfs: fix misuse of the XFS_ATTR_INCOMPLETE flag
xfs: also remove cached ACLs when removing the underlying attr
xfs: reject invalid flags combinations in XFS_IOC_ATTRMULTI_BY_HANDLE
...
handle inode locking in the read/write paths, and improving the
performance of Direct I/O overwrites. We also now record the error
code which caused the first and most recent ext4_error() report in the
superblock, to make it easier to root cause problems in production
systems. There are also many of the usual cleanups and miscellaneous
bug fixes.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl4yBf0ACgkQ8vlZVpUN
gaOK8Af9EsY1vyR/IvEosfXJoKIqnTXN1SLt94iAOUh6dNeVNcyv1SIzRGFrpmsg
uHY02EkcTl68b/AjV7ieDpOnOSmlP7NzynuVoar2hrjKX0MzpEu03Vv1a3dUQKuU
zcdchi83EwRjEvegsNK/VF3FFadk3TtC7x+7o6p840V6OAyp5CXhjm1akJqIJwvd
A4gTpruTSRIFg6Jj36HEDNRgSAeILed3wC7Ywtxt51tLK7Lp/qB1EuvYodMQRvGz
d0fRhbNHKepVYfxwpDUDMFnrqDPZ/SZGF73XBxP2zHd6SXy9dBLzGsRL+oj9tTUg
YQJtt4Yxjjg8Q1UrMyMRzQpi4S8dAQ==
=pVeR
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"This merge window, we've added some performance improvements in how we
handle inode locking in the read/write paths, and improving the
performance of Direct I/O overwrites.
We also now record the error code which caused the first and most
recent ext4_error() report in the superblock, to make it easier to
root cause problems in production systems.
There are also many of the usual cleanups and miscellaneous bug fixes"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (49 commits)
jbd2: clean __jbd2_journal_abort_hard() and __journal_abort_soft()
jbd2: make sure ESHUTDOWN to be recorded in the journal superblock
ext4, jbd2: ensure panic when aborting with zero errno
jbd2: switch to use jbd2_journal_abort() when failed to submit the commit record
jbd2_seq_info_next should increase position index
jbd2: remove pointless assertion in __journal_remove_journal_head
ext4,jbd2: fix comment and code style
jbd2: delete the duplicated words in the comments
ext4: fix extent_status trace points
ext4: fix symbolic enum printing in trace output
ext4: choose hardlimit when softlimit is larger than hardlimit in ext4_statfs_project()
ext4: fix race conditions in ->d_compare() and ->d_hash()
ext4: make dioread_nolock the default
ext4: fix extent_status fragmentation for plain files
jbd2: clear JBD2_ABORT flag before journal_reset to update log tail info when load journal
ext4: drop ext4_kvmalloc()
ext4: Add EXT4_IOC_FSGETXATTR/EXT4_IOC_FSSETXATTR to compat_ioctl
ext4: remove unused macro MPAGE_DA_EXTENT_TAIL
ext4: add missing braces in ext4_ext_drop_refs()
ext4: fix some nonstandard indentation in extents.c
...
RHBZ: 1795429
In recent DFS updates we have a new variable controlling how many times we will
retry to reconnect the share.
If DFS is not used, then this variable is initialized to 0 in:
static inline int
dfs_cache_get_nr_tgts(const struct dfs_cache_tgt_list *tl)
{
return tl ? tl->tl_numtgts : 0;
}
This means that in the reconnect loop in smb2_reconnect() we will immediately wrap retries to -1
and never actually get to pass this conditional:
if (--retries)
continue;
The effect is that we no longer reach the point where we fail the commands with -EHOSTDOWN
and basically the kernel threads are virtually hung and unkillable.
Fixes: a3a53b7603 (cifs: Add support for failover in smb2_reconnect())
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
CC: Stable <stable@vger.kernel.org>
With nesting of anonymous unions and structs it's hard to
review layout changes. It's better to ask the compiler
for these things.
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It can be hard to know exactly what is registered with the ring.
Especially for credentials, it'd be handy to be able to see which
ones are registered, what personalities they have, and what the ID
of each of them is.
This adds support for showing information registered in the ring from
the fdinfo of the io_uring fd. Here's an example from a test case that
registers 4 files (two of them sparse), 4 buffers, and 2 personalities:
pos: 0
flags: 02000002
mnt_id: 14
UserFiles: 4
0: file-no-1
1: file-no-2
2: <none>
3: <none>
UserBufs: 4
0: 0x563817c46000/128
1: 0x563817c47000/256
2: 0x563817c48000/512
3: 0x563817c49000/1024
Personalities:
1
Uid: 0 0 0 0
Gid: 0 0 0 0
Groups: 0
CapEff: 0000003fffffffff
2
Uid: 0 0 0 0
Gid: 0 0 0 0
Groups: 0
CapEff: 0000003fffffffff
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXjFo8wAKCRCRxhvAZXjc
omaGAQDVwCHQekqxp2eC8EJH4Pkt+Bn1BLrA25stlTo93YBPHgEAsPVUCRNcrZAl
VncYmxCfpt3Yu0S/MTVXu5xrRiIXPQk=
=uqTN
-----END PGP SIGNATURE-----
Merge tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull thread management updates from Christian Brauner:
"Sargun Dhillon over the last cycle has worked on the pidfd_getfd()
syscall.
This syscall allows for the retrieval of file descriptors of a process
based on its pidfd. A task needs to have ptrace_may_access()
permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and
Andy) on the target.
One of the main use-cases is in combination with seccomp's user
notification feature. As a reminder, seccomp's user notification
feature was made available in v5.0. It allows a task to retrieve a
file descriptor for its seccomp filter. The file descriptor is usually
handed of to a more privileged supervising process. The supervisor can
then listen for syscall events caught by the seccomp filter of the
supervisee and perform actions in lieu of the supervisee, usually
emulating syscalls. pidfd_getfd() is needed to expand its uses.
There are currently two major users that wait on pidfd_getfd() and one
future user:
- Netflix, Sargun said, is working on a service mesh where users
should be able to connect to a dns-based VIP. When a user connects
to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be
redirected to an envoy process. This service mesh uses seccomp user
notifications and pidfd to intercept all connect calls and instead
of connecting them to 1.2.3.4:80 connects them to e.g.
127.0.0.1:8080.
- LXD uses the seccomp notifier heavily to intercept and emulate
mknod() and mount() syscalls for unprivileged containers/processes.
With pidfd_getfd() more uses-cases e.g. bridging socket connections
will be possible.
- The patchset has also seen some interest from the browser corner.
Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a
broker process. In the future glibc will start blocking all signals
during dlopen() rendering this type of sandbox impossible. Hence,
in the future Firefox will switch to a seccomp-user-nofication
based sandbox which also makes use of file descriptor retrieval.
The thread for this can be found at
https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html
With pidfd_getfd() it is e.g. possible to bridge socket connections
for the supervisee (binding to a privileged port) and taking actions
on file descriptors on behalf of the supervisee in general.
Sargun's first version was using an ioctl on pidfds but various people
pushed for it to be a proper syscall which he duely implemented as
well over various review cycles. Selftests are of course included.
I've also added instructions how to deal with merge conflicts below.
There's also a small fix coming from the kernel mentee project to
correctly annotate struct sighand_struct with __rcu to fix various
sparse warnings. We've received a few more such fixes and even though
they are mostly trivial I've decided to postpone them until after -rc1
since they came in rather late and I don't want to risk introducing
build warnings.
Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is
needed to avoid allocation recursions triggerable by storage drivers
that have userspace parts that run in the IO path (e.g. dm-multipath,
iscsi, etc). These allocation recursions deadlock the device.
The new prctl() allows such privileged userspace components to avoid
allocation recursions by setting the PF_MEMALLOC_NOIO and
PF_LESS_THROTTLE flags. The patch carries the necessary acks from the
relevant maintainers and is routed here as part of prctl()
thread-management."
* tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
sched.h: Annotate sighand_struct with __rcu
test: Add test for pidfd getfd
arch: wire up pidfd_getfd syscall
pid: Implement pidfd_getfd syscall
vfs, fdtable: Add fget_task helper
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4yEegQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpn5ZD/4/WlXs2cUDgg1C65bzZFO4qvevm+VkXmsk
GbyrnFstRekvSH01/ZQxlyDVKS8Wux0XIJ6OArCh1047LvL1bEE5dvOW5iIiwa/r
grjQuwFAzIPsE2fgcAO17BKIUzq2Z96+hwDzH7dw0i32yBuLvNmY/1SxcCHKfPut
uzGyp7t3/2dIHbpWILRndMYe0O9j9ubmOMvKyKTwy723yDEafsUoqu2mlpigzTq4
2i+DbYBIAd8qmLqG/m3e+vOt9xodJ2Q0hlO+v6DcP2SKXU64Hb/N98HadR//aWP9
41DBXqs+dvDBcu3Jxb80PFUTiOQZECJivkns5cNcjuSXmNkOuQhDQR5K372AHmR9
m6e6FSBxwej8HselAZCI6yu9uBKd0i+MM4FnFs/O73QGYx2ayXsEXp/Jad9xiYgW
pC5XJTSqJQhPE0AYYEOzHPPcBLBcpvXHkvmGKdjkNb8OLhhgh2S/YG0DNC+8ABXr
j1uIe/n3kJEEmOanUyiitGyLmDq+mXd7aCVKJL/J0KiGD8Gkc1avAZ1ZrTQgjujY
FqqBFawO8gv3g0L4WMI8JI+HJGMnA488obet6UKm9+l/Z/urEpXzDAKf/W/vnx2B
LD0FSA0bCh1tyO6JU+avFwHlwShtV7/rx/OhrmCK7CCYKtZCA2IEctxyr8U+PBIv
DtwIMTYTsA==
=ZZUI
-----END PGP SIGNATURE-----
Merge tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
- Support for various new opcodes (fallocate, openat, close, statx,
fadvise, madvise, openat2, non-vectored read/write, send/recv, and
epoll_ctl)
- Faster ring quiesce for fileset updates
- Optimizations for overflow condition checking
- Support for max-sized clamping
- Support for probing what opcodes are supported
- Support for io-wq backend sharing between "sibling" rings
- Support for registering personalities
- Lots of little fixes and improvements
* tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block: (64 commits)
io_uring: add support for epoll_ctl(2)
eventpoll: support non-blocking do_epoll_ctl() calls
eventpoll: abstract out epoll_ctl() handler
io_uring: fix linked command file table usage
io_uring: support using a registered personality for commands
io_uring: allow registering credentials
io_uring: add io-wq workqueue sharing
io-wq: allow grabbing existing io-wq
io_uring/io-wq: don't use static creds/mm assignments
io-wq: make the io_wq ref counted
io_uring: fix refcounting with batched allocations at OOM
io_uring: add comment for drain_next
io_uring: don't attempt to copy iovec for READ/WRITE
io_uring: honor IOSQE_ASYNC for linked reqs
io_uring: prep req when do IOSQE_ASYNC
io_uring: use labeled array init in io_op_defs
io_uring: optimise sqe-to-req flags translation
io_uring: remove REQ_F_IO_DRAINED
io_uring: file switch work needs to get flushed on exit
io_uring: hide uring_fd in ctx
...
This series is slightly unusual because it includes Arnd's compat
ioctl tree here:
1c46a2cf2d Merge tag 'block-ioctl-cleanup-5.6' into 5.6/scsi-queue
Excluding Arnd's changes, this is mostly an update of the usual
drivers: megaraid_sas, mpt3sas, qla2xxx, ufs, lpfc, hisi_sas. There
are a couple of core and base updates around error propagation and
atomicity in the attribute container base we use for the SCSI
transport classes. The rest is minor changes and updates.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXjHQJyYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishZZ8AQC02N+v
iUnTl1YxGPjIWBbnHuUxN2Qbb9D3C6gAT1LkigEArlk163K3A1XEQHF/VNCdAz/f
01XYTd3p1VHuegIBHlk=
=Cn52
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This series is slightly unusual because it includes Arnd's compat
ioctl tree here:
1c46a2cf2d Merge tag 'block-ioctl-cleanup-5.6' into 5.6/scsi-queue
Excluding Arnd's changes, this is mostly an update of the usual
drivers: megaraid_sas, mpt3sas, qla2xxx, ufs, lpfc, hisi_sas.
There are a couple of core and base updates around error propagation
and atomicity in the attribute container base we use for the SCSI
transport classes.
The rest is minor changes and updates"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (149 commits)
scsi: hisi_sas: Rename hisi_sas_cq.pci_irq_mask
scsi: hisi_sas: Add prints for v3 hw interrupt converge and automatic affinity
scsi: hisi_sas: Modify the file permissions of trigger_dump to write only
scsi: hisi_sas: Replace magic number when handle channel interrupt
scsi: hisi_sas: replace spin_lock_irqsave/spin_unlock_restore with spin_lock/spin_unlock
scsi: hisi_sas: use threaded irq to process CQ interrupts
scsi: ufs: Use UFS device indicated maximum LU number
scsi: ufs: Add max_lu_supported in struct ufs_dev_info
scsi: ufs: Delete is_init_prefetch from struct ufs_hba
scsi: ufs: Inline two functions into their callers
scsi: ufs: Move ufshcd_get_max_pwr_mode() to ufshcd_device_params_init()
scsi: ufs: Split ufshcd_probe_hba() based on its called flow
scsi: ufs: Delete struct ufs_dev_desc
scsi: ufs: Fix ufshcd_probe_hba() reture value in case ufshcd_scsi_add_wlus() fails
scsi: ufs-mediatek: enable low-power mode for hibern8 state
scsi: ufs: export some functions for vendor usage
scsi: ufs-mediatek: add dbg_register_dump implementation
scsi: qla2xxx: Fix a NULL pointer dereference in an error path
scsi: qla1280: Make checking for 64bit support consistent
scsi: megaraid_sas: Update driver version to 07.713.01.00-rc1
...
This kunit update for Linux 5.6-rc1 consists of:
-- Support for building kunit as a module from Alan Maguire
-- AppArmor KUnit tests for policy unpack from Mike Salvatore
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAl4xz/wACgkQCwJExA0N
Qxyg2A//X0bnhN82oCchkTRW3GyGi5wTR2wGhoNzMZD0XUtCvn+4BlCSP20ttYdT
beiLCiewcuEdvXRyEV9Kikvet/67ovbjA/ce6ZrR7TlIHo8esKcy19/nu1OTvtI1
8eji1q7NSEV9iswz1ZoBAw+MTDHZfOI9qYY2UPcwjy7xWN84z2X1w+8UQ3EamOKd
6BfbohsYuuTTHhA2k1aUzvQcHqNz0YdH4yvNQpdunJXLUI04TeGZA6Ug66u6kWEd
1f5SSAu6r1vnU7DADrb1QwEDuIwL4KBuaMg2Rj5GLxTNp3wxmW9M2Dit+iN7+vNH
TS31kZW6KgxC5XuGVPENJaWlDX5Hm+5W8uiRZLNXsxDy927u53RzwrSZw/FbdbB1
HuPZZCzE1soWHdPIQz44HCCAg9XddypYlC1o4IYL1JkJknqG12ky4xgM8GRNCZAB
oUW3Ax3Lcr0EJALO/kFd/uEbl79PdmDk8uPMU1jtLyx5cs70yC3fsT2GB+DbP802
i/FxTtrOMGjU2OWcYfQcXapvZdgImf9nPsSZe3FJXjHfytNRbVZOZ2rHAMh03Keu
EBthDs6ejm6OUSGUXjngE9NaQKXsNSQ1Qor+6FrGnT4IxUMzWenudqHH7/dgF7Fr
fHlZGBilKMc/EYKb/6hj4kvEChrSIXj6TFknmI28I/epPiOr2gU=
=AFO4
-----END PGP SIGNATURE-----
Merge tag 'linux-kselftest-5.6-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kselftest kunit updates from Shuah Khan:
"This kunit update consists of:
- Support for building kunit as a module from Alan Maguire
- AppArmor KUnit tests for policy unpack from Mike Salvatore"
* tag 'linux-kselftest-5.6-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: building kunit as a module breaks allmodconfig
kunit: update documentation to describe module-based build
kunit: allow kunit to be loaded as a module
kunit: remove timeout dependence on sysctl_hung_task_timeout_seconds
kunit: allow kunit tests to be loaded as a module
kunit: hide unexported try-catch interface in try-catch-impl.h
kunit: move string-stream.h to lib/kunit
apparmor: add AppArmor KUnit tests for policy unpack
These are updates to device drivers and file systems that for some reason
or another were not included in the kernel in the previous y2038 series.
I've gone through all users of time_t again to make sure the kernel is
in a long-term maintainable state, replacing all remaining references
to time_t with safe alternatives.
Some related parts of the series were picked up into the nfsd, xfs,
alsa and v4l2 trees. A final set of patches in linux-mm removes the now
unused time_t/timeval/timespec types and helper functions after all five
branches are merged for linux-5.6, ensuring that no new users get merged.
As a result, linux-5.6, or my backport of the patches to 5.4 [1], should
be the first release that can serve as a base for a 32-bit system designed
to run beyond year 2038, with a few remaining caveats:
- All user space must be compiled with a 64-bit time_t, which will be
supported in the coming musl-1.2 and glibc-2.32 releases, along with
installed kernel headers from linux-5.6 or higher.
- Applications that use the system call interfaces directly need to be
ported to use the time64 syscalls added in linux-5.1 in place of the
existing system calls. This impacts most users of futex() and seccomp()
as well as programming languages that have their own runtime environment
not based on libc.
- Applications that use a private copy of kernel uapi header files or
their contents may need to update to the linux-5.6 version, in
particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h,
linux/elfcore.h, linux/sockios.h, linux/timex.h and linux/can/bcm.h.
- A few remaining interfaces cannot be changed to pass a 64-bit time_t
in a compatible way, so they must be configured to use CLOCK_MONOTONIC
times or (with a y2106 problem) unsigned 32-bit timestamps. Most
importantly this impacts all users of 'struct input_event'.
- All y2038 problems that are present on 64-bit machines also apply to
32-bit machines. In particular this affects file systems with on-disk
timestamps using signed 32-bit seconds: ext4 with ext3-style small
inodes, ext2, xfs (to be fixed soon) and ufs.
Changes since v1 [2]:
- Add Acks I received
- Rebase to v5.5-rc1, dropping patches that got merged already
- Add NFS, XFS and the final three patches from another series
- Rewrite etnaviv patches
- Add one late revert to avoid an etnaviv regression
[1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame
[2] https://lore.kernel.org/lkml/20191108213257.3097633-1-arnd@arndb.de/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJeMYy3AAoJEGCrR//JCVInEGwP/0R+S+ok7vw9OdLVT0lFl07D
IcVabgOWf24imN7m7L7Mlt3nDfxIT4tMpiAXq7eMO3spcyViG18O2LXdSQ4/7QBp
+BlhoMjOP9w34Jyd7mnkFr4vqQALvfIqkS8rFObDtDub2Rfj9PC36MRMIu8BPXlv
RK8bigwJeH/DV38yc5/JeUcD+WuewYLsK9XPWN+4yB4vgGsNU3ZQQ6nnzbR3hMsN
DN8WZ68Y7IBs0Kyxkf+s2zmRXtCa2RiFg/2TUsk5olVAJVaenvte69hq5RSbg1vW
vLi6K8cBoPWL59nqCzcNE+TUhSUg3LOj/a/KWyl76yovz7AlJaNjssOf8ZjHw6sL
MhQqz3hXTxiJDS2Jvbf1yojiYGlzrq/gqcRFGe9jPcZdieMc4/yZCx60G/Exa5Pu
YdMcqMyDWPFyUAFQNWEF59HPheOdj6tb1KpJ6bwgCo3P7QqhLrU4z9w3Py4/ZfBO
4sWcWteSsD6MN/ADJ2WQ56nNxzM2AvkeVJKcF6FCkdngXX9T0GExmZz7SqB5Du99
9lNjIiD5E+LBa/Swo/7n49aYa8x06V1pmHYTZVh9Wkl+CZiO21umezQFrWsfaMTp
xt3c6pFdMG5xNMGpreTAXOmf2R+T6O8IO2qQq/TYjzqOLH7QC830P7avkmml+cK1
LjOBE2TfSeO8Ru1dXV4t
=wx0A
-----END PGP SIGNATURE-----
Merge tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull y2038 updates from Arnd Bergmann:
"Core, driver and file system changes
These are updates to device drivers and file systems that for some
reason or another were not included in the kernel in the previous
y2038 series.
I've gone through all users of time_t again to make sure the kernel is
in a long-term maintainable state, replacing all remaining references
to time_t with safe alternatives.
Some related parts of the series were picked up into the nfsd, xfs,
alsa and v4l2 trees. A final set of patches in linux-mm removes the
now unused time_t/timeval/timespec types and helper functions after
all five branches are merged for linux-5.6, ensuring that no new users
get merged.
As a result, linux-5.6, or my backport of the patches to 5.4 [1],
should be the first release that can serve as a base for a 32-bit
system designed to run beyond year 2038, with a few remaining caveats:
- All user space must be compiled with a 64-bit time_t, which will be
supported in the coming musl-1.2 and glibc-2.32 releases, along
with installed kernel headers from linux-5.6 or higher.
- Applications that use the system call interfaces directly need to
be ported to use the time64 syscalls added in linux-5.1 in place of
the existing system calls. This impacts most users of futex() and
seccomp() as well as programming languages that have their own
runtime environment not based on libc.
- Applications that use a private copy of kernel uapi header files or
their contents may need to update to the linux-5.6 version, in
particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h,
linux/elfcore.h, linux/sockios.h, linux/timex.h and
linux/can/bcm.h.
- A few remaining interfaces cannot be changed to pass a 64-bit
time_t in a compatible way, so they must be configured to use
CLOCK_MONOTONIC times or (with a y2106 problem) unsigned 32-bit
timestamps. Most importantly this impacts all users of 'struct
input_event'.
- All y2038 problems that are present on 64-bit machines also apply
to 32-bit machines. In particular this affects file systems with
on-disk timestamps using signed 32-bit seconds: ext4 with
ext3-style small inodes, ext2, xfs (to be fixed soon) and ufs"
[1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame
* tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (21 commits)
Revert "drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC"
y2038: sh: remove timeval/timespec usage from headers
y2038: sparc: remove use of struct timex
y2038: rename itimerval to __kernel_old_itimerval
y2038: remove obsolete jiffies conversion functions
nfs: fscache: use timespec64 in inode auxdata
nfs: fix timstamp debug prints
nfs: use time64_t internally
sunrpc: convert to time64_t for expiry
drm/etnaviv: avoid deprecated timespec
drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC
drm/msm: avoid using 'timespec'
hfs/hfsplus: use 64-bit inode timestamps
hostfs: pass 64-bit timestamps to/from user space
packet: clarify timestamp overflow
tsacct: add 64-bit btime field
acct: stop using get_seconds()
um: ubd: use 64-bit time_t where possible
xtensa: ISS: avoid struct timeval
dlm: use SO_SNDTIMEO_NEW instead of SO_SNDTIMEO_OLD
...
Also make it available outside of epoll, along with the helper that
decides if we need to copy the passed in epoll_event.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're not consistent in how the file table is grabbed and assigned if we
have a command linked that requires the use of it.
Add ->file_table to the io_op_defs[] array, and use that to determine
when to grab the table instead of having the handlers set it if they
need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES
flag. We always initialize work->files, so io-wq can just check for
that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- fix an out-of-bound read access introduced in v5.3,
which could rarely cause data corruption;
- various cleanup patches.
-----BEGIN PGP SIGNATURE-----
iIwEABYIADQWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXjCB9xYcZ2FveGlhbmcy
NUBodWF3ZWkuY29tAAoJEDk3MdwfteYEW+0BALDobAWvVw7Bxkz0tkFkmoZKF4eG
Otdzi/EhLq6baeyGAQCtSbTblu9/hAL53p6++RsTXazzzWDUyZZKtZj8MkR5Ag==
=c1Fs
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"A regression fix, several cleanups and (maybe) plus an upcoming new
mount api convert patch as a part of vfs update are considered
available for this cycle.
All commits have been in linux-next and tested with no smoke out.
Summary:
- fix an out-of-bound read access introduced in v5.3, which could
rarely cause data corruption
- various cleanup patches"
* tag 'erofs-for-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: clean up z_erofs_submit_queue()
erofs: fold in postsubmit_is_all_bypassed()
erofs: fix out-of-bound read for shifted uncompressed block
erofs: remove void tagging/untagging of workgroup pointers
erofs: remove unused tag argument while registering a workgroup
erofs: remove unused tag argument while finding a workgroup
erofs: correct indentation of an assigned structure inside a function
Pull openat2 support from Al Viro:
"This is the openat2() series from Aleksa Sarai.
I'm afraid that the rest of namei stuff will have to wait - it got
zero review the last time I'd posted #work.namei, and there had been a
leak in the posted series I'd caught only last weekend. I was going to
repost it on Monday, but the window opened and the odds of getting any
review during that... Oh, well.
Anyway, openat2 part should be ready; that _did_ get sane amount of
review and public testing, so here it comes"
From Aleksa's description of the series:
"For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown
flags are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road
to being added to openat(2).
Furthermore, the need for some sort of control over VFS's path
resolution (to avoid malicious paths resulting in inadvertent
breakouts) has been a very long-standing desire of many userspace
applications.
This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
(which was a variant of David Drysdale's O_BENEATH patchset[4] which
was a spin-off of the Capsicum project[5]) with a few additions and
changes made based on the previous discussion within [6] as well as
others I felt were useful.
In line with the conclusions of the original discussion of
AT_NO_JUMPS, the flag has been split up into separate flags. However,
instead of being an openat(2) flag it is provided through a new
syscall openat2(2) which provides several other improvements to the
openat(2) interface (see the patch description for more details). The
following new LOOKUP_* flags are added:
LOOKUP_NO_XDEV:
Blocks all mountpoint crossings (upwards, downwards, or through
absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is
also blocked (though magic-link jumps on the same vfsmount are
permitted).
LOOKUP_NO_MAGICLINKS:
Blocks resolution through /proc/$pid/fd-style links. This is done
by blocking the usage of nd_jump_link() during resolution in a
filesystem. The term "magic-links" is used to match with the only
reference to these links in Documentation/, but I'm happy to change
the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
LOOKUP_BENEATH:
Disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed.
Conceptually this flag is to ensure you "stay below" a certain
point in the filesystem tree -- but this requires some additional
to protect against various races that would allow escape using
"..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done
as in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
LOOKUP_NO_SYMLINKS:
Does what it says on the tin. No symlink resolution is allowed at
all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
can still be used with NOFOLLOW to open an fd for the symlink as
long as no parent path had a symlink component.
LOOKUP_IN_ROOT:
This is an extension of LOOKUP_BENEATH that, rather than blocking
attempts to move past the root, forces all such movements to be
scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that
chroot(2) is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to
cross magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container.
There is a long list of CVEs that could have bene mitigated by
having RESOLVE_THIS_ROOT (such as CVE-2017-1002101,
CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a
few).
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution.
It features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Future work would include implementing things like
RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
programs to be sure they don't hit DoSes though stale NFS handles)"
* 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Documentation: path-lookup: include new LOOKUP flags
selftests: add openat2(2) selftests
open: introduce openat2(2) syscall
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
namei: LOOKUP_NO_XDEV: block mountpoint crossing
namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
namei: LOOKUP_NO_SYMLINKS: block symlink resolution
namei: allow set_root() to produce errors
namei: allow nd_jump_link() to produce errors
nsfs: clean-up ns_get_path() signature to return int
namei: only return -ECHILD from follow_dotdot_rcu()
Here is a small set of changes for 5.6-rc1 for the driver core and some
firmware subsystem changes.
Included in here are:
- device.h splitup like you asked for months ago
- devtmpfs minor cleanups
- firmware core minor changes
- debugfs fix for lockdown mode
- kernfs cleanup fix
- cpu topology minor fix
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXjFL1g8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymzfwCgpHAaFeiF1u2XS1pvfySssOUeQ8sAoMXWWgr4
BHms+ey3TaUnzm38dxk8
=Xv9r
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is a small set of changes for 5.6-rc1 for the driver core and
some firmware subsystem changes.
Included in here are:
- device.h splitup like you asked for months ago
- devtmpfs minor cleanups
- firmware core minor changes
- debugfs fix for lockdown mode
- kernfs cleanup fix
- cpu topology minor fix
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (22 commits)
firmware: Rename FW_OPT_NOFALLBACK to FW_OPT_NOFALLBACK_SYSFS
devtmpfs: factor out common tail of devtmpfs_{create,delete}_node
devtmpfs: initify a bit
devtmpfs: simplify initialization of mount_dev
devtmpfs: factor out setup part of devtmpfsd()
devtmpfs: fix theoretical stale pointer deref in devtmpfsd()
driver core: platform: fix u32 greater or equal to zero comparison
cpu-topology: Don't error on more than CONFIG_NR_CPUS CPUs in device tree
debugfs: Return -EPERM when locked down
driver core: Print device when resources present in really_probe()
driver core: Fix test_async_driver_probe if NUMA is disabled
driver core: platform: Prevent resouce overflow from causing infinite loops
fs/kernfs/dir.c: Clean code by removing always true condition
component: do not dereference opaque pointer in debugfs
drivers/component: remove modular code
debugfs: Fix warnings when building documentation
device.h: move 'struct driver' stuff out to device/driver.h
device.h: move 'struct class' stuff out to device/class.h
device.h: move 'struct bus' stuff out to device/bus.h
device.h: move dev_printk()-like functions to dev_printk.h
...
For personalities previously registered via IORING_REGISTER_PERSONALITY,
allow any command to select them. This is done through setting
sqe->personality to the id returned from registration, and then flagging
sqe->flags with IOSQE_PERSONALITY.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If an application wants to use a ring with different kinds of
credentials, it can register them upfront. We don't lookup credentials,
the credentials of the task calling IORING_REGISTER_PERSONALITY is used.
An 'id' is returned for the application to use in subsequent personality
support.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If IORING_SETUP_ATTACH_WQ is set, it expects wq_fd in io_uring_params to
be a valid io_uring fd io-wq of which will be shared with the newly
created io_uring instance. If the flag is set but it can't share io-wq,
it fails.
This allows creation of "sibling" io_urings, where we prefer to keep the
SQ/CQ private, but want to share the async backend to minimize the amount
of overhead associated with having multiple rings that belong to the same
backend.
Reported-by: Jens Axboe <axboe@kernel.dk>
Reported-by: Daurnimator <quae@daurnimator.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Export a helper to attach to an existing io-wq, rather than setting up
a new one. This is doable now that we have reference counted io_wq's.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently setup the io_wq with a static set of mm and creds. Even for
a single-use io-wq per io_uring, this is suboptimal as we have may have
multiple enters of the ring. For sharing the io-wq backend, it doesn't
work at all.
Switch to passing in the creds and mm when the work item is setup. This
means that async work is no longer deferred to the io_uring mm and creds,
it is done with the current mm and creds.
Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know
they can rely on the current personality (mm and creds) being the same
for direct issue and async issue.
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull crypto updates from Herbert Xu:
"API:
- Removed CRYPTO_TFM_RES flags
- Extended spawn grabbing to all algorithm types
- Moved hash descsize verification into API code
Algorithms:
- Fixed recursive pcrypt dead-lock
- Added new 32 and 64-bit generic versions of poly1305
- Added cryptogams implementation of x86/poly1305
Drivers:
- Added support for i.MX8M Mini in caam
- Added support for i.MX8M Nano in caam
- Added support for i.MX8M Plus in caam
- Added support for A33 variant of SS in sun4i-ss
- Added TEE support for Raven Ridge in ccp
- Added in-kernel API to submit TEE commands in ccp
- Added AMD-TEE driver
- Added support for BCM2711 in iproc-rng200
- Added support for AES256-GCM based ciphers for chtls
- Added aead support on SEC2 in hisilicon"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (244 commits)
crypto: arm/chacha - fix build failured when kernel mode NEON is disabled
crypto: caam - add support for i.MX8M Plus
crypto: x86/poly1305 - emit does base conversion itself
crypto: hisilicon - fix spelling mistake "disgest" -> "digest"
crypto: chacha20poly1305 - add back missing test vectors and test chunking
crypto: x86/poly1305 - fix .gitignore typo
tee: fix memory allocation failure checks on drv_data and amdtee
crypto: ccree - erase unneeded inline funcs
crypto: ccree - make cc_pm_put_suspend() void
crypto: ccree - split overloaded usage of irq field
crypto: ccree - fix PM race condition
crypto: ccree - fix FDE descriptor sequence
crypto: ccree - cc_do_send_request() is void func
crypto: ccree - fix pm wrongful error reporting
crypto: ccree - turn errors to debug msgs
crypto: ccree - fix AEAD decrypt auth fail
crypto: ccree - fix typo in comment
crypto: ccree - fix typos in error msgs
crypto: atmel-{aes,sha,tdes} - Retire crypto_platform_data
crypto: x86/sha - Eliminate casts on asm implementations
...
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl4wd8IACgkQiiy9cAdy
T1FFmQwAuh4efFQuBjV3/0CRHC7ezqW88/zVuRqK5m1xC+VDW2tcV24mcsyVOOMW
wgOdr0NXWRxrPHoJfIGLPFwq7o59PFWR5Cdwjpu/th72W9zftAPZdQCjCO3ssO3u
tGmtavN/rk61A3ALjqtjC6lRx+A1wxYoxgBNlAXwyCktWShQBWHFsRnFq6w7e76H
lvaXkk0ubh7LDnCvwqpMupkD3n//Y9UAHQ1D1oIC29i6F5M8SmGs36e1ptuSCssq
q4Xb48uc+QXeD8NvA0Y9S/Su3NZEy/iqJpFok75tHzbr+ypAthhvxfMW3SnOh2NI
t4zWUSIcNU7b2v+QjrWqJlN8vpYB8R4Awti0FkI1LvjjZwZ1lTcW2o40pAwgQl3o
BoFqlksbOf+457tmcuQ7/psqRxAvgA3z7rSFJbYvKFYR2Go0E6oxpi1mXjph5Bw3
K5s3ieEbEC2mbAa/5Qf4HHit4jL4tGxijo5MhUB9uMymCGyxYIKUJcVnZ/LnTgpC
WZTjeNqT
=7t46
-----END PGP SIGNATURE-----
Merge tag '5.6-smb3-fixes-and-dfs-and-readdir-improvements' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
"Various SMB3/CIFS fixes including four for stable.
- Improvement to fallocate (enables 3 additional xfstests)
- Fix for file creation when mounting with modefromsid
- Add ability to backup/restore dos attributes and creation time
- DFS failover and reconnect fixes
- performance optimization for readir
Note that due to the upcoming SMB3 Test Event (at SNIA SDC next week)
there will likely be more changesets near the end of the merge window
(since we will be testing heavily next week, I held off on some
patches and I expect some additional multichannel patches as well as
patches to enable some additional xfstests)"
* tag '5.6-smb3-fixes-and-dfs-and-readdir-improvements' of git://git.samba.org/sfrench/cifs-2.6: (24 commits)
CIFS: Fix task struct use-after-free on reconnect
cifs: use PTR_ERR_OR_ZERO() to simplify code
cifs: add support for fallocate mode 0 for non-sparse files
cifs: fix NULL dereference in match_prepath
smb3: fix default permissions on new files when mounting with modefromsid
CIFS: Add support for setting owner info, dos attributes, and create time
cifs: remove set but not used variable 'server'
cifs: Fix memory allocation in __smb2_handle_cancelled_cmd()
cifs: Fix mount options set in automount
cifs: fix unitialized variable poential problem with network I/O cache lock patch
cifs: Fix return value in __update_cache_entry
cifs: Avoid doing network I/O while holding cache lock
cifs: Fix potential deadlock when updating vol in cifs_reconnect()
cifs: Merge is_path_valid() into get_normalized_path()
cifs: Introduce helpers for finding TCP connection
cifs: Get rid of kstrdup_const()'d paths
cifs: Clean up DFS referral cache
cifs: Don't use iov_iter::type directly
cifs: set correct max-buffer-size for smb2_ioctl_init()
cifs: use compounding for open and first query-dir for readdir()
...
- Optimize fs-verity sequential read performance by implementing
readahead of Merkle tree pages. This allows the Merkle tree to be
read in larger chunks.
- Optimize FS_IOC_ENABLE_VERITY performance in the uncached case by
implementing readahead of data pages.
- Allocate the hash requests from a mempool in order to eliminate the
possibility of allocation failures during I/O.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXi+OuhQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK/ZIAP452KKPs6AGXrClZ2l+5nFbkDLN9Or8
w277B0BeRnu5ogEApmKnYsmRsduLZRJbni7VCpkJLAYI2kmFCwGkFfe3tAQ=
=svdR
-----END PGP SIGNATURE-----
Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fsverity updates from Eric Biggers:
- Optimize fs-verity sequential read performance by implementing
readahead of Merkle tree pages. This allows the Merkle tree to be
read in larger chunks.
- Optimize FS_IOC_ENABLE_VERITY performance in the uncached case by
implementing readahead of data pages.
- Allocate the hash requests from a mempool in order to eliminate the
possibility of allocation failures during I/O.
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fs-verity: use u64_to_user_ptr()
fs-verity: use mempool for hash requests
fs-verity: implement readahead of Merkle tree pages
fs-verity: implement readahead for FS_IOC_ENABLE_VERITY
- Extend the FS_IOC_ADD_ENCRYPTION_KEY ioctl to allow the raw key to be
provided via a keyring key.
- Prepare for the new dirhash method (SipHash of plaintext name) that
will be used by directories that are both encrypted and casefolded.
- Switch to a new format for "no-key names" that prepares for the new
dirhash method, and also fixes a longstanding bug where multiple
filenames could map to the same no-key name.
- Allow the crypto algorithms used by fscrypt to be built as loadable
modules when the fscrypt-capable filesystems are.
- Optimize fscrypt_zeroout_range().
- Various cleanups.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXi+OfxQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK0tJAQDkxUl11NRqVgS06TLWAniKVMWh3kaL
CaonjEmfs1m6kgEA+TP8coTAxUvJNubaHz3J3x9dyx9RPUVFyUzSB0J/TQg=
=XkXJ
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt updates from Eric Biggers:
- Extend the FS_IOC_ADD_ENCRYPTION_KEY ioctl to allow the raw key to be
provided via a keyring key.
- Prepare for the new dirhash method (SipHash of plaintext name) that
will be used by directories that are both encrypted and casefolded.
- Switch to a new format for "no-key names" that prepares for the new
dirhash method, and also fixes a longstanding bug where multiple
filenames could map to the same no-key name.
- Allow the crypto algorithms used by fscrypt to be built as loadable
modules when the fscrypt-capable filesystems are.
- Optimize fscrypt_zeroout_range().
- Various cleanups.
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: (26 commits)
fscrypt: improve format of no-key names
ubifs: allow both hash and disk name to be provided in no-key names
ubifs: don't trigger assertion on invalid no-key filename
fscrypt: clarify what is meant by a per-file key
fscrypt: derive dirhash key for casefolded directories
fscrypt: don't allow v1 policies with casefolding
fscrypt: add "fscrypt_" prefix to fname_encrypt()
fscrypt: don't print name of busy file when removing key
ubifs: use IS_ENCRYPTED() instead of ubifs_crypt_is_encrypted()
fscrypt: document gfp_flags for bounce page allocation
fscrypt: optimize fscrypt_zeroout_range()
fscrypt: remove redundant bi_status check
fscrypt: Allow modular crypto algorithms
fscrypt: include <linux/ioctl.h> in UAPI header
fscrypt: don't check for ENOKEY from fscrypt_get_encryption_info()
fscrypt: remove fscrypt_is_direct_key_policy()
fscrypt: move fscrypt_valid_enc_modes() to policy.c
fscrypt: check for appropriate use of DIRECT_KEY flag earlier
fscrypt: split up fscrypt_supported_policy() by policy version
fscrypt: introduce fscrypt_needs_contents_encryption()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4vEM0ACgkQxWXV+ddt
WDvWFA//eHQZjWQ4FNby9AZwkDS4NIq2v6XGg7Xx9Ioxf1NenuBdyt1uNda2I3Mw
4/0tEnrd3XL3POldx4TTZXlg0itfXeY2un8pGREPwbqGL67296wmENUT8kwXU8aX
4zMuqO8veO7jzqeibX7T0VntnAC4Q+yWUd8Tb07KXaNH6Jeowo2rbU7n2CdI8T91
IZdc/AxnmomVQ3Y0lmQe5fL5z9KRkwud7toXek8yVXyheX1JgnqkzVjCgLyk3ai7
0VP0jfOPtC1qoMPR+buD2VuxaDTrLGkmsaG9l4XZghmmSnAvTMOxIosFA/SLqHwp
LPDZ5RABbxMj1MHTAr2Bq+9uokqSIIdbSWFiMazeJsL/JWOiOQk0BekNJudeL7Ow
PtGIZkKC+c3qKIjLymnd+u56klCe/O4vhx7wOHjoGxC0j+qRN473LK8dmGmgM9Yv
XH5YfWqDpjt5h7SSei94S7JrqaJ29+gY8bMtMXJhx46jHh6AdXAkszU83kLr8Biz
mvV49ECib12RuI4EVB4I3XedCftzMRH5ocN/Y3B2qGn/bU3g9nuExAc24FRBt6QK
e/ucc4jQyRpbvpDwsmdE+kSxUvENuIRnlyke9XdOsN9Mth0X85gvYOS6aFS+4qjq
QiXborlp+fZY00INYjG6Ck9UyfpdSEUJg452xVkGBhd/emAAIDo=
=FGXQ
-----END PGP SIGNATURE-----
Merge tag 'fs-dedupe-last-block-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull fs deduplication fix from David Sterba:
"This is a fix for deduplication bug: the last block of two files is
allowed to deduplicated. This got broken in 5.1 by lifting some
generic checks to VFS layer. The affected filesystems are btrfs and
xfs.
The patches are marked for stable as the bug decreases deduplication
effectivity"
* tag 'fs-dedupe-last-block-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: make deduplication with range including the last block work
fs: allow deduplication of eof block into the end of the destination file
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4vDYkACgkQxWXV+ddt
WDsNJQ//WJEcYoRpN5Y7oOIk/vo5ulF68P3kUh3hl206A13xpaHorvTvZKAD5s2o
C6xACJk839sGEhMdDRWvdeBDCHTedMk7EXjiZ6kJD+7EPpWmDllI5O6DTolT7SR2
b9zId4KCO+m8LiLZccRsxCJbdkJ7nJnz2c5+063TjsS3uq1BFudctRUjW/XnFCCZ
JIE5iOkdXrA+bFqc+l2zKTwgByQyJg+hVKRTZEJBT0QZsyNQvHKzXAmXxGopW8bO
SeuzFkiFTA0raK8xBz6mUwaZbk40Qlzm9v9AitFZx0x2nvQnMu447N3xyaiuyDWd
Li1aMN0uFZNgSz+AemuLfG0Wj70x1HrQisEj958XKzn4cPpUuMcc3lr1PZ2NIX+C
p6pSgaLOEq8Rc0U78/euZX6oyiLJPAmQO1TdkVMHrcMi36esBI6uG11rds+U+xeK
XoP20qXLFVYLLrl3wH9F4yIzydfMYu66Us1AeRPRB14NSSa7tbCOG//aCafOoLM6
518sJCazSWlv1kDewK8dtLiXc8eM6XJN+KI4NygFZrUj2Rq376q5oovUUKKkn3iN
pdHtF/7gAxIx6bZ+jY/gyt/Xe5AdPi7sKggahvrSOL3X+LLINwC4r+vAnnpd6yh4
NfJj5fobvc/mO9PEVMwgJ8PmHw5uNqeMlORGjk7stQs7Oez3tCw=
=4OkE
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features, highlights:
- async discard
- "mount -o discard=async" to enable it
- freed extents are not discarded immediatelly, but grouped
together and trimmed later, with IO rate limiting
- the "sync" mode submits short extents that could have been
ignored completely by the device, for SATA prior to 3.1 the
requests are unqueued and have a big impact on performance
- the actual discard IO requests have been moved out of
transaction commit to a worker thread, improving commit latency
- IO rate and request size can be tuned by sysfs files, for now
enabled only with CONFIG_BTRFS_DEBUG as we might need to
add/delete the files and don't have a stable-ish ABI for
general use, defaults are conservative
- export device state info in sysfs, eg. missing, writeable
- no discard of extents known to be untouched on disk (eg. after
reservation)
- device stats reset is logged with process name and PID that called
the ioctl
Fixes:
- fix missing hole after hole punching and fsync when using NO_HOLES
- writeback: range cyclic mode could miss some dirty pages and lead
to OOM
- two more corner cases for metadata_uuid change after power loss
during the change
- fix infinite loop during fsync after mix of rename operations
Core changes:
- qgroup assign returns ENOTCONN when quotas not enabled, used to
return EINVAL that was confusing
- device closing does not need to allocate memory anymore
- snapshot aware code got removed, disabled for years due to
performance problems, reimplmentation will allow to select wheter
defrag breaks or does not break COW on shared extents
- tree-checker:
- check leaf chunk item size, cross check against number of
stripes
- verify location keys for DIR_ITEM, DIR_INDEX and XATTR items
- new self test for physical -> logical mapping code, used for super
block range exclusion
- assertion helpers/macros updated to avoid objtool "unreachable
code" reports on older compilers or config option combinations"
* tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (84 commits)
btrfs: free block groups after free'ing fs trees
btrfs: Fix split-brain handling when changing FSID to metadata uuid
btrfs: Handle another split brain scenario with metadata uuid feature
btrfs: Factor out metadata_uuid code from find_fsid.
btrfs: Call find_fsid from find_fsid_inprogress
Btrfs: fix infinite loop during fsync after rename operations
btrfs: set trans->drity in btrfs_commit_transaction
btrfs: drop log root for dropped roots
btrfs: sysfs, add devid/dev_state kobject and device attributes
btrfs: Refactor btrfs_rmap_block to improve readability
btrfs: Add self-tests for btrfs_rmap_block
btrfs: selftests: Add support for dummy devices
btrfs: Move and unexport btrfs_rmap_block
btrfs: separate definition of assertion failure handlers
btrfs: device stats, log when stats are zeroed
btrfs: fix improper setting of scanned for range cyclic write cache pages
btrfs: safely advance counter when looking up bio csums
btrfs: remove unused member btrfs_device::work
btrfs: remove unnecessary wrapper get_alloc_profile
btrfs: add correction to handle -1 edge case in async discard
...
Pull x86 resource control updates from Ingo Molnar:
"The main change in this tree is the extension of the resctrl procfs
ABI with a new file that helps tooling to navigate from tasks back to
resctrl groups: /proc/{pid}/cpu_resctrl_groups.
Also fix static key usage for certain feature combinations and
simplify the task exit resctrl case"
* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Add task resctrl information display
x86/resctrl: Check monitoring static key in the MBM overflow handler
x86/resctrl: Do not reconfigure exiting tasks
Pull scheduler updates from Ingo Molnar:
"These were the main changes in this cycle:
- More -rt motivated separation of CONFIG_PREEMPT and
CONFIG_PREEMPTION.
- Add more low level scheduling topology sanity checks and warnings
to filter out nonsensical topologies that break scheduling.
- Extend uclamp constraints to influence wakeup CPU placement
- Make the RT scheduler more aware of asymmetric topologies and CPU
capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y
- Make idle CPU selection more consistent
- Various fixes, smaller cleanups, updates and enhancements - please
see the git log for details"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
sched/fair: Define sched_idle_cpu() only for SMP configurations
sched/topology: Assert non-NUMA topology masks don't (partially) overlap
idle: fix spelling mistake "iterrupts" -> "interrupts"
sched/fair: Remove redundant call to cpufreq_update_util()
sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
sched/fair: calculate delta runnable load only when it's needed
sched/cputime: move rq parameter in irqtime_account_process_tick
stop_machine: Make stop_cpus() static
sched/debug: Reset watchdog on all CPUs while processing sysrq-t
sched/core: Fix size of rq::uclamp initialization
sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
sched/fair: Load balance aggressively for SCHED_IDLE CPUs
sched/fair : Improve update_sd_pick_busiest for spare capacity case
watchdog: Remove soft_lockup_hrtimer_cnt and related code
sched/rt: Make RT capacity-aware
sched/fair: Make EAS wakeup placement consider uclamp restrictions
sched/fair: Make task_fits_capacity() consider uclamp restrictions
sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
sched/uclamp: Make uclamp util helpers use and return UL values
...
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Ftrace is one of the last W^X violators (after this only KLP is
left). These patches move it over to the generic text_poke()
interface and thereby get rid of this oddity. This requires a
surprising amount of surgery, by Peter Zijlstra.
- x86/AMD PMUs: add support for 'Large Increment per Cycle Events' to
count certain types of events that have a special, quirky hw ABI
(by Kim Phillips)
- kprobes fixes by Masami Hiramatsu
Lots of tooling updates as well, the following subcommands were
updated: annotate/report/top, c2c, clang, record, report/top TUI,
sched timehist, tests; plus updates were done to the gtk ui, libperf,
headers and the parser"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
perf/x86/amd: Add support for Large Increment per Cycle Events
perf/x86/amd: Constrain Large Increment per Cycle events
perf/x86/intel/rapl: Add Comet Lake support
tracing: Initialize ret in syscall_enter_define_fields()
perf header: Use last modification time for timestamp
perf c2c: Fix return type for histogram sorting comparision functions
perf beauty sockaddr: Fix augmented syscall format warning
perf/ui/gtk: Fix gtk2 build
perf ui gtk: Add missing zalloc object
perf tools: Use %define api.pure full instead of %pure-parser
libperf: Setup initial evlist::all_cpus value
perf report: Fix no libunwind compiled warning break s390 issue
perf tools: Support --prefix/--prefix-strip
perf report: Clarify in help that --children is default
tools build: Fix test-clang.cpp with Clang 8+
perf clang: Fix build with Clang 9
kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logic
tools lib: Fix builds when glibc contains strlcpy()
perf report/top: Make 'e' visible in the help and make it toggle showing callchains
perf report/top: Do not offer annotation for symbols without samples
...
This reverts commit e955537e32.
Before patch e955537e32, tr_num_revoke tracked the number of revokes
added to the transaction, and tr_num_revoke_rm tracked how many
revokes were removed. But since revokes are queued off the sdp
(superblock) pointer, some transactions could remove more revokes
than they added. (e.g. revokes added by a different process).
Commit e955537e32 eliminated transaction variable tr_num_revoke_rm,
but in order to do so, it changed the accounting to always use
tr_num_revoke for its math. Since you can remove more revokes than
you add, tr_num_revoke could now become a negative value.
This negative value broke the assert in function gfs2_trans_end:
if (gfs2_assert_withdraw(sdp, (nbuf <=3D tr->tr_blocks) &&
(tr->tr_num_revoke <=3D tr->tr_revokes)))
One way to fix this is to simply remove the tr_num_revoke clause
from the assert and allow the value to become negative. Andreas
didn't like that idea, so instead, we decided to revert e955537e32.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
- Rework the smp function call core code to avoid the allocation of an
additional cpumask.
- Remove the not longer required GFP argument from on_each_cpu_cond() and
on_each_cpu_cond_mask() and fixup the callers.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vcrATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYocr1D/4ptWrZKsgBxGKBP34lvJAjd0KRqVoz
J9dLAN+AAs6YZSnOmRBX1b9d9IL2PrccOEF+J/Ja3ZkB+PAoAQ9W3uCHkZ77WUph
xx5eJahZCo+3nZ6amGgS2cPdG8WjxSK3enxPcU4pJhV/QaaP7R9BZt5YQgreYAQO
kRi0qyt10AExLqLd+077GX5DKcEOXwwVG/qckUQK2h8Kkd68vTbjDxggvsHwmpSE
MHaszv85UpE+YQbT6DyG5Hi4kK3AJeODBy/fKr2VODIBLZpKiuQ5kK4lbNHYPpVB
wXw0umXHLQggrKoPKo58ayoCXD0bAG9JT0rvapjUJIz1/9YejQ6lB/t5f0dPbSrU
al4CJq/pfNky4H6uLWFVbAXJabJuBcB/eG1csaM88Yw0pEXkbnHCOkJAdosoDhhl
qNQYg4yaE9tTuy1chXDMntH0R0Qztqry6+DMsczJxT21TgERsHCRJV+mGLV46/ZN
GXJEoJ/cnjNJlqj8GirjbksPRbxuvmQNHRVrTh8qOSxbPKUQZfZocp9HHNmFsBaN
Q07VgWMHXzYj1L4r3cbJ/ONpOCo66lw7F//MNGk0eIWdeL6H7XZvJQPX+YUrLsZc
tVlZh8mZOGbRiM8g1dN0BSJO7QrVYmJWGb0oQQtv5tVSRN/V8Y9VZ8YX8lpYlF1e
ETkrZLGhTJWp4A==
=M4aK
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core SMP updates from Thomas Gleixner:
"A small set of SMP core code changes:
- Rework the smp function call core code to avoid the allocation of
an additional cpumask
- Remove the not longer required GFP argument from on_each_cpu_cond()
and on_each_cpu_cond_mask() and fixup the callers"
* tag 'smp-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Remove allocation mask from on_each_cpu_cond.*()
smp: Add a smp_cond_func_t argument to smp_call_function_many()
smp: Use smp_cond_func_t as type for the conditional function
- Time namespace support:
If a container migrates from one host to another then it expects that
clocks based on MONOTONIC and BOOTTIME are not subject to
disruption. Due to different boot time and non-suspended runtime these
clocks can differ significantly on two hosts, in the worst case time
goes backwards which is a violation of the POSIX requirements.
The time namespace addresses this problem. It allows to set offsets for
clock MONOTONIC and BOOTTIME once after creation and before tasks are
associated with the namespace. These offsets are taken into account by
timers and timekeeping including the VDSO.
Offsets for wall clock based clocks (REALTIME/TAI) are not provided by
this mechanism. While in theory possible, the overhead and code
complexity would be immense and not justified by the esoteric potential
use cases which were discussed at Plumbers '18.
The overhead for tasks in the root namespace (host time offsets = 0) is
in the noise and great effort was made to ensure that especially in the
VDSO. If time namespace is disabled in the kernel configuration the
code is compiled out.
Kudos to Andrei Vagin and Dmitry Sofanov who implemented this feature
and kept on for more than a year addressing review comments, finding
better solutions. A pleasant experience.
- Overhaul of the alarmtimer device dependency handling to ensure that
the init/suspend/resume ordering is correct.
- A new clocksource/event driver for Microchip PIT64
- Suspend/resume support for the Hyper-V clocksource
- The usual pile of fixes, updates and improvements mostly in the
driver code.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vbTcTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXT2D/96iJ3G9Snn2khEQP3XS2rYmtDGw7NO
m1n96falwWeGe6zreU80R2Jge5nLxQtNhRoMPLLee1GpHwRC6lvqEqgdZ4LMBrD2
JqV7Gzg8Urmdh+hpDsyTCpeEWEzoMKxiFOX8PxwctqUhM4szEe5iQg2YQsg85Jw2
vG6M93N2xwDILh4rhEMbKjo+5ZmYn7c1RQvpGOSmpKOj940W/N7H2HBsFhdaJ1Kw
FW5pFv1211PaU5RV2YNb2dMeeMTT1N3e2VN4Dkadoxp47pb+725gNHEBEjmV9poG
Lp4IhzGAPnj8zVD88icQZSTaK3gUHMClxprJ0Pf84WEtiH7SeGu8BPYyu77+oNDe
yzcctDJNyCWXkzmaP/fe/HLc0TStbvNAJ5Tagp4BC75gzebeb4/n8RtRT0fKeDYL
pxpDPKDAPU7p1JSjxiWAtshqjBycWNY3Z49bA7/VhKBhnv8BDyBPGlYd7/4xrbGr
RK7DQNXJwaJaiNJ7p5PiaFxGzNyB0B9sThD/slSlEInIKb4h9YzWr0TV+NB62VnB
sDcN+tpLbRPz5/5cHGGfxR0+zKWpfyai8pzbmmaXEaKssjRYwyvcac5EZdgbWpbK
k7CqAjoWLA2P+tGeePNJOf5JYK6Vmdyh4clmuwM0zOiRJ9NlWUyMf3z7QYILs4RO
UAI+6opYlZEPAw==
=x3qT
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"The timekeeping and timers departement provides:
- Time namespace support:
If a container migrates from one host to another then it expects
that clocks based on MONOTONIC and BOOTTIME are not subject to
disruption. Due to different boot time and non-suspended runtime
these clocks can differ significantly on two hosts, in the worst
case time goes backwards which is a violation of the POSIX
requirements.
The time namespace addresses this problem. It allows to set offsets
for clock MONOTONIC and BOOTTIME once after creation and before
tasks are associated with the namespace. These offsets are taken
into account by timers and timekeeping including the VDSO.
Offsets for wall clock based clocks (REALTIME/TAI) are not provided
by this mechanism. While in theory possible, the overhead and code
complexity would be immense and not justified by the esoteric
potential use cases which were discussed at Plumbers '18.
The overhead for tasks in the root namespace (ie where host time
offsets = 0) is in the noise and great effort was made to ensure
that especially in the VDSO. If time namespace is disabled in the
kernel configuration the code is compiled out.
Kudos to Andrei Vagin and Dmitry Sofanov who implemented this
feature and kept on for more than a year addressing review
comments, finding better solutions. A pleasant experience.
- Overhaul of the alarmtimer device dependency handling to ensure
that the init/suspend/resume ordering is correct.
- A new clocksource/event driver for Microchip PIT64
- Suspend/resume support for the Hyper-V clocksource
- The usual pile of fixes, updates and improvements mostly in the
driver code"
* tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=n
alarmtimer: Use wakeup source from alarmtimer platform device
alarmtimer: Make alarmtimer platform device child of RTC device
alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect reality
hrtimer: Add missing sparse annotation for __run_timer()
lib/vdso: Only read hrtimer_res when needed in __cvdso_clock_getres()
MIPS: vdso: Define BUILD_VDSO32 when building a 32bit kernel
clocksource/drivers/hyper-v: Set TSC clocksource as default w/ InvariantTSC
clocksource/drivers/hyper-v: Untangle stimers and timesync from clocksources
clocksource/drivers/timer-microchip-pit64b: Fix sparse warning
clocksource/drivers/exynos_mct: Rename Exynos to lowercase
clocksource/drivers/timer-ti-dm: Fix uninitialized pointer access
clocksource/drivers/timer-ti-dm: Switch to platform_get_irq
clocksource/drivers/timer-ti-dm: Convert to devm_platform_ioremap_resource
clocksource/drivers/em_sti: Fix variable declaration in em_sti_probe
clocksource/drivers/em_sti: Convert to devm_platform_ioremap_resource
clocksource/drivers/bcm2835_timer: Fix memory leak of timer
clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
clocksource/drivers/timer-microchip-pit64b: Add Microchip PIT64B support
clocksource/drivers/hyper-v: Reserve PAGE_SIZE space for tsc page
...
In preparation for sharing an io-wq across different users, add a
reference count that manages destruction of it.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of out of memory the second argument of percpu_ref_put_many() in
io_submit_sqes() may evaluate into "nr - (-EAGAIN)", that is clearly
wrong.
Fixes: 2b85edfc0c ("io_uring: batch getting pcpu references")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Draining the middle of a link is tricky, so leave a comment there
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For the non-vectored variant of READV/WRITEV, we don't need to setup an
async io context, and we flag that appropriately in the io_op_defs
array. However, in fixing this for the 5.5 kernel in commit 74566df3a7
we didn't have these opcodes, so the check there was added just for the
READ_FIXED and WRITE_FIXED opcodes. Replace that check with just a
single check for needing async context, that covers all four of these
read/write variants that don't use an iovec.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All of these functions are only called from CephFS, so move them into
ceph.ko, and drop the exports.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
It's hard to read, especially when it is:
ceph: __choose_mds 00000000b7bc9c15 is_hash=1 (-271041095) mode 0
At the same time, switch to __func__ to get rid of the checkpatch
warning.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Instead of using the copy-from operation, switch copy_file_range to the
new copy-from2 operation, which allows to send the truncate_seq and
truncate_size parameters.
If an OSD does not support the copy-from2 operation it will return
-EOPNOTSUPP. In that case, the kernel client will stop trying to do
remote object copies for this fs client and will always use the generic
VFS copy_file_range.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Move s_ref up to plug a 4 byte hole, which plugs another.
Move r_kref to shave 8 bytes off per request on x86_64.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The total bytes may potentially be larger than 8.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Just in case the session's refcount reach 0 and is releasing, and
if we get the session without checking it, we may encounter kernel
crash.
Rename get_session to ceph_get_mds_session and make it global.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
It's possible to pass the mount helper a server path that has more
than one contiguous slash character. For example:
$ mount -t ceph 192.168.195.165:40176:/// /mnt/cephfs/
In the MDS server side the extra slashes of the server path will be
treated as snap dir, and then we can get the following debug logs:
ceph: mount opening path //
ceph: open_root_inode opening '//'
ceph: fill_trace 0000000059b8a3bc is_dentry 0 is_target 1
ceph: alloc_inode 00000000dc4ca00b
ceph: get_inode created new inode 00000000dc4ca00b 1.ffffffffffffffff ino 1
ceph: get_inode on 1=1.ffffffffffffffff got 00000000dc4ca00b
And then when creating any new file or directory under the mount
point, we can hit the following BUG_ON in ceph_fill_trace():
BUG_ON(ceph_snap(dir) != dvino.snap);
Have the client ignore the extra slashes in the server path when
mounting. This will also canonicalize the path, so that identical mounts
can be consilidated.
1) "//mydir1///mydir//"
2) "/mydir1/mydir"
3) "/mydir1/mydir/"
Regardless of the internal treatment of these paths, the kernel still
stores the original string including the leading '/' for presentation
to userland.
URL: https://tracker.ceph.com/issues/42771
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The m_num_mds here is actually the number for MDSs which are in
up:active status, and it will be duplicated to m_num_active_mds,
so remove it.
Add possible_max_rank to the mdsmap struct and this will be
the correctly possible largest rank boundary.
Remove the special case for one mds in __mdsmap_get_random_mds(),
because the validate mds rank may not always be 0.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In the debug logs about the di->offset or ctx->pos it is in hex
format, but some others are using the dec format. It is a little
hard to read.
For the xattr version, it is u64 type, using a shorter type may
truncate it.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For the caps having no any subset mask requested we shouldn't touch
them.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently, we could have an open-by-handle (or NFS server) call
into the filesystem and start working with an inode before it's
properly filled out.
Don't clear I_NEW until we have filled out the inode, and discard it
properly if that fails. Note that we occasionally take an extra
reference to the inode to ensure that we don't put the last reference in
discard_new_inode, but rather leave it for ceph_async_iput.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If max_mds > 1 and a request is submitted that chooses a random mds
rank, and the relating session is not opened yet, the request will wait
until the session has been opened and resend again.
Every time the request goes through __do_request, it will release the
req->session first and choose a random one again, which may be a
completely different rank than the one it just waited on.
In the worst case, it will open all the mds sessions one by one just
before the request can be successfully sent out.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If all the MDS daemons are down for some reason, then the first mount
attempt will fail with EIO after the mount request times out. A mount
attempt will also fail with EIO if all of the MDS's are laggy.
This patch changes the code to return -EHOSTUNREACH in these situations
and adds a pr_info error message to help the admin determine the cause.
URL: https://tracker.ceph.com/issues/4386
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When reconnecting the session but if it is denied by the MDS due
to client was in blacklist or something else, kclient will receive
a session close reply, and we will never see the important log:
"ceph: mds%d reconnect denied"
And with the confusing log:
"ceph: handle_session mds0 close 0000000085804730 state ??? seq 0"
Let's keep the session state until its memories is released.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the caller passes in a NULL cap_reservation, and we can't allocate
one then ensure that we fail gracefully.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
During umount, if there has no any unsafe request in the mdsc and
some requests still in-flight and not got reply yet, and if the
rest requets are all safe ones, after that even all of them in mdsc
are unregistered, the umount must wait until after mount_timeout
seconds anyway.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Even the MDS is in up:active state, but it also maybe laggy. Here
will skip the laggy MDSs.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In case the max_mds > 1 in MDS cluster and there is no any standby
MDS and all the max_mds MDSs are in up:active state, if one of the
up:active MDSs is dead, the m->m_num_laggy in kclient will be 1.
Then the mount will fail without considering other healthy MDSs.
There manybe some MDSs still "in" the cluster but not in up:active
state, we will ignore them. Only when all the up:active MDSs in
the cluster are laggy will treat the cluster as not be available.
In case decreasing the max_mds, the cluster will not stop the extra
up:active MDSs immediately and there will be a latency. During it
the up:active MDS number will be larger than the max_mds, so later
the m_info memories will 100% be reallocated.
Here will pick out the up:active MDSs as the m_num_mds and allocate
the needed memories once.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_pagelist_encode_string() will not fail in reserved case,
also, we do not check err code here, so remove unnecessary
assignment.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
We print session's refcount in debug message inside
ceph_put_mds_session() and get_session(), so we don't have to
print it in con_get()/__ceph_lookup_mds_session()/con_put().
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
PTR_ERR_OR_ZERO contains if(IS_ERR(...)) + PTR_ERR, just use
PTR_ERR_OR_ZERO directly.
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
RHBZ 1336264
When we extend a file we must also force the size to be updated.
This fixes an issue with holetest in xfs-tests which performs the following
sequence :
1, create a new file
2, use fallocate mode==0 to populate the file
3, mmap the file
4, touch each page by reading the mmapped region.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
RHBZ: 1760879
Fix an oops in match_prepath() by making sure that the prepath string is not
NULL before we pass it into strcmp().
This is similar to other checks we make for example in cifs_root_iget()
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When mounting with "modefromsid" mount parm most servers will require
that some default permissions are given to users in the ACL on newly
created files, files created with the new 'sd context' - when passing in
an sd context on create, permissions are not inherited from the parent
directory, so in addition to the ACE with the special SID which contains
the mode, we also must pass in an ACE allowing users to access the file
(GENERIC_ALL for authenticated users seemed like a reasonable default,
although later we could allow a mount option or config switch to make
it GENERIC_ALL for EVERYONE special sid).
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-By: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
This is needed for backup/restore scenarios among others.
Add extended attribute "system.cifs_ntsd" (and alias "system.smb3_ntsd")
to allow for setting owner and DACL in the security descriptor. This is in
addition to the existing "system.cifs_acl" and "system.smb3_acl" attributes
that allow for setting DACL only. Add support for setting creation time and
dos attributes using set_file_info() calls to complement the existing
support for getting these attributes via query_path_info() calls.
Signed-off-by: Boris Protopopov <bprotopopov@hotmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
fs/cifs/smb2pdu.c: In function 'SMB2_query_directory':
fs/cifs/smb2pdu.c:4444:26: warning:
variable 'server' set but not used [-Wunused-but-set-variable]
struct TCP_Server_Info *server;
It is not used, so remove it.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Starting from 4a367dc044, we must set the mount options based on the
DFS full path rather than the resolved target, that is, cifs_mount()
will be responsible for resolving the DFS link (cached) as well as
performing failover to any other targets in the referral.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reported-by: Martijn de Gouw <martijn.de.gouw@prodrive-technologies.com>
Fixes: 4a367dc044 ("cifs: Add support for failover in cifs_mount()")
Link: https://lore.kernel.org/linux-cifs/39643d7d-2abb-14d3-ced6-c394fab9a777@prodrive-technologies.com
Tested-by: Martijn de Gouw <martijn.de.gouw@prodrive-technologies.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
static analysis with Coverity detected an issue with the following
commit:
Author: Paulo Alcantara (SUSE) <pc@cjr.nz>
Date: Wed Dec 4 17:38:03 2019 -0300
cifs: Avoid doing network I/O while holding cache lock
Addresses-Coverity: ("Uninitialized pointer read")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
copy_ref_data() may return error, it should be
returned to upstream caller.
Fixes: 03535b72873b ("cifs: Avoid doing network I/O while holding cache lock")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When creating or updating a cache entry, we need to get an DFS
referral (get_dfs_referral), so avoid holding any locks during such
network operation.
To prevent that, do the following:
* change cache hashtable sync method from RCU sync to a read/write
lock.
* use GFP_ATOMIC in memory allocations.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
We can't acquire volume lock while refreshing the DFS cache because
cifs_reconnect() may call dfs_cache_update_vol() while we are walking
through the volume list.
To prevent that, make vol_info refcounted, create a temp list with all
volumes eligible for refreshing, and then use it without any locks
held.
Besides, replace vol_lock with a spinlock and protect cache_ttl from
concurrent accesses or changes.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Just do the trivial path validation in get_normalized_path().
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add helpers for finding TCP connections that are good candidates for
being used by DFS refresh worker.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The DFS cache API is mostly used with heap allocated strings.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Do some renaming and code cleanup.
No functional changes.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Don't use iov_iter::type directly, but rather use the new accessor
functions that have been added. This allows the .type field to be split
and rearranged without the need to update the filesystems.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix two places where we need to adjust down the max response size for
ioctl when it is used together with compounding.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Combine the initial SMB2_Open and the first SMB2_Query_Directory in a compound.
This shaves one round-trip of each directory listing, changing it from 4 to 3
for small directories.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Fixes coccicheck warning:
fs/cifs/cifssmb.c:4622:3-22: WARNING: Assignment of 0/1 to bool variable
fs/cifs/cifssmb.c:4756:3-22: WARNING: Assignment of 0/1 to bool variable
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fixes coccicheck warning:
fs/cifs/smb2ops.c:807:2-36: WARNING: Assignment of 0/1 to bool variable
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Instead of passing __func__ to the error reporting function, let's use
the return address builtins so that the messages actually tell you which
higher level function called the buffer functions. This was previously
true for the xfs_buf_read callers, but not for the xfs_trans_read_buf
callers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Drop the null buffer pointer checks in all code that calls
xfs_alloc_read_agf and doesn't pass XFS_ALLOC_FLAG_TRYLOCK because
they're no longer necessary.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Refactor xfs_read_agf and xfs_alloc_read_agf to return EAGAIN if the
caller passed TRYLOCK and we weren't able to get the lock; and change
the callers to recognize this.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the xfs_btree_get_bufs and xfs_btree_get_bufl functions, since
they're pretty trivial oneliners.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert xfs_trans_get_buf() to return numeric error codes like most
everywhere else in xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert xfs_trans_get_buf_map() to return numeric error codes like most
everywhere else in xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert xfs_buf_read() to return numeric error codes like most
everywhere else in xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert xfs_buf_get_uncached() to return numeric error codes like most
everywhere else in xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert xfs_buf_get() to return numeric error codes like most
everywhere else in xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert xfs_buf_read_map() to return numeric error codes like most
everywhere else in xfs. This involves moving the open-coded logic that
reports metadata IO read / corruption errors and stales the buffer into
xfs_buf_read_map so that the logic is all in one place.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert xfs_buf_get_map() to return numeric error codes like most
everywhere else in xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert _xfs_buf_alloc() to return numeric error codes like most
everywhere else in xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4t79kQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjrZD/9l31+WrZhBJf4EDZRntGFdJUAxVe3rZw2Z
k45P7QezZwc4+mY7WeIlV4rgsHqhzPwTZP53PVmgeGw6vG6kjWllBSM5hzS+lfFC
q3mfJLLva7YckLsf6K1vOfNw9Dny26DuENHaDGPejSr2LYnRIHejBJuqiHJZigyl
8y8rbmNdWMS5/qOlGfNDfAII1z13Up30Tt4BXgX2aGITTjvEquirzRs5HrB9e2ci
vHX38uXMJ6DqQJwPDq/er8GXVsVkqd10BByh3KESxgjrQ9c+2BExwdaOtkMdbayx
UM3mu+49Xo/LDR0NHpJBQTeAhhl+wVZhfpyGZzng6TOgnCN/F5NOB18tmC5g8fHx
vTWpBieTujVFLygwgMIoY5Qwo0Q1bYJUi3VydWm956YujhgS76UfeXC8N9Prk7XI
UDnDqAjY7gTVn0EewYKa5Sd//6TqQ+WgwB8LtCiTqLOP1kIiX+Y/rXG8PrdNMskh
zpWJ/lPiTzWSn40NbU+yK09S5zu6fhqlXhjVqPlHLIOreOMD3PwOMxWkmq7MIA6j
/vEK9Of0cHgdaYEJfIu+kqDkoy6Tcde3iwpV+ZluexLdTE/FF5qWIG+a8phyCLz2
KXwgyvx811T7mihlLxuwvAlc//61p9X1XsbusYu/wK/NIbu0lBZx0eHkZWGlE+ko
tL0Tdx7cCQ==
=5jvb
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.5-2020-01-26' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Fix for two regressions in this cycle, both reported by the postgresql
use case.
One removes the added restriction on who can submit IO, making it
possible for rings shared across forks to do so. The other fixes an
issue for the same kind of use case, where one exiting process would
cancel all IO"
* tag 'io_uring-5.5-2020-01-26' of git://git.kernel.dk/linux-block:
io_uring: don't cancel all work on process exit
Revert "io_uring: only allow submit from owning task"
Pull vfs fix from Al Viro:
"Fix a use-after-free in do_last() handling of sysctl_protected_...
checks.
The use-after-free normally doesn't happen there, but race with
rename() and it becomes possible"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
do_last(): fetch directory ->i_mode and ->i_uid before it's too late
If we're sharing the ring across forks, then one process exiting means
that we cancel ALL work and prevent future work. This is overly
restrictive. As long as we cancel the work associated with the files
from the current task, it's safe to let others persist. Normal fd close
on exit will still wait (and cancel) pending work.
Fixes: fcb323cc53 ("io_uring: io_uring: add support for async work inheriting files")
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This ends up being too restrictive for tasks that willingly fork and
share the ring between forks. Andres reports that this breaks his
postgresql work. Since we're close to 5.5 release, revert this change
for now.
Cc: stable@vger.kernel.org
Fixes: 44d282796f ("io_uring: only allow submit from owning task")
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The afs filesystem needs to prohibit certain characters from cell names,
such as '/', as these are used to form filenames in procfs, leading to
the following warning being generated:
WARNING: CPU: 0 PID: 3489 at fs/proc/generic.c:178
Fix afs_alloc_cell() to disallow nonprintable characters, '/', '@' and
names that begin with a dot.
Remove the check for "@cell" as that is then redundant.
This can be tested by running:
echo add foo/.bar 1.2.3.4 >/proc/fs/afs/cells
Note that we will also need to deal with:
- Names ending in ".invalid" shouldn't be passed to the DNS.
- Names that contain non-valid domainname chars shouldn't be passed to
the DNS.
- DNS replies that say "your-dns-needs-immediate-attention.<gTLD>" and
replies containing A records that say 127.0.53.53 should be
considered invalid.
[https://www.icann.org/en/system/files/files/name-collision-mitigation-01aug14-en.pdf]
but these need to be dealt with by the kafs-client DNS program rather
than the kernel.
Reported-by: syzbot+b904ba7c947a37b4b291@syzkaller.appspotmail.com
Cc: stable@kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
may_create_in_sticky() call is done when we already have dropped the
reference to dir.
Fixes: 30aba6656f (namei: allow restricted O_CREAT of FIFOs and regular files)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4sLasACgkQxWXV+ddt
WDsegg/8CBQ1/pGj+8mvf+ws6f71Av8jspY2Ebr+HCjaGhD2MG3HI1kA5gC9Qnbb
fQVd12M5ma2BTrIcszxwm+VMIMlDotRFzfAp8uuFJtW0aAEGMCboX6VRYWa/4I0o
SmgJg0RYh926VL73qSe3S72pfIYjar30RwjVIVTmsHxL/D/lEkrHg6IGKRCe/MaN
eQipth3iuFtcWmGm1+DxEySsOs7AMPg3wL8KVnQcYoDI2kg3BXFH9a4wTE6VmWsU
ZjonJBA/Rl8oA2YOVDum4mL5j2c5RulWEymdVKyo1oH+8kLDOQ8snd7Bxp3qtJ1C
gdVbS8gi7gT5/C+yex+ZWlAdfmCSGWj7dr7jjiELZhTrsBhtS7y+GM52GivSrJ3z
TciNQtF/Y0SrZGprPMgVGAHuIKWWwSmWJPmkRB4zv/5efFFdKg8/UmcRmh6dMo83
IF4VPEBQgJLj3ja9Wns5yvW9asKNcynGeFK7aV+BlGW/wuvBW9o017c4Q04dXSAK
iFpipJaR/6ZGmXlRQLa1uyKWVHNIfSFT47WJqa6Dbo6iWRE/S/MhfkZU42z2A3H9
O2qMWmZikZnPCkha6fWyNJEDxF3imC+/LBsYoEuVPR7kZ/irDnI1cJNsTocOlyj1
kgFtL5MnCBHCop9/tPGiVdin9ilHJs3q2kAkR5BNCSEqhC8mo4g=
=IPUk
-----END PGP SIGNATURE-----
Merge tag 'for-5.5-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"Here's a last minute fix for a regression introduced in this
development cycle.
There's a small chance of a silent corruption when device replace and
NOCOW data writes happen at the same time in one block group. Metadata
or COW data writes are unaffected.
The extra fixup patch is there to silence an unnecessary warning"
* tag 'for-5.5-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: dev-replace: remove warning for unknown return codes when finished
btrfs: scrub: Require mandatory block group RO for dev-replace
This code accidentally returns success, but it should return the
-EIO error code from adfs_fplus_validate_header().
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Fixes: d79288b4f6 ("fs/adfs: bigdir: calculate and validate directory checkbyte")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The fstests btrfs/011 triggered a warning at the end of device replace,
[ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
[ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
[ 1892.059993] ------------[ cut here ]------------
[ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
[ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942
[ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
[ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286
[ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889
[ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080
[ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000
[ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000
[ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200
[ 1892.108642] FS: 00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000
[ 1892.111558] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0
[ 1892.115814] Call Trace:
[ 1892.116896] btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
[ 1892.118962] btrfs_ioctl+0x1d62/0x2550 [btrfs]
caused by the previous patch ("btrfs: scrub: Require mandatory block
group RO for dev-replace"). Hitting ENOSPC is possible and could happen
when the block group is set read-only, preventing NOCOW writes to the
area that's being accessed by dev-replace.
This has happend with scratch devices of size 12G but not with 5G and
20G, so this is depends on timing and other activity on the filesystem.
The whole replace operation is restartable, the space state should be
examined by the user in any case.
The error code is propagated back to the ioctl caller so the kernel
warning is causing false alerts.
Signed-off-by: David Sterba <dsterba@suse.com>
__jbd2_journal_abort_hard() is no longer used, so now we can merge
__jbd2_journal_abort_hard() and __journal_abort_soft() these two
functions into jbd2_journal_abort() and remove them.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191204124614.45424-5-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit fb7c02445c ("ext4: pass -ESHUTDOWN code to jbd2 layer") want
to allow jbd2 layer to distinguish shutdown journal abort from other
error cases. So the ESHUTDOWN should be taken precedence over any other
errno which has already been recoded after EXT4_FLAGS_SHUTDOWN is set,
but it only update errno in the journal suoerblock now if the old errno
is 0.
Fixes: fb7c02445c ("ext4: pass -ESHUTDOWN code to jbd2 layer")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191204124614.45424-4-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
JBD2_REC_ERR flag used to indicate the errno has been updated when jbd2
aborted, and then __ext4_abort() and ext4_handle_error() can invoke
panic if ERRORS_PANIC is specified. But if the journal has been aborted
with zero errno, jbd2_journal_abort() didn't set this flag so we can
no longer panic. Fix this by always record the proper errno in the
journal superblock.
Fixes: 4327ba52af ("ext4, jbd2: ensure entering into panic after recording an error in superblock")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191204124614.45424-3-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We invoke jbd2_journal_abort() to abort the journal and record errno
in the jbd2 superblock when committing journal transaction besides the
failure on submitting the commit record. But there is no need for the
case and we can also invoke jbd2_journal_abort() instead of
__jbd2_journal_abort_hard().
Fixes: 818d276ceb ("ext4: Add the journal checksum feature")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191204124614.45424-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Only when jh->b_jcount = 0 in jbd2_journal_put_journal_head, we are allowed
to call __journal_remove_journal_head. This assertion is meaningless,
just remove it.
Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200123070054.50585-1-luoshijie1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Show pblock only if it has meaningful value.
# before
ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [1/4294967294) 576460752303423487 H
ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [2/4294967293) 576460752303423487 HR
# after
ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [1/4294967294) 0 H
ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [2/4294967293) 0 HR
Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Link: https://lore.kernel.org/r/20191114200147.1073-2-dmonakhov@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Setting softlimit larger than hardlimit seems meaningless
for disk quota but currently it is allowed. In this case,
there may be a bit of comfusion for users when they run
df comamnd to directory which has project quota.
For example, we set 20M softlimit and 10M hardlimit of
block usage limit for project quota of test_dir(project id 123).
[root@hades mnt_ext4]# repquota -P -a
*** Report for project quotas on device /dev/loop0
Block grace time: 7days; Inode grace time: 7days
Block limits File limits
Project used soft hard grace used soft hard grace
----------------------------------------------------------------------
0 -- 13 0 0 2 0 0
123 -- 10237 20480 10240 5 200 100
The result of df command as below:
[root@hades mnt_ext4]# df -h test_dir
Filesystem Size Used Avail Use% Mounted on
/dev/loop0 20M 10M 10M 50% /home/cgxu/test/mnt_ext4
Even though it looks like there is another 10M free space to use,
if we write new data to diretory test_dir(inherit project id),
the write will fail with errno(-EDQUOT).
After this patch, the df result looks like below.
[root@hades mnt_ext4]# df -h test_dir
Filesystem Size Used Avail Use% Mounted on
/dev/loop0 10M 10M 3.0K 100% /home/cgxu/test/mnt_ext4
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191016022501.760-1-cgxu519@mykernel.net
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since ->d_compare() and ->d_hash() can be called in RCU-walk mode,
->d_parent and ->d_inode can be concurrently modified, and in
particular, ->d_inode may be changed to NULL. For ext4_d_hash() this
resulted in a reproducible NULL dereference if a lookup is done in a
directory being deleted, e.g. with:
int main()
{
if (fork()) {
for (;;) {
mkdir("subdir", 0700);
rmdir("subdir");
}
} else {
for (;;)
access("subdir/file", 0);
}
}
... or by running the 't_encrypted_d_revalidate' program from xfstests.
Both repros work in any directory on a filesystem with the encoding
feature, even if the directory doesn't actually have the casefold flag.
I couldn't reproduce a crash in ext4_d_compare(), but it appears that a
similar crash is possible there.
Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and
falling back to the case sensitive behavior if the inode is NULL.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: b886ee3e77 ("ext4: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.2+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20200124041234.159740-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Return EINVAL rather than ERANGE for mount parse errors as the userspace
mount command doesn't necessarily understand what to do with anything other
than EINVAL.
The old code returned -ERANGE as an intermediate error that then get
converted to -EINVAL, whereas the new code returns -ERANGE.
This was induced by passing minorversion=1 to a v4 mount where
CONFIG_NFS_V4_1 was disabled in the kernel build.
Fixes: 68f65ef40e1e ("NFS: Convert mount option parsing to use functionality from fs_parser.h")
Reported-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Add a kernel config CONFIG_NFS_DISABLE_UDP_SUPPORT to disallow NFS
UDP mounts and enable it by default.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the server is unavaliable, we want to allow the revalidating
lookup to time out, and to default to validating the cached dentry
if the 'softreval' mount option is set.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The allocation mask is no longer used by on_each_cpu_cond() and
on_each_cpu_cond_mask() and can be removed.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200117090137.1205765-4-bigeasy@linutronix.de
Since ->d_compare() and ->d_hash() can be called in RCU-walk mode,
->d_parent and ->d_inode can be concurrently modified, and in
particular, ->d_inode may be changed to NULL. For f2fs_d_hash() this
resulted in a reproducible NULL dereference if a lookup is done in a
directory being deleted, e.g. with:
int main()
{
if (fork()) {
for (;;) {
mkdir("subdir", 0700);
rmdir("subdir");
}
} else {
for (;;)
access("subdir/file", 0);
}
}
... or by running the 't_encrypted_d_revalidate' program from xfstests.
Both repros work in any directory on a filesystem with the encoding
feature, even if the directory doesn't actually have the casefold flag.
I couldn't reproduce a crash in f2fs_d_compare(), but it appears that a
similar crash is possible there.
Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and
falling back to the case sensitive behavior if the inode is NULL.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Do the name comparison for non-casefolded directories correctly.
This is analogous to ext4's commit 66883da1ee ("ext4: fix dcache
lookup of !casefolded directories").
Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now overlayfs falls back to use default file splice read
and write, which is not compatiple with overlayfs, returning
EFAULT. xfstests generic/591 can reproduce part of this.
Tested this patch with xfstests auto group tests.
Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
looped runs can lead to random failure, where scrub finds csum error.
The possibility is not high, around 1/20 to 1/100, but it's causing data
corruption.
The bug is observable after commit b12de52896 ("btrfs: scrub: Don't
check free space before marking a block group RO")
[CAUSE]
Dev-replace has two source of writes:
- Write duplication
All writes to source device will also be duplicated to target device.
Content: Not yet persisted data/meta
- Scrub copy
Dev-replace reused scrub code to iterate through existing extents, and
copy the verified data to target device.
Content: Previously persisted data and metadata
The difference in contents makes the following race possible:
Regular Writer | Dev-replace
-----------------------------------------------------------------
^ |
| Preallocate one data extent |
| at bytenr X, len 1M |
v |
^ Commit transaction |
| Now extent [X, X+1M) is in |
v commit root |
================== Dev replace starts =========================
| ^
| | Scrub extent [X, X+1M)
| | Read [X, X+1M)
| | (The content are mostly garbage
| | since it's preallocated)
^ | v
| Write back happens for |
| extent [X, X+512K) |
| New data writes to both |
| source and target dev. |
v |
| ^
| | Scrub writes back extent [X, X+1M)
| | to target device.
| | This will over write the new data in
| | [X, X+512K)
| v
This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW will never overwrite extents of previous
transaction (in commit root).
This behavior can be confirmed by disabling all fallocate related calls
in fsstress (*), then all related tests can pass a 2000 run loop.
*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
-f collapse=0 -f punch=0 -f resvsp=0"
I didn't expect resvsp ioctl will fallback to fallocate in VFS...
[FIX]
Make dev-replace to require mandatory block group RO, and wait for current
nocow writes before calling scrub_chunk().
This patch will mostly revert commit 76a8efa171 ("btrfs: Continue replace
when set_block_ro failed") for dev-replace path.
The side effect is, dev-replace can be more strict on avaialble space, but
definitely worth to avoid data corruption.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171 ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896 ("btrfs: scrub: Don't check free space before marking a block group RO")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A performance regression was observed since linux v4.19 with aio test using
fio with iodepth 128 on overlayfs. The queue depth of the device was
always 1 which is unexpected.
After investigation, it was found that commit 16914e6fc7 ("ovl: add
ovl_read_iter()") and commit 2a92e07edc ("ovl: add ovl_write_iter()")
resulted in vfs_iter_{read,write} being called on underlying filesystem,
which always results in syncronous IO.
Implement async IO for stacked reading and writing. This resolves the
performance regresion.
This is implemented by allocating a new kiocb for submitting the AIO
request on the underlying filesystem. When the request is completed, the
new kiocb is freed and the completion callback is called on the original
iocb.
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This doesn't cause any behavior changes and will be used by overlay async
IO implementation.
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
On non-samefs overlay without xino, non pure upper inodes should use a
pseudo_dev assigned to each unique lower fs, but if lower layer is on the
same fs and upper layer, it has no pseudo_dev assigned.
In this overlay layers setup:
- two filesystems, A and B
- upper layer is on A
- lower layer 1 is also on A
- lower layer 2 is on B
Non pure upper overlay inode, whose origin is in layer 1 will have the
st_dev;st_ino values of the real lower inode before copy up and the
st_dev;st_ino values of the real upper inode after copy up.
Fix this inconsitency by assigning a unique pseudo_dev also for upper fs,
that will be used as st_dev value along with the lower inode st_dev for
overlay inodes in the case above.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This fixes ovl_lower_uuid_ok() to correctly detect the corner case:
- two filesystems, A and B, both have null uuid
- upper layer is on A
- lower layer 1 is also on A
- lower layer 2 is on B
In this case, bad_uuid would not have been set for B, because the check
only involved the list of lower fs. Hence we'll try to decode a layer 2
origin on layer 1 and fail.
We check for conflicting (and null) uuid among all lower layers, including
those layers that are on the same fs as the upper layer.
Reported-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Rename lower_fs[] array to fs[], extend its size by one and use index fsid
(instead of fsid-1) to access the fs[] array.
Initialize fs[0] with upper fs values. fsid 0 is reserved even with lower
only overlay, so fs[0] remains null in this case.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
No code uses the sb returned from this helper, so make it retrun a boolean
and rename it to ovl_same_fs().
The xino mode is irrelevant when all layers are on same fs, so instead of
describing samefs with mode OVL_XINO_OFF, use a new xino_mode state, which
is 0 in the case of samefs, -1 in the case of xino=off and > 0 with xino
enabled.
Create a new helper ovl_same_dev(), to use instead of the common check for
(ovl_same_fs() || xinobits).
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fs/xfs/xfs_inode.c: In function 'xfs_itruncate_extents_flags':
fs/xfs/xfs_inode.c:1523:8: warning: unused variable 'done' [-Wunused-variable]
commit 4bbb04abb4 ("xfs: truncate should remove
all blocks, not just to the end of the page cache")
left behind this, so remove it.
Fixes: 4bbb04abb4 ("xfs: truncate should remove all blocks, not just to the end of the page cache")
Reported-by: Hulk Robot <hulkci@huawei.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dan Carpenter pointed out that error is uninitialized. While there
never should be an attr leaf block with zero entries, let's not leave
that logic bomb there.
Fixes: 0bb9d159bd ("xfs: streamline xfs_attr3_leaf_inactive")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl4p1+MTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi4YtCACPHyE8aoDTHZF8UZ9bHKNFVt4C1bRx
ihFB6/PzmIfFw4Cbf+yTW85q3zqJ/6eJIOZF4dlwoFWK+osSk8sYRaOvlEovysbR
sYiAbcOxePj9tSPdrWLYB/5ELtwMTloxBo7mPiJYt127UntWlPGfiz4sdHJBt1zI
IBPOIeACJKGe0+Wtj0mGsXk+WhEB3nFk2DINnLuFc4tG6yXkFNq5/fnXrgVTlUTF
4EwDQgHBUIqKDJarSyIBzud6VVshS7VaMAu8h9kwPScN4sG1y4ucgFzXIc4JfqRN
TnEV48hdRQMVuQtsvuzAMPQvsjMlIXUSTGZzs4XPbEBjgAP8+MP+PJvL
=XVg1
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.5-rc8' of https://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for a potential use-after-free from Jeff, marked for stable"
* tag 'ceph-for-5.5-rc8' of https://github.com/ceph/ceph-client:
ceph: hold extra reference to r_parent over life of request
From: Dave Hansen <dave.hansen@linux.intel.com>
MPX is being removed from the kernel due to a lack of support
in the toolchain going forward (gcc).
arch_bprm_mm_init() is used at execve() time. The only non-stub
implementation is on x86 for MPX. Remove the hook entirely from
all architectures and generic code.
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
In commit 9f79b78ef7 ("Convert filldir[64]() from __put_user() to
unsafe_put_user()") I changed filldir to not do individual __put_user()
accesses, but instead use unsafe_put_user() surrounded by the proper
user_access_begin/end() pair.
That make them enormously faster on modern x86, where the STAC/CLAC
games make individual user accesses fairly heavy-weight.
However, the user_access_begin() range was not really the exact right
one, since filldir() has the unfortunate problem that it needs to not
only fill out the new directory entry, it also needs to fix up the
previous one to contain the proper file offset.
It's unfortunate, but the "d_off" field in "struct dirent" is _not_ the
file offset of the directory entry itself - it's the offset of the next
one. So we end up backfilling the offset in the previous entry as we
walk along.
But since x86 didn't really care about the exact range, and used to be
the only architecture that did anything fancy in user_access_begin() to
begin with, the filldir[64]() changes did something lazy, and even
commented on it:
/*
* Note! This range-checks 'previous' (which may be NULL).
* The real range was checked in getdents
*/
if (!user_access_begin(dirent, sizeof(*dirent)))
goto efault;
and it all worked fine.
But now 32-bit ppc is starting to also implement user_access_begin(),
and the fact that we faked the range to only be the (possibly not even
valid) previous directory entry becomes a problem, because ppc32 will
actually be using the range that is passed in for more than just "check
that it's user space".
This is a complete rewrite of Christophe's original patch.
By saving off the record length of the previous entry instead of a
pointer to it in the filldir data structures, we can simplify the range
check and the writing of the previous entry d_off field. No need for
any conditionals in the user accesses themselves, although we retain the
conditional EINTR checking for the "was this the first directory entry"
signal handling latency logic.
Fixes: 9f79b78ef7 ("Convert filldir[64]() from __put_user() to unsafe_put_user()")
Link: https://lore.kernel.org/lkml/a02d3426f93f7eb04960a4d9140902d278cab0bb.1579697910.git.christophe.leroy@c-s.fr/
Link: https://lore.kernel.org/lkml/408c90c4068b00ea8f1c41cca45b84ec23d4946b.1579783936.git.christophe.leroy@c-s.fr/
Reported-and-tested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 8a23eb804c ("Make filldir[64]() verify the directory entry
filename is valid") added some minimal validity checks on the directory
entries passed to filldir[64](). But they really were pretty minimal.
This fleshes out at least the name length check: we used to disallow
zero-length names, but really, negative lengths or oevr-long names
aren't ok either. Both could happen if there is some filesystem
corruption going on.
Now, most filesystems tend to use just an "unsigned char" or similar for
the length of a directory entry name, so even with a corrupt filesystem
you should never see anything odd like that. But since we then use the
name length to create the directory entry record length, let's make sure
it actually is half-way sensible.
Note how POSIX states that the size of a path component is limited by
NAME_MAX, but we actually use PATH_MAX for the check here. That's
because while NAME_MAX is generally the correct maximum name length
(it's 255, for the same old "name length is usually just a byte on
disk"), there's nothing in the VFS layer that really cares.
So the real limitation at a VFS layer is the total pathname length you
can pass as a filename: PATH_MAX.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently f2fs stats are only available from /d/f2fs/status. This patch
adds some of the f2fs stats to sysfs so that they are accessible even
when debugfs is not mounted.
The following sysfs nodes are added:
-/sys/fs/f2fs/<disk>/free_segments
-/sys/fs/f2fs/<disk>/cp_foreground_calls
-/sys/fs/f2fs/<disk>/cp_background_calls
-/sys/fs/f2fs/<disk>/gc_foreground_calls
-/sys/fs/f2fs/<disk>/gc_background_calls
-/sys/fs/f2fs/<disk>/moved_blocks_foreground
-/sys/fs/f2fs/<disk>/moved_blocks_background
-/sys/fs/f2fs/<disk>/avg_vblocks
Signed-off-by: Hridya Valsaraju <hridya@google.com>
[Jaegeuk Kim: allow STAT_FS without DEBUG_FS]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since btrfs was migrated to use the generic VFS helpers for clone and
deduplication, it stopped allowing for the last block of a file to be
deduplicated when the source file size is not sector size aligned (when
eof is somewhere in the middle of the last block). There are two reasons
for that:
1) The generic code always rounds down, to a multiple of the block size,
the range's length for deduplications. This means we end up never
deduplicating the last block when the eof is not block size aligned,
even for the safe case where the destination range's end offset matches
the destination file's size. That rounding down operation is done at
generic_remap_check_len();
2) Because of that, the btrfs specific code does not expect anymore any
non-aligned range length's for deduplication and therefore does not
work if such nona-aligned length is given.
This patch addresses that second part, and it depends on a patch that
fixes generic_remap_check_len(), in the VFS, which was submitted ealier
and has the following subject:
"fs: allow deduplication of eof block into the end of the destination file"
These two patches address reports from users that started seeing lower
deduplication rates due to the last block never being deduplicated when
the file size is not aligned to the filesystem's block size.
Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We always round down, to a multiple of the filesystem's block size, the
length to deduplicate at generic_remap_check_len(). However this is only
needed if an attempt to deduplicate the last block into the middle of the
destination file is requested, since that leads into a corruption if the
length of the source file is not block size aligned. When an attempt to
deduplicate the last block into the end of the destination file is
requested, we should allow it because it is safe to do it - there's no
stale data exposure and we are prepared to compare the data ranges for
a length not aligned to the block (or page) size - in fact we even do
the data compare before adjusting the deduplication length.
After btrfs was updated to use the generic helpers from VFS (by commit
34a28e3d77 ("Btrfs: use generic_remap_file_range_prep() for cloning
and deduplication")) we started to have user reports of deduplication
not reflinking the last block anymore, and whence users getting lower
deduplication scores. The main use case is deduplication of entire
files that have a size not aligned to the block size of the filesystem.
We already allow cloning the last block to the end (and beyond) of the
destination file, so allow for deduplication as well.
Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extents are cached in read_extent_tree_block(); as a result, extents
are not cached for inodes with depth == 0 when we try to find the
extent using ext4_find_extent(). The result of the lookup is cached
in ext4_map_blocks() but is only a subset of the extent on disk. As a
result, the contents of extents status cache can get very badly
fragmented for certain workloads, such as a random 4k read workload.
File size of /mnt/test is 33554432 (8192 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 8191: 40960.. 49151: 8192: last,eof
$ perf record -e 'ext4:ext4_es_*' /root/bin/fio --name=t --direct=0 --rw=randread --bs=4k --filesize=32M --size=32M --filename=/mnt/test
$ perf script | grep ext4_es_insert_extent | head -n 10
fio 131 [000] 13.975421: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [494/1) mapped 41454 status W
fio 131 [000] 13.975939: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6064/1) mapped 47024 status W
fio 131 [000] 13.976467: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6907/1) mapped 47867 status W
fio 131 [000] 13.976937: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3850/1) mapped 44810 status W
fio 131 [000] 13.977440: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3292/1) mapped 44252 status W
fio 131 [000] 13.977931: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6882/1) mapped 47842 status W
fio 131 [000] 13.978376: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3117/1) mapped 44077 status W
fio 131 [000] 13.978957: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [2896/1) mapped 43856 status W
fio 131 [000] 13.979474: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [7479/1) mapped 48439 status W
Fix this by caching the extents for inodes with depth == 0 in
ext4_find_extent().
[ Renamed ext4_es_cache_extents() to ext4_cache_extents() since this
newly added function is not in extents_cache.c, and to avoid
potential visual confusion with ext4_es_cache_extent(). -TYT ]
Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Link: https://lore.kernel.org/r/20191106122502.19986-1-dmonakhov@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Sometimes when running generic/475 we would trip the
WARN_ON(cache->reserved) check when free'ing the block groups on umount.
This is because sometimes we don't commit the transaction because of IO
errors and thus do not cleanup the tree logs until at umount time.
These blocks are still reserved until they are cleaned up, but they
aren't cleaned up until _after_ we do the free block groups work. Fix
this by moving the free after free'ing the fs roots, that way all of the
tree logs are cleaned up and we have a properly cleaned fs. A bunch of
loops of generic/475 confirmed this fixes the problem.
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current code doesn't correctly handle the situation which arises when
a file system that has METADATA_UUID_INCOMPAT flag set and has its FSID
changed to the one in metadata uuid. This causes the incompat flag to
disappear.
In case of a power failure we could end up in a situation where part of
the disks in a multi-disk filesystem are correctly reverted to
METADATA_UUID_INCOMPAT flag unset state, while others have
METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS.
This patch corrects the behavior required to handle the case where a
disk of the second type is scanned first, creating the necessary
btrfs_fs_devices. Subsequently, when a disk which has already completed
the transition is scanned it should overwrite the data in
btrfs_fs_devices.
Reported-by: Su Yue <Damenly_Su@gmx.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is one more cases which isn't handled by the original metadata
uuid work. Namely, when a filesystem has METADATA_UUID incompat bit and
the user decides to change the FSID to the original one e.g. have
metadata_uuid and fsid match. In case of power failure while this
operation is in progress we could end up in a situation where some of
the disks have the incompat bit removed and the other half have both
METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags.
This patch handles the case where a disk that has successfully changed
its FSID such that it equals METADATA_UUID is scanned first.
Subsequently when a disk with both
METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned
find_fsid_changed won't be able to find an appropriate btrfs_fs_devices.
This is done by extending find_fsid_changed to correctly find
btrfs_fs_devices whose metadata_uuid/fsid are the same and they match
the metadata_uuid of the currently scanned device.
Fixes: cc5de4e702 ("btrfs: Handle final split-brain possibility during fsid change")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reported-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
find_fsid became rather hairy with the introduction of metadata uuid
changing feature. Alleviate this by factoring out the metadata uuid
specific code in a dedicated function which deals with finding
correct fsid for a device with changed uuid.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since find_fsid_inprogress should also handle the case in which an fs
didn't change its FSID make it call find_fsid directly. This makes the
code in device_list_add simpler by eliminating a conditional call of
find_fsid. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Recently fsstress (from fstests) sporadically started to trigger an
infinite loop during fsync operations. This turned out to be because
support for the rename exchange and whiteout operations was added to
fsstress in fstests. These operations, unlike any others in fsstress,
cause file names to be reused, whence triggering this issue. However
it's not necessary to use rename exchange and rename whiteout operations
trigger this issue, simple rename operations and file creations are
enough to trigger the issue.
The issue boils down to when we are logging inodes that conflict (that
had the name of any inode we need to log during the fsync operation), we
keep logging them even if they were already logged before, and after
that we check if there's any other inode that conflicts with them and
then add it again to the list of inodes to log. Skipping already logged
inodes fixes the issue.
Consider the following example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir # inode 257
$ touch /mnt/testdir/zz # inode 258
$ ln /mnt/testdir/zz /mnt/testdir/zz_link
$ touch /mnt/testdir/a # inode 259
$ sync
# The following 3 renames achieve the same result as a rename exchange
# operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a).
$ mv /mnt/testdir/a /mnt/testdir/a/tmp
$ mv /mnt/testdir/zz_link /mnt/testdir/a
$ mv /mnt/testdir/a/tmp /mnt/testdir/zz_link
# The following rename and file creation give the same result as a
# rename whiteout operation (<rename_whiteout> zz to a2).
$ mv /mnt/testdir/zz /mnt/testdir/a2
$ touch /mnt/testdir/zz # inode 260
$ xfs_io -c fsync /mnt/testdir/zz
--> results in the infinite loop
The following steps happen:
1) When logging inode 260, we find that its reference named "zz" was
used by inode 258 in the previous transaction (through the commit
root), so inode 258 is added to the list of conflicting indoes that
need to be logged;
2) After logging inode 258, we find that its reference named "a" was
used by inode 259 in the previous transaction, and therefore we add
inode 259 to the list of conflicting inodes to be logged;
3) After logging inode 259, we find that its reference named "zz_link"
was used by inode 258 in the previous transaction - we add inode 258
to the list of conflicting inodes to log, again - we had already
logged it before at step 3. After logging it again, we find again
that inode 259 conflicts with him, and we add again 259 to the list,
etc - we end up repeating all the previous steps.
So fix this by skipping logging of conflicting inodes that were already
logged.
Fixes: 6b5fc433a7 ("Btrfs: fix fsync after succession of renames of different files")
CC: stable@vger.kernel.org # 5.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we abort a transaction we have the following sequence
if (!trans->dirty && list_empty(&trans->new_bgs))
return;
WRITE_ONCE(trans->transaction->aborted, err);
The idea being if we didn't modify anything with our trans handle then
we don't really need to abort the whole transaction, maybe the other
trans handles are fine and we can carry on.
However in the case of create_snapshot we add a pending_snapshot object
to our transaction and then commit the transaction. We don't actually
modify anything. sync() behaves the same way, attach to an existing
transaction and commit it. This means that if we have an IO error in
the right places we could abort the committing transaction with our
trans->dirty being not set and thus not set transaction->aborted.
This is a problem because in the create_snapshot() case we depend on
pending->error being set to something, or btrfs_commit_transaction
returning an error.
If we are not the trans handle that gets to commit the transaction, and
we're waiting on the commit to happen we get our return value from
cur_trans->aborted. If this was not set to anything because sync() hit
an error in the transaction commit before it could modify anything then
cur_trans->aborted would be 0. Thus we'd return 0 from
btrfs_commit_transaction() in create_snapshot.
This is a problem because we then try to do things with
pending_snapshot->snap, which will be NULL because we didn't create the
snapshot, and then we'll get a NULL pointer dereference like the
following
"BUG: kernel NULL pointer dereference, address: 00000000000001f0"
RIP: 0010:btrfs_orphan_cleanup+0x2d/0x330
Call Trace:
? btrfs_mksubvol.isra.31+0x3f2/0x510
btrfs_mksubvol.isra.31+0x4bc/0x510
? __sb_start_write+0xfa/0x200
? mnt_want_write_file+0x24/0x50
btrfs_ioctl_snap_create_transid+0x16c/0x1a0
btrfs_ioctl_snap_create_v2+0x11e/0x1a0
btrfs_ioctl+0x1534/0x2c10
? free_debug_processing+0x262/0x2a3
do_vfs_ioctl+0xa6/0x6b0
? do_sys_open+0x188/0x220
? syscall_trace_enter+0x1f8/0x330
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x4a/0x1b0
In order to fix this we need to make sure anybody who calls
commit_transaction has trans->dirty set so that they properly set the
trans->transaction->aborted value properly so any waiters know bad
things happened.
This was found while I was running generic/475 with my modified
fsstress, it reproduced within a few runs. I ran with this patch all
night and didn't see the problem again.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fsync on a subvolume and create a log root for that volume, and
then later delete that subvolume we'll never clean up its log root. Fix
this by making switch_commit_roots free the log for any dropped roots we
encounter. The extra churn is because we need a btrfs_trans_handle, not
the btrfs_transaction.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
New sysfs attributes that track the filesystem status of devices, stored
in the per-filesystem directory in /sys/fs/btrfs/FSID/devinfo . There's
a directory for each device, with name corresponding to the numerical
device id.
in_fs_metadata - device is in the list of fs metadata
missing - device is missing (no device node or block device)
replace_target - device is target of replace
writeable - writes from fs are allowed
These attributes reflect the state of the device::dev_state and created
at mount time.
Sample output:
$ pwd
/sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo/1/
$ ls
in_fs_metadata missing replace_target writeable
$ cat missing
0
The output from these attributes are 0 or 1. 0 indicates unset and 1
indicates set. These attributes are readonly.
It is observed that the device delete thread and sysfs read thread will
not race because the delete thread calls sysfs kobject_put() which in
turn waits for existing sysfs read to complete.
Note for device replace devid swap:
During the replace the target device temporarily assumes devid 0 before
assigning the devid of the soruce device.
In btrfs_dev_replace_finishing() we remove source sysfs devid using the
function btrfs_sysfs_remove_devices_attr(), so after that call
kobject_rename() to update the devid in the sysfs. This adds and calls
btrfs_sysfs_update_devid() helper function to update the device id.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Move variables to appropriate scope. Remove last BUG_ON in the function
and rework error handling accordingly. Make the duplicate detection code
more straightforward. Use in_range macro. And give variables more
descriptive name by explicitly distinguishing between IO stripe size
(size recorded in the chunk item) and data stripe size (the size of
an actual stripe, constituting a logical chunk/block group).
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add RAID1 and single testcases to verify that data stripes are excluded
from super block locations and that the address mapping is valid.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add basic infrastructure to create and link dummy btrfs_devices. This
will be used in the pending btrfs_rmap_block test which deals with
the block groups.
Calling btrfs_alloc_dummy_device will link the newly created device to
the passed fs_info and the test framework will free them once the test
is finished.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's used only during initial block group reading to map physical
address of super block to a list of logical ones. Make it private to
block-group.c, add proper kernel doc and ensure it's exported only for
tests.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a report where objtool detects unreachable instructions, eg.:
fs/btrfs/ctree.o: warning: objtool: btrfs_search_slot()+0x2d4: unreachable instruction
This seems to be a false positive due to compiler version. The cause is
in the ASSERT macro implementation that does the conditional check as
IS_DEFINED(CONFIG_BTRFS_ASSERT) and not an #ifdef.
To avoid that, use the ifdefs directly.
There are still 2 reports that aren't fixed:
fs/btrfs/extent_io.o: warning: objtool: __set_extent_bit()+0x71f: unreachable instruction
fs/btrfs/relocation.o: warning: objtool: find_data_references()+0x4e0: unreachable instruction
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
When an encrypted directory is listed without the key, the filesystem
must show "no-key names" that uniquely identify directory entries, are
at most 255 (NAME_MAX) bytes long, and don't contain '/' or '\0'.
Currently, for short names the no-key name is the base64 encoding of the
ciphertext filename, while for long names it's the base64 encoding of
the ciphertext filename's dirhash and second-to-last 16-byte block.
This format has the following problems:
- Since it doesn't always include the dirhash, it's incompatible with
directories that will use a secret-keyed dirhash over the plaintext
filenames. In this case, the dirhash won't be computable from the
ciphertext name without the key, so it instead must be retrieved from
the directory entry and always included in the no-key name.
Casefolded encrypted directories will use this type of dirhash.
- It's ambiguous: it's possible to craft two filenames that map to the
same no-key name, since the method used to abbreviate long filenames
doesn't use a proper cryptographic hash function.
Solve both these problems by switching to a new no-key name format that
is the base64 encoding of a variable-length structure that contains the
dirhash, up to 149 bytes of the ciphertext filename, and (if any bytes
remain) the SHA-256 of the remaining bytes of the ciphertext filename.
This ensures that each no-key name contains everything needed to find
the directory entry again, contains only legal characters, doesn't
exceed NAME_MAX, is unambiguous unless there's a SHA-256 collision, and
that we only take the performance hit of SHA-256 on very long filenames.
Note: this change does *not* address the existing issue where users can
modify the 'dirhash' part of a no-key name and the filesystem may still
accept the name.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
[EB: improved comments and commit message, fixed checking return value
of base64_decode(), check for SHA-256 error, continue to set disk_name
for short names to keep matching simpler, and many other cleanups]
Link: https://lore.kernel.org/r/20200120223201.241390-7-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
In order to support a new dirhash method that is a secret-keyed hash
over the plaintext filenames (which will be used by encrypted+casefolded
directories on ext4 and f2fs), fscrypt will be switching to a new no-key
name format that always encodes the dirhash in the name.
UBIFS isn't happy with this because it has assertions that verify that
either the hash or the disk name is provided, not both.
Change it to use the disk name if one is provided, even if a hash is
available too; else use the hash.
Link: https://lore.kernel.org/r/20200120223201.241390-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
If userspace provides an invalid fscrypt no-key filename which encodes a
hash value with any of the UBIFS node type bits set (i.e. the high 3
bits), gracefully report ENOENT rather than triggering ubifs_assert().
Test case with kvm-xfstests shell:
. fs/ubifs/config
. ~/xfstests/common/encrypt
dev=$(__blkdev_to_ubi_volume /dev/vdc)
ubiupdatevol $dev -t
mount $dev /mnt -t ubifs
mkdir /mnt/edir
xfs_io -c set_encpolicy /mnt/edir
rm /mnt/edir/_,,,,,DAAAAAAAAAAAAAAAAAAAAAAAAAA
With the bug, the following assertion fails on the 'rm' command:
[ 19.066048] UBIFS error (ubi0:0 pid 379): ubifs_assert_failed: UBIFS assert failed: !(hash & ~UBIFS_S_KEY_HASH_MASK), in fs/ubifs/key.h:170
Fixes: f4f61d2cc6 ("ubifs: Implement encrypted filenames")
Cc: <stable@vger.kernel.org> # v4.10+
Link: https://lore.kernel.org/r/20200120223201.241390-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Now that there's sometimes a second type of per-file key (the dirhash
key), clarify some function names, macros, and documentation that
specifically deal with per-file *encryption* keys.
Link: https://lore.kernel.org/r/20200120223201.241390-4-ebiggers@kernel.org
Reviewed-by: Daniel Rosenberg <drosen@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
When we allow indexed directories to use both encryption and
casefolding, for the dirhash we can't just hash the ciphertext filenames
that are stored on-disk (as is done currently) because the dirhash must
be case insensitive, but the stored names are case-preserving. Nor can
we hash the plaintext names with an unkeyed hash (or a hash keyed with a
value stored on-disk like ext4's s_hash_seed), since that would leak
information about the names that encryption is meant to protect.
Instead, if we can accept a dirhash that's only computable when the
fscrypt key is available, we can hash the plaintext names with a keyed
hash using a secret key derived from the directory's fscrypt master key.
We'll use SipHash-2-4 for this purpose.
Prepare for this by deriving a SipHash key for each casefolded encrypted
directory. Make sure to handle deriving the key not only when setting
up the directory's fscrypt_info, but also in the case where the casefold
flag is enabled after the fscrypt_info was already set up. (We could
just always derive the key regardless of casefolding, but that would
introduce unnecessary overhead for people not using casefolding.)
Signed-off-by: Daniel Rosenberg <drosen@google.com>
[EB: improved commit message, updated fscrypt.rst, squashed with change
that avoids unnecessarily deriving the key, and many other cleanups]
Link: https://lore.kernel.org/r/20200120223201.241390-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Casefolded encrypted directories will use a new dirhash method that
requires a secret key. If the directory uses a v2 encryption policy,
it's easy to derive this key from the master key using HKDF. However,
v1 encryption policies don't provide a way to derive additional keys.
Therefore, don't allow casefolding on directories that use a v1 policy.
Specifically, make it so that trying to enable casefolding on a
directory that has a v1 policy fails, trying to set a v1 policy on a
casefolded directory fails, and trying to open a casefolded directory
that has a v1 policy (if one somehow exists on-disk) fails.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
[EB: improved commit message, updated fscrypt.rst, and other cleanups]
Link: https://lore.kernel.org/r/20200120223201.241390-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
fname_encrypt() is a global function, due to being used in both fname.c
and hooks.c. So it should be prefixed with "fscrypt_", like all the
other global functions in fs/crypto/.
Link: https://lore.kernel.org/r/20200120071736.45915-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
When an encryption key can't be fully removed due to file(s) protected
by it still being in-use, we shouldn't really print the path to one of
these files to the kernel log, since parts of this path are likely to be
encrypted on-disk, and (depending on how the system is set up) the
confidentiality of this path might be lost by printing it to the log.
This is a trade-off: a single file path often doesn't matter at all,
especially if it's a directory; the kernel log might still be protected
in some way; and I had originally hoped that any "inode(s) still busy"
bugs (which are security weaknesses in their own right) would be quickly
fixed and that to do so it would be super helpful to always know the
file path and not have to run 'find dir -inum $inum' after the fact.
But in practice, these bugs can be hard to fix (e.g. due to asynchronous
process killing that is difficult to eliminate, for performance
reasons), and also not tied to specific files, so knowing a file path
doesn't necessarily help.
So to be safe, for now let's just show the inode number, not the path.
If someone really wants to know a path they can use 'find -inum'.
Fixes: b1c0ec3599 ("fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl")
Cc: <stable@vger.kernel.org> # v5.4+
Link: https://lore.kernel.org/r/20200120060732.390362-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
When doing an unstable write, we need to ensure that we sample the
write verifier before releasing the lock, and allowing a commit to
the same file to proceed.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When we have a successful commit, ensure we sample the commit verifier
before releasing the lock.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Ensure that we can distinguish between synchronous CLONE and
WRITE errors, and that we can assign them correctly.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Needed in order to fix exclusion w.r.t. writes.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We don't know if the error returned by the fsync() call is
exclusive to the data written by the stable write, so play it
safe.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Strictly speaking, a stable write error needs to reflect the
write + the commit of that write (and only that write). To
ensure that we don't pick up the write errors from other
writebacks, add a rw_semaphore to provide exclusion.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Needed in order to fix stable writes.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If nfsd_file_mark_find_or_create() keeps winning the race for the
nfsd_file_fsnotify_group->mark_mutex against nfsd_file_mark_put()
then it can soft lock up, since fsnotify_add_inode_mark() ends
up always finding an existing entry.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Don't call nfsd_file_gc() on every put of the reference in nfsd_file_put().
Instead, do it only when we're expecting the refcount to go to 1.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Emsure we schedule the laundrette even if the struct file is carrying
file errors.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Ensure that if the filecache laundrette gets stuck, it only affects
the knfsd instances of one container.
The notifier callbacks can be called from various contexts so avoid
using synchonous filesystem operations that might deadlock.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If the lookup keeps finding a nfsd_file with an unhashed open file,
then retry once only.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org
Fixes: 65294c1f2c "nfsd: add a new struct file caching facility to nfsd"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
REQ_F_FORCE_ASYNC is checked only for the head of a link. Fix it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Whenever IOSQE_ASYNC is set, requests will be punted to async without
getting into io_issue_req() and without proper preparation done (e.g.
io_req_defer_prep()). Hence they will be left uninitialised.
Prepare them before punting.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rename lower_layers[] array to layers[], extend its size by one and
initialize layers[0] with upper layer values. Lower layers are now
addressed with index 1..numlower. layers[0] is reserved even with lower
only overlay.
[SzM: replace ofs->numlower with ofs->numlayer, the latter's value is
incremented by one]
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Current copy-up is not efficient for big sparse file,
It's not only slow but also wasting more disk space
when the target lower file has huge hole inside.
This patch tries to recognize file hole and skip it
during copy-up.
Detail logic of hole detection as below:
When we detect next data position is larger than current
position we will skip that hole, otherwise we copy
data in the size of OVL_COPY_UP_CHUNK_SIZE. Actually,
it may not recognize all kind of holes and sometimes
only skips partial of hole area. However, it will be
enough for most of the use cases.
Additionally, this optimization relies on lseek(2)
SEEK_DATA implementation, so for some specific
filesystems which do not support this feature
will behave as before on copy-up.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In ovl_llseek() we use the overlay inode rwsem to protect against
concurrent modifications to real file f_pos, because we copy the overlay
file f_pos to/from the real file f_pos.
This caused a lockdep warning of locking order violation when the
ovl_llseek() operation was called on a lower nested overlay layer while the
upper layer fs sb_writers is held (with patch improving copy-up efficiency
for big sparse file).
Use the internal ovl_inode_lock() instead of the overlay inode rwsem in
those cases. It is meant to be used for protecting against concurrent
changes to overlay inode internal state changes.
The locking order rules are documented to explain this case.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The WARN_ON() that child entry is always on overlay st_dev became wrong
when we allowed this function to update d_ino in non-samefs setup with xino
enabled.
It is not true in case of xino bits overflow on a non-dir inode. Leave the
WARN_ON() only for directories, where assertion is still true.
Fixes: adbf4f7ea8 ("ovl: consistent d_ino for non-samefs with xino")
Cc: <stable@vger.kernel.org> # v4.17+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4obx8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqNwD/99ae+ezi7LSVj9zQml7y/6ZSV4D3wzD9PJ
7QsUq5kGA0tisZ9q/rd0eja4Fy2Dw/qhX+GXgTYLt9+a66rp0CskWaD9NWFMtFGp
eBgitruw5SqFl8GfNCjd6NB/Af3NGyrmQSPV58K7mma6zQX7ELCrEdipCKj5QpNk
eHO0enZA1KcPliegAbDQfhz7U9frns9nSs0VHB599X9jr5pi8PPejukVEBwK67o6
Dh52CDqjeKksX8PWxhXau9j8DNt85Zs8ocRFvgWD8/UQSLcHAM6DFLdkeGEHQu9H
QouW9JRFzmTksy3KvPnCcCPdsYQrqVmj6fCCg6AXW3yOzcI1IvhO+hcNMQtkLFWI
5JKYZkFhGjCsypmkYpB+5mqcz+fsbkfgN9clU1tvPK8FcmpLolsIQrUJdaRe6r58
odDe9Qs+I46LAYKttkkAlpYg1E9CD0T7g1ENXzcqb5t6fZTW4oU0Wpqen788WQqz
EQqp30kU0FgnFAW8BUpJK5iwrrm3RrS+Br6lhk33BeA423Pt6n3RnXYFVvtAHeuA
jyUVqiMKexi+7fCC2LO1M9xofQMmr6z2nVkZNhDLIr4y9uxD4xTyiaEAFjk6Lws6
lcSWZMHQPKaCqfxhAtnoVZP96k6zMwfEJUb+fANX9SI0+3p9LHFz2Kp/AOs6GvJC
/A5vCFjLWw==
=oNaB
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.5-2020-01-22' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"This was supposed to have gone in last week, but due to a brain fart
on my part, I forgot that we made this struct addition in the 5.5
cycle. So here it is for 5.5, to prevent having a 32 vs 64-bit
compatability issue with the files_update command"
* tag 'io_uring-5.5-2020-01-22' of git://git.kernel.dk/linux-block:
io_uring: fix compat for IORING_REGISTER_FILES_UPDATE
Currently, we just assume that it will stick around by virtue of the
submitter's reference, but later patches will allow the syscall to
return early and we can't rely on that reference at that point.
While I'm not aware of any reports of it, Xiubo pointed out that this
may fix a use-after-free. If the wait for a reply times out or is
canceled via signal, and then the reply comes in after the syscall
returns, the client can end up trying to access r_parent without a
reference.
Take an extra reference to the inode when setting r_parent and release
it when releasing the request.
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
these macros are never used from introduced. better to
remove them.
Link: https://lore.kernel.org/r/1579602338-57079-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Bharath Vedartham <linux.bhar@gmail.com>
Cc: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: zhengbin <zhengbin13@huawei.com>
Cc: Jia-Ju Bai <baijiaju1990@gmail.com>
Cc: reiserfs-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Since commit 223b2b889f ("GFS2: Fix alignment issue and tidy
gfs2_bitfit"), these 3 macros aren't used anymore, so remove them.
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Since commit 1579343a73 ("GFS2: Remove dirent_first() function"),
these macros aren't used any more, so remove them.
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>