We don't use this anywhere inside of shrink_delalloc since 17024ad0a0
("Btrfs: fix early ENOSPC due to delalloc"), remove it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have btrfs_wait_ordered_roots() which takes a u64 for nr, but
btrfs_start_delalloc_roots() that takes an int for nr, which makes using
them in conjunction, especially for something like (u64)-1, annoying and
inconsistent. Fix btrfs_start_delalloc_roots() to take a u64 for nr and
adjust start_delalloc_inodes() and it's callers appropriately.
This means we've adjusted start_delalloc_inodes() to take a pointer of
nr since we want to preserve the ability for start-delalloc_inodes() to
return an error, so simply make it do the nr adjusting as necessary.
Part of adjusting the callers to this means changing
btrfs_writeback_inodes_sb_nr() to take a u64 for items. This may be
confusing because it seems unrelated, but the caller of
btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the
function variable that needs to be changed.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be accessed from 'fs_devices' as it's identical to
fs_info->fs_devices. Also add a comment about why we are calling the
function. No semantic changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
That BUG_ON cannot ever trigger because as the comment there states -
'err' is always set. Simply remove it as it brings no value.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Delete repeated words in fs/btrfs/.
{to, the, a, and old}
and change "into 2 part" to "into 2 parts".
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The current trace event always output result like this:
find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA)
find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA)
find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA)
find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA)
find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA)
find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA)
T's saying we're allocating data extent for EXTENT tree, which is not
even possible.
It's because we always use EXTENT tree as the owner for
trace_find_free_extent() without using the @root from
btrfs_reserve_extent().
This patch will change the parameter to use proper @root for
trace_find_free_extent():
Now it looks much better:
find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA)
find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=1(DATA)
find_free_extent: root=5(FS_TREE) len=4096 empty_size=0 flags=1(DATA)
find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA)
find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
find_free_extent: root=7(CSUM_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
find_free_extent: root=1(ROOT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
Reported-by: Hans van Kranenburg <hans@knorrie.org>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix missing result check of exfat_build_inode().
And use PTR_ERR_OR_ZERO instead of PTR_ERR.
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Tetsuo Handa reports that splice() can return 0 before the real EOF, if
the data in the splice source pipe is an empty pipe buffer. That empty
pipe buffer case doesn't happen in any normal situation, but you can
trigger it by doing a write to a pipe that fails due to a page fault.
Tetsuo has a test-case to show the behavior:
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
const int fd = open("/tmp/testfile", O_WRONLY | O_CREAT, 0600);
int pipe_fd[2] = { -1, -1 };
pipe(pipe_fd);
write(pipe_fd[1], NULL, 4096);
/* This splice() should wait unless interrupted. */
return !splice(pipe_fd[0], NULL, fd, NULL, 65536, 0);
}
which results in
write(5, NULL, 4096) = -1 EFAULT (Bad address)
splice(4, NULL, 3, NULL, 65536, 0) = 0
and this can confuse splice() users into believing they have hit EOF
prematurely.
The issue was introduced when the pipe write code started pre-allocating
the pipe buffers before copying data from user space.
This is modified verion of Tetsuo's original patch.
Fixes: a194dfe6e6 ("pipe: Rearrange sequence in pipe_write() to preallocate slot")
Link:https://lore.kernel.org/linux-fsdevel/20201005121339.4063-1-penguin-kernel@I-love.SAKURA.ne.jp/
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are generating incorrect path in case of rename retry because
we are restarting from wrong dentry. We should restart from the
dentry which was received in the call to nfs_path.
CC: stable@vger.kernel.org
Signed-off-by: Ashish Sangwan <ashishsangwan2@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Canonalize to ioctl FS_* flags instead of inode S_* flags.
Note that we do not call the helper vfs_ioc_fssetxattr_check()
for FS_IOC_FSSETXATTR ioctl. The reason is that underlying filesystem
will perform all the checks. We only need to perform the capability
check before overriding credentials.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
[S|G]ETFLAGS and FS[S|G]ETXATTR ioctls are applicable to both files and
directories, so add ioctl operations to dir as well.
We teach ovl_real_fdget() to get the realfile of directories which use
a different type of file->private_data.
Ifdef away compat ioctl implementation to conform to standard practice.
With this change, xfstest generic/079 which tests these ioctls on files
and directories passes.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Rejecting non-native endian BTF overlapped with the addition
of support for it.
The rest were more simple overlapping changes, except the
renesas ravb binding update, which had to follow a file
move as well as a YAML conversion.
Signed-off-by: David S. Miller <davem@davemloft.net>
All remaining callers of bdget() outside of fs/block_dev.c want to get a
reference to the struct block_device for a given struct hd_struct. Add
a helper just for that and then mark bdget static.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To perform partial reads, callers of kernel_read_file*() must have a
non-NULL file_size argument and a preallocated buffer. The new "offset"
argument can then be used to seek to specific locations in the file to
fill the buffer to, at most, "buf_size" per call.
Where possible, the LSM hooks can report whether a full file has been
read or not so that the contents can be reasoned about.
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201002173828.2099543-14-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As with the kernel_load_data LSM hook, add a "contents" flag to the
kernel_read_file LSM hook that indicates whether the LSM can expect
a matching call to the kernel_post_read_file LSM hook with the full
contents of the file. With the coming addition of partial file read
support for kernel_read_file*() API, the LSM will no longer be able
to always see the entire contents of a file during the read calls.
For cases where the LSM must read examine the complete file contents,
it will need to do so on its own every time the kernel_read_file
hook is called with contents=false (or reject such cases). Adjust all
existing LSMs to retain existing behavior.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Link: https://lore.kernel.org/r/20201002173828.2099543-12-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In preparation for adding partial read support, add an optional output
argument to kernel_read_file*() that reports the file size so callers
can reason more easily about their reading progress.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Link: https://lore.kernel.org/r/20201002173828.2099543-8-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In preparation for further refactoring of kernel_read_file*(), rename
the "max_size" argument to the more accurate "buf_size", and correct
its type to size_t. Add kerndoc to explain the specifics of how the
arguments will be used. Note that with buf_size now size_t, it can no
longer be negative (and was never called with a negative value). Adjust
callers to use it as a "maximum size" when *buf is NULL.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Link: https://lore.kernel.org/r/20201002173828.2099543-7-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In preparation for refactoring kernel_read_file*(), remove the redundant
"size" argument which is not needed: it can be included in the return
code, with callers adjusted. (VFS reads already cannot be larger than
INT_MAX.)
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Link: https://lore.kernel.org/r/20201002173828.2099543-6-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
These routines are used in places outside of exec(2), so in preparation
for refactoring them, move them into a separate source file,
fs/kernel_read_file.c.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Link: https://lore.kernel.org/r/20201002173828.2099543-5-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Move kernel_read_file* out of linux/fs.h to its own linux/kernel_read_file.h
include file. That header gets pulled in just about everywhere
and doesn't really need functions not related to the general fs interface.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Scott Branden <scott.branden@broadcom.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/r/20200706232309.12010-2-scott.branden@broadcom.com
Link: https://lore.kernel.org/r/20201002173828.2099543-4-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
FIRMWARE_PREALLOC_BUFFER is a "how", not a "what", and confuses the LSMs
that are interested in filtering between types of things. The "how"
should be an internal detail made uninteresting to the LSMs.
Fixes: a098ecd2fa ("firmware: support loading into a pre-allocated buffer")
Fixes: fd90bc559b ("ima: based on policy verify firmware signatures (pre-allocated buffer)")
Fixes: 4f0496d8ff ("ima: based on policy warn about loading firmware (pre-allocated buffer)")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201002173828.2099543-2-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Now that import_iovec handles compat iovecs, the native vmsplice syscall
can be used for the compat case as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that import_iovec handles compat iovecs, the native readv and writev
syscalls can be used for the compat case as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that import_iovec handles compat iovecs as well, all the duplicated
code in the compat readv/writev helpers is not needed. Remove them
and switch the compat syscall handlers to use the native helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use in compat_syscall to import either native or the compat iovecs, and
remove the now superflous compat_import_iovec.
This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Bulk of the genetlink users can use smaller ops, move them.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull epoll fixes from Al Viro:
"Several race fixes in epoll"
* 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ep_create_wakeup_source(): dentry name can change under you...
epoll: EPOLL_CTL_ADD: close the race in decision to take fast path
epoll: replace ->visited/visited_list with generation count
epoll: do not insert into poll queues until all sanity checks are done
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl93REAACgkQxWXV+ddt
WDv0/A//XYr1XLC/5sMILHqYZ4ogiFxC3Nfjeyt6vfBPX3J0d2eHnw5Rw+ZHHHdQ
qtoKWom9ZwCxjybghwmvfxJuohy+6Sc764aEj+rYpUcCmmUZsAZZpmwpZqpYG+0H
DEn9p45T0MO+r5lsF/GdNqqsdXZfUlZy7PweIhZucQxENM8cowklqKCo4AU2IEW4
203THU3UxQayn0um6kaiesioh8TtT+R9UVAyyA3n6lGINHKG8AMy0ulS/M2Uzgq5
eAzWne4Opy+wLxubBdeqruPiQrFQp+JV/YhTTEHGKRXykRYXwZnCDYdK27X4UKkt
g3Ne0cEd/JuxZfb3Mzsd7+MF0xr9xKJPziFXv7YZt0LkiHE+B0b/DwA9FksR9sdO
4BY2oe0gztstIMqQ5qnriJMDQxonyUt2G65YW8sCI9b32vRYaHLhCWZRYzbmftEO
W4FJOnAI2It3Ib0CUkBjkPYkmH113Q6g59k015IpoYRGmExhnC59zhuijdmthxFJ
S5PXFymVhxt9iMOKM0jE17Rp/j4hVg/bdFVHJryzlOsldjq63Vukqoo24SQhiqfY
qYn/Ilkc/h1YD/pxehFAhZcbGfEdjD5oo8OkGoKIUXfv35r7JH/5F/x+4DxZNnYk
n0oHJ7WBR01AlHAcuTvsN7z9O2ZX6wZufkkgKYLBvtGtyC71T3A=
=MT2i
-----END PGP SIGNATURE-----
Merge tag 'for-5.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two more fixes.
One is for a lockdep warning/lockup (also caught by syzbot), that one
has been seen in practice. Regarding the other syzbot reports
mentioned last time, they don't seem to be urgent and reliably
reproducible so they'll be fixed later.
The second fix is for a potential corruption when device replace
finishes and the in-memory state of trim is not copied to the new
device"
* tag 'for-5.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix filesystem corruption after a device replace
btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks
btrfs: move btrfs_scratch_superblocks into btrfs_dev_replace_finishing
Refactor: Handle this NFS version-specific mapping in the only
place where nfserr_wrongsec is generated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Refactor: I'm about to change the return value from .pc_func. Clear
the way by replacing the RETURN_STATUS() macro with logic that
plants the status code directly into the response structure.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Remove special dispatcher logic for NFSv2 error responses. These are
rare to the point of becoming extinct, but all NFS responses have to
pay the cost of the extra conditional branches.
With this change, the NFSv2 error cases now get proper
xdr_ressize_check() calls.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
nfsd_release_fhandle() assumes that rqstp->rq_resp always points to
an nfsd_fhandle struct. In fact, no NFSv2 procedure uses struct
nfsd_fhandle as its response structure.
So far that has been "safe" to do because the res structs put the
resp->fh field at that same offset as struct nfsd_fhandle. I don't
think that's a guarantee, though, and there is certainly nothing
preventing a developer from altering the fields in those structures.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
nfsd_dispatch() is a hot path. Ensure the compiler takes the
processing of rare error cases out of line.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
For consistency and code legibility, use a similar organization of
variables as svc_generic_dispatch().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Add a documenting comment for the function. Remove comments that
simply describe obvious aspects of the code, but leave comments
that explain the differences in processing of each NFS version.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reorder the arms so the compiler places checks for the most frequent
case first.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
nfsd_dispatch() is a hot path. Let's optimize the XDR method calls
for the by-far common case, which is that the XDR methods are indeed
present.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Follow-up on ten-year-old commit b9081d90f5 ("NFS: kill
off complicated macro 'PROC'") by performing the same conversion in
the NFSACL code. To reduce the chance of error, I copied the original
C preprocessor output and then made some minor edits.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Follow-up on ten-year-old commit b9081d90f5 ("NFS: kill
off complicated macro 'PROC'") by performing the same conversion in
the lockd code. To reduce the chance of error, I copied the original
C preprocessor output and then made some minor edits.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There's no protection in nfsd_dispatch() against a NULL .pc_func
helpers. A malicious NFS client can trigger a crash by invoking the
unused/unsupported NFSv2 ROOT or WRITECACHE procedures.
The current NFSD dispatcher does not support returning a void reply
to a non-NULL procedure, so the reply to both of these is wrong, for
the moment.
Cc: <stable@vger.kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The list_lru_count() returns the pre node count, but the new xattr
shrinkers are memcg aware, so the shrinkers should return per memcg
count by calling list_lru_shrink_count() instead. Otherwise over-shrink
might be experienced. The problem was spotted by visual code
inspection.
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Since commit 0e0cb35b41 ("NFSv4: Handle NFS4ERR_OLD_STATEID in
CLOSE/OPEN_DOWNGRADE") the following livelock may occur if a CLOSE races
with the update of the nfs_state:
Process 1 Process 2 Server
========= ========= ========
OPEN file
OPEN file
Reply OPEN (1)
Reply OPEN (2)
Update state (1)
CLOSE file (1)
Reply OLD_STATEID (1)
CLOSE file (2)
Reply CLOSE (-1)
Update state (2)
wait for state change
OPEN file
wake
CLOSE file
OPEN file
wake
CLOSE file
...
...
We can avoid this situation by not issuing an immediate retry with a bumped
seqid when CLOSE/OPEN_DOWNGRADE receives NFS4ERR_OLD_STATEID. Instead,
take the same approach used by OPEN and wait at least 5 seconds for
outstanding stateid updates to complete if we can detect that we're out of
sequence.
Note that after this change it is still possible (though unlikely) that
CLOSE waits a full 5 seconds, bumps the seqid, and retries -- and that
attempt races with another OPEN at the same time. In order to avoid this
race (which would result in the livelock), update
nfs_need_update_open_stateid() to handle the case where:
- the state is NFS_OPEN_STATE, and
- the stateid doesn't match the current open stateid
Finally, nfs_need_update_open_stateid() is modified to be idempotent and
renamed to better suit the purpose of signaling that the stateid passed
is the next stateid in sequence.
Fixes: 0e0cb35b41 ("NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There is no case after the default from which to fallthrough to. Clang
will error in this case (unhelpfully without context, see link below)
and GCC will with -Wswitch-unreachable.
The previous commit should have just replaced the comment with a break
statement.
If we consider implicit fallthrough to be a design mistake of C, then
all case statements should be terminated with one of the following
statements:
* break
* continue
* return
* fallthrough
* goto
* (call of function with __attribute__(__noreturn__))
Fixes: 2a1390c95a69 ("nfs: Convert to use the preferred fallthrough macro")
Link: https://bugs.llvm.org/show_bug.cgi?id=47539
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Output defects can exist in sysfs content using sprintf and snprintf.
sprintf does not know the PAGE_SIZE maximum of the temporary buffer
used for outputting sysfs content and it's possible to overrun the
PAGE_SIZE buffer length.
Add a generic sysfs_emit function that knows that the size of the
temporary buffer and ensures that no overrun is done.
Add a generic sysfs_emit_at function that can be used in multiple
call situations that also ensures that no overrun is done.
Validate the output buffer argument to be page aligned.
Validate the offset len argument to be within the PAGE_SIZE buf.
Signed-off-by: Joe Perches <joe@perches.com>
Link: https://lore.kernel.org/r/884235202216d464d61ee975f7465332c86f76b2.1600285923.git.joe@perches.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The pipe splice code still used the old model of waiting for pipe IO by
using a non-specific "pipe_wait()" that waited for any pipe event to
happen, which depended on all pipe IO being entirely serialized by the
pipe lock. So by checking the state you were waiting for, and then
adding yourself to the wait queue before dropping the lock, you were
guaranteed to see all the wakeups.
Strictly speaking, the actual wakeups were not done under the lock, but
the pipe_wait() model still worked, because since the waiter held the
lock when checking whether it should sleep, it would always see the
current state, and the wakeup was always done after updating the state.
However, commit 0ddad21d3e ("pipe: use exclusive waits when reading or
writing") split the single wait-queue into two, and in the process also
made the "wait for event" code wait for _two_ wait queues, and that then
showed a race with the wakers that were not serialized by the pipe lock.
It's only splice that used that "pipe_wait()" model, so the problem
wasn't obvious, but Josef Bacik reports:
"I hit a hang with fstest btrfs/187, which does a btrfs send into
/dev/null. This works by creating a pipe, the write side is given to
the kernel to write into, and the read side is handed to a thread that
splices into a file, in this case /dev/null.
The box that was hung had the write side stuck here [pipe_write] and
the read side stuck here [splice_from_pipe_next -> pipe_wait].
[ more details about pipe_wait() scenario ]
The problem is we're doing the prepare_to_wait, which sets our state
each time, however we can be woken up either with reads or writes. In
the case above we race with the WRITER waking us up, and re-set our
state to INTERRUPTIBLE, and thus never break out of schedule"
Josef had a patch that avoided the issue in pipe_wait() by just making
it set the state only once, but the deeper problem is that pipe_wait()
depends on a level of synchonization by the pipe mutex that it really
shouldn't. And the whole "wait for any pipe state change" model really
isn't very good to begin with.
So rather than trying to work around things in pipe_wait(), remove that
legacy model of "wait for arbitrary pipe event" entirely, and actually
create functions that wait for the pipe actually being readable or
writable, and can do so without depending on the pipe lock serializing
everything.
Fixes: 0ddad21d3e ("pipe: use exclusive waits when reading or writing")
Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes a race in nodeid2con in cases that we parallel running
a lookup and both will create a connection structure for the same nodeid.
It's a rare case to create a new connection structure to keep reader
lockless we just do a lookup inside the protection area again and drop
previous work if this race happens.
Fixes: a47666eb76 ("fs: dlm: make connection hash lockless")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Calling pipe2() with O_NOTIFICATION_PIPE could results in memory
leaks unless watch_queue_init() is successful.
In case of watch_queue_init() failure in pipe2() we are left
with inode and pipe_inode_info instances that need to be freed. That
failure exit has been introduced in commit c73be61ced ("pipe: Add
general notification queue support") and its handling should've been
identical to nearby treatment of alloc_file_pseudo() failures - it
is dealing with the same situation. As it is, the mainline kernel
leaks in that case.
Another problem is that CONFIG_WATCH_QUEUE and !CONFIG_WATCH_QUEUE
cases are treated differently (and the former leaks just pipe_inode_info,
the latter - both pipe_inode_info and inode).
Fixed by providing a dummy wacth_queue_init() in !CONFIG_WATCH_QUEUE
case and by having failures of wacth_queue_init() handled the same way
we handle alloc_file_pseudo() ones.
Fixes: c73be61ced ("pipe: Add general notification queue support")
Signed-off-by: Qian Cai <cai@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With suitably crafted reiserfs image and mount command reiserfs will
crash when trying to verify that XATTR_ROOT directory can be looked up
in / as that recurses back to xattr code like:
xattr_lookup+0x24/0x280 fs/reiserfs/xattr.c:395
reiserfs_xattr_get+0x89/0x540 fs/reiserfs/xattr.c:677
reiserfs_get_acl+0x63/0x690 fs/reiserfs/xattr_acl.c:209
get_acl+0x152/0x2e0 fs/posix_acl.c:141
check_acl fs/namei.c:277 [inline]
acl_permission_check fs/namei.c:309 [inline]
generic_permission+0x2ba/0x550 fs/namei.c:353
do_inode_permission fs/namei.c:398 [inline]
inode_permission+0x234/0x4a0 fs/namei.c:463
lookup_one_len+0xa6/0x200 fs/namei.c:2557
reiserfs_lookup_privroot+0x85/0x1e0 fs/reiserfs/xattr.c:972
reiserfs_fill_super+0x2b51/0x3240 fs/reiserfs/super.c:2176
mount_bdev+0x24f/0x360 fs/super.c:1417
Fix the problem by bailing from reiserfs_xattr_get() when xattrs are not
yet initialized.
CC: stable@vger.kernel.org
Reported-by: syzbot+9b33c9b118d77ff59b6f@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
All request preparations are done only during submission, reflect it in
the code by moving io_req_prep() much earlier into io_queue_sqe().
That's much cleaner, because it doen't expose bits to async code which
it won't ever use. Also it makes the interface harder to misuse, and
there are potential places for bugs.
For instance, __io_queue() doesn't clear @sqe before proceeding to a
next linked request, that could have been disastrous, but hopefully
there are linked requests IFF sqe==NULL, so not actually a bug.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_issue_sqe() does two things at once, trying to prepare request and
issuing them. Split it in two and deduplicate with io_defer_prep().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All io_*_prep() functions including io_{read,write}_prep() are called
only during submission where @force_nonblock is always true. Don't keep
propagating it and instead remove the @force_nonblock argument
from prep() altogether.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move setting IOCB_NOWAIT from io_prep_rw() into io_read()/io_write(), so
it's set/cleared in a single place. Also remove @force_nonblock
parameter from io_prep_rw().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
REQ_F_NEED_CLEANUP is set only by io_*_prep() and they're guaranteed to
be called only once, so there is no one who may have set the flag
before. Kill REQ_F_NEED_CLEANUP check in these *prep() handlers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Put brackets around bitwise ops in a complex expression
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Extract common code from if/else branches. That is cleaner and optimised
even better.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The smart syzbot has found a reproducer for the following issue:
==================================================================
BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline]
BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771
CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
check_memory_region_inline mm/kasan/generic.c:186 [inline]
check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
instrument_atomic_write include/linux/instrumented.h:71 [inline]
atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
io_wqe_inc_running fs/io-wq.c:301 [inline]
io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
schedule_timeout+0x148/0x250 kernel/time/timer.c:1879
io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
Allocated by task 7768:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594
kmalloc_node include/linux/slab.h:572 [inline]
kzalloc_node include/linux/slab.h:677 [inline]
io_wq_create+0x57b/0xa10 fs/io-wq.c:1064
io_init_wq_offload fs/io_uring.c:7432 [inline]
io_sq_offload_start fs/io_uring.c:7504 [inline]
io_uring_create fs/io_uring.c:8625 [inline]
io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 21:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
__kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
__cache_free mm/slab.c:3418 [inline]
kfree+0x10e/0x2b0 mm/slab.c:3756
__io_wq_destroy fs/io-wq.c:1138 [inline]
io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146
io_finish_async fs/io_uring.c:6836 [inline]
io_ring_ctx_free fs/io_uring.c:7870 [inline]
io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954
process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
The buggy address belongs to the object at ffff8882183db000
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 140 bytes inside of
1024-byte region [ffff8882183db000, ffff8882183db400)
The buggy address belongs to the page:
page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db
flags: 0x57ffe0000000200(slab)
raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700
raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
which is down to the comment below,
/* all workers gone, wq exit can proceed */
if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
complete(&wqe->wq->done);
because there might be multiple cases of wqe in a wq and we would wait
for every worker in every wqe to go home before releasing wq's resources
on destroying.
To that end, rework wq's refcount by making it independent of the tracking
of workers because after all they are two different things, and keeping
it balanced when workers come and go. Note the manager kthread, like
other workers, now holds a grab to wq during its lifetime.
Finally to help destroy wq, check IO_WQ_BIT_EXIT upon creating worker
and do nothing for exiting wq.
Cc: stable@vger.kernel.org # v5.5+
Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com
Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In most cases we'll specify IORING_SETUP_SQPOLL and run multiple
io_uring instances in a host. Since all sqthreads are named
"io_uring-sq", it's hard to distinguish the relations between
application process and its io_uring sqthread.
With this patch, application can get its corresponding sqthread pid
and cpu through show_fdinfo.
Steps:
1. Get io_uring fd first.
$ ls -l /proc/<pid>/fd | grep -w io_uring
2. Then get io_uring instance related info, including corresponding
sqthread pid and cpu.
$ cat /proc/<pid>/fdinfo/<io_uring_fd>
pos: 0
flags: 02000002
mnt_id: 13
SqThread: 6929
SqThreadCpu: 2
UserFiles: 1
0: testfile
UserBufs: 0
PollList:
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
[axboe: fixed for new shared SQPOLL]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We do this for CQ ring wait, in case task_work completions come in. We
should do the same in io_uring_register(), to avoid spurious -EINTR
if the ring quiescing ends up having to process task_work to complete
the operation
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are a few operations that are offloaded to the worker threads. In
this case, we lose process context and end up in kthread context. This
results in ios to be not accounted to the issuing cgroup and
consequently end up as issued by root. Just like others, adopt the
personality of the blkcg too when issuing via the workqueues.
For the SQPOLL thread, it will live and attach in the inited cgroup's
context.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring does account any registered buffer as pinned/locked memory, and
checks limit and fails if the given user doesn't have a big enough limit
to register the ranges specified. However, if huge pages are used, we
are potentially under-accounting the memory in terms of what gets pinned
on the vm side.
This patch rectifies that, by ensuring that we account the full size of
a compound page, regardless of how much of it is being registered. Huge
pages are not accounted mulitple times - if multiple sections of a huge
page is registered, then the page is only accounted once.
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the spirit of fairness, cap the max number of SQ entries we'll submit
for SQPOLL if we have multiple rings. If we don't do that, we could be
submitting tons of entries for one ring, while others are waiting to get
service.
The value of 8 is somewhat arbitrarily chosen as something that allows
a fair bit of batching, without using an excessive time per ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's really no point in having this union, it just means that we're
always allocating enough room to cater to any command. But that's
pointless, as the ->io field is request type private anyway.
This gets rid of the io_async_ctx structure, and fills in the required
size in the io_op_defs[] instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Testing ctx->user_bufs for NULL in io_import_fixed() is not neccessary,
because in that case ctx->nr_user_bufs would be zero, and the following
check would fail.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When io_req_map_rw() is called from io_rw_prep_async(), it memcpy()
iorw->iter into itself. Even though it doesn't lead to an error, such a
memcpy()'s aliasing rules violation is considered to be a bad practise.
Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any
remapping there, so it's much simpler than the generic implementation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set rw->free_iovec to @iovec, that gives an identical result and stresses
that @iovec param rw->free_iovec play the same role.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't touch iter->iov and iov in between __io_import_iovec() and
io_req_map_rw(), the former function aleady sets it correctly, because it
creates one more case with NULL'ed iov to consider in io_req_map_rw().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When using SQPOLL, applications can run into the issue of running out of
SQ ring entries because the thread hasn't consumed them yet. The only
option for dealing with that is checking later, or busy checking for the
condition.
Provide IORING_ENTER_SQ_WAIT if applications want to wait on this
condition.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We support using IORING_SETUP_ATTACH_WQ to share async backends between
rings created by the same process, this now also allows the same to
happen with SQPOLL. The setup procedure remains the same, the caller
sets io_uring_params->wq_fd to the 'parent' context, and then the newly
created ring will attach to that async backend.
This means that multiple rings can share the same SQPOLL thread, saving
resources.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the SQPOLL thread from the ctx, and use the io_sq_data as the
data structure we pass in. io_sq_data has a list of ctx's that we can
then iterate over and handle.
As of now we're ready to handle multiple ctx's, though we're still just
handling a single one after this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move all the necessary state out of io_ring_ctx, and into a new
structure, io_sq_data. The latter now deals with any state or
variables associated with the SQPOLL thread itself.
In preparation for supporting more than one io_ring_ctx per SQPOLL
thread.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is done in preparation for handling more than one ctx, but it also
cleans up the code a bit since io_sq_thread() was a bit too unwieldy to
get a get overview on.
__io_sq_thread() is now the main handler, and it returns an enum sq_ret
that tells io_sq_thread() what it ended up doing. The parent then makes
a decision on idle, spinning, or work handling based on that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to decouple the clearing on wakeup from the the inline schedule,
as that is going to be required for handling multiple rings in one
thread.
Wrap our wakeup handler so we can clear it when we get the wakeup, by
definition that is when we no longer need the flag set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is in preparation to sharing the poller thread between rings. For
that we need per-ring wait_queue_entry storage, and we can't easily put
that on the stack if one thread is managing multiple rings.
We'll also be sharing the wait_queue_head across rings for the purposes
of wakeups, provide the usual private ring wait_queue_head for now but
make it a pointer so we can easily override it when sharing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're not handling signals by default in kernel threads, and we never
use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never
have a signal pending, and we don't need to check for it (nor flush it).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
During a context switch the scheduler invokes wq_worker_sleeping() with
disabled preemption. Disabling preemption is needed because it protects
access to `worker->sleeping'. As an optimisation it avoids invoking
schedule() within the schedule path as part of possible wake up (thus
preempt_enable_no_resched() afterwards).
The io-wq has been added to the mix in the same section with disabled
preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
acquires a spinlock_t. Also within the schedule() the spinlock_t must be
acquired after tsk_is_pi_blocked() otherwise it will block on the
sleeping lock again while scheduling out.
While playing with `io_uring-bench' I didn't notice a significant
latency spike after converting io_wqe::lock to a raw_spinlock_t. The
latency was more or less the same.
In order to keep the spinlock_t it would have to be moved after the
tsk_is_pi_blocked() check which would introduce a branch instruction
into the hot path.
The lock is used to maintain the `work_list' and wakes one task up at
most.
Should io_wqe_cancel_pending_work() cause latency spikes, while
searching for a specific item, then it would need to drop the lock
during iterations.
revert_creds() is also invoked under the lock. According to debug
cred::non_rcu is 0. Otherwise it should be moved outside of the locked
section because put_cred_rcu()->free_uid() acquires a sleeping lock.
Convert io_wqe::lock to a raw_spinlock_t.c
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch adds a new IORING_SETUP_R_DISABLED flag to start the
rings disabled, allowing the user to register restrictions,
buffers, files, before to start processing SQEs.
When IORING_SETUP_R_DISABLED is set, SQE are not processed and
SQPOLL kthread is not started.
The restrictions registration are allowed only when the rings
are disable to prevent concurrency issue while processing SQEs.
The rings can be enabled using IORING_REGISTER_ENABLE_RINGS
opcode with io_uring_register(2).
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode
permanently installs a feature allowlist on an io_ring_ctx.
The io_ring_ctx can then be passed to untrusted code with the
knowledge that only operations present in the allowlist can be
executed.
The allowlist approach ensures that new features added to io_uring
do not accidentally become available when an existing application
is launched on a newer kernel version.
Currently is it possible to restrict sqe opcodes, sqe flags, and
register opcodes.
IOURING_REGISTER_RESTRICTIONS can only be made once. Afterwards
it is not possible to change restrictions anymore.
This prevents untrusted code from removing restrictions.
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we don't get and assign the namespace for the async work, then certain
paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
Anything that references the current namespace of the given task should
be assigned for async work on behalf of that task.
Cc: stable@vger.kernel.org # v5.5+
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Grab actual references to the files_struct. To avoid circular references
issues due to this, we add a per-task note that keeps track of what
io_uring contexts a task has used. When the tasks execs or exits its
assigned files, we cancel requests based on this tracking.
With that, we can grab proper references to the files table, and no
longer need to rely on stashing away ring_fd and ring_file to check
if the ring_fd may have been closed.
Cc: stable@vger.kernel.org # v5.5+
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This allows us to selectively flush out pending overflows, depending on
the task and/or files_struct being passed in.
No intended functional changes in this patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Return whether we found and canceled requests or not. This is in
preparation for using this information, no functional changes in this
patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sometimes we assign a weak reference to it, sometimes we grab a
reference to it. Clean this up and make it unconditional, and drop the
flag related to tracking this state.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can grab a reference to the task instead of stashing away the task
files_struct. This is doable without creating a circular reference
between the ring fd and the task itself.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No functional changes in this patch, prep patch for grabbing references
to the files_struct.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently cancel these when the ring exits, and we cancel all of
them. This is in preparation for killing only the ones associated
with a given task.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We use a device's allocation state tree to track ranges in a device used
for allocated chunks, and we set ranges in this tree when allocating a new
chunk. However after a device replace operation, we were not setting the
allocated ranges in the new device's allocation state tree, so that tree
is empty after a device replace.
This means that a fitrim operation after a device replace will trim the
device ranges that have allocated chunks and extents, as we trim every
range for which there is not a range marked in the device's allocation
state tree. It is also important during chunk allocation, since the
device's allocation state is used to determine if a range is already
allocated when allocating a new chunk.
This is trivial to reproduce and the following script triggers the bug:
$ cat reproducer.sh
#!/bin/bash
DEV1="/dev/sdg"
DEV2="/dev/sdh"
DEV3="/dev/sdi"
wipefs -a $DEV1 $DEV2 $DEV3 &> /dev/null
# Create a raid1 test fs on 2 devices.
mkfs.btrfs -f -m raid1 -d raid1 $DEV1 $DEV2 > /dev/null
mount $DEV1 /mnt/btrfs
xfs_io -f -c "pwrite -S 0xab 0 10M" /mnt/btrfs/foo
echo "Starting to replace $DEV1 with $DEV3"
btrfs replace start -B $DEV1 $DEV3 /mnt/btrfs
echo
echo "Running fstrim"
fstrim /mnt/btrfs
echo
echo "Unmounting filesystem"
umount /mnt/btrfs
echo "Mounting filesystem in degraded mode using $DEV3 only"
wipefs -a $DEV1 $DEV2 &> /dev/null
mount -o degraded $DEV3 /mnt/btrfs
if [ $? -ne 0 ]; then
dmesg | tail
echo
echo "Failed to mount in degraded mode"
exit 1
fi
echo
echo "File foo data (expected all bytes = 0xab):"
od -A d -t x1 /mnt/btrfs/foo
umount /mnt/btrfs
When running the reproducer:
$ ./replace-test.sh
wrote 10485760/10485760 bytes at offset 0
10 MiB, 2560 ops; 0.0901 sec (110.877 MiB/sec and 28384.5216 ops/sec)
Starting to replace /dev/sdg with /dev/sdi
Running fstrim
Unmounting filesystem
Mounting filesystem in degraded mode using /dev/sdi only
mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/sdi, missing codepage or helper program, or other error.
[19581.748641] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi started
[19581.803842] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi finished
[19582.208293] BTRFS info (device sdi): allowing degraded mounts
[19582.208298] BTRFS info (device sdi): disk space caching is enabled
[19582.208301] BTRFS info (device sdi): has skinny extents
[19582.212853] BTRFS warning (device sdi): devid 2 uuid 1f731f47-e1bb-4f00-bfbb-9e5a0cb4ba9f is missing
[19582.213904] btree_readpage_end_io_hook: 25839 callbacks suppressed
[19582.213907] BTRFS error (device sdi): bad tree block start, want 30490624 have 0
[19582.214780] BTRFS warning (device sdi): failed to read root (objectid=7): -5
[19582.231576] BTRFS error (device sdi): open_ctree failed
Failed to mount in degraded mode
So fix by setting all allocated ranges in the replace target device when
the replace operation is finishing, when we are holding the chunk mutex
and we can not race with new chunk allocations.
A test case for fstests follows soon.
Fixes: 1c11b63eff ("btrfs: replace pending/pinned chunks lists with io tree")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nathan popped up on #xfs and pointed out that we fail to handle
finobt btree blocks in xlog_recover_get_buf_lsn(). This means they
always fall through the entire magic number matching code to "recover
immediately". Whilst most of the time this is the correct behaviour,
occasionally it will be incorrect and could potentially overwrite
more recent metadata because we don't check the LSN in the on disk
metadata at all.
This bug has been present since the finobt was first introduced, and
is a potential cause of the occasional xfs_iget_check_free_state()
failures we see that indicate that the inode btree state does not
match the on disk inode state.
Fixes: aafc3c2465 ("xfs: support the XFS_BTNUM_FINOBT free inode btree type")
Reported-by: Nathan Scott <nathans@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
autofs got broken in some configurations by commit 13c164b1a1
("autofs: switch to kernel_write") because there is now an extra LSM
permission check done by security_file_permission() in rw_verify_area().
autofs is one if the few places that really does want the much more
limited __kernel_write(), because the write is an internal kernel one
that shouldn't do any user permission checks (it also doesn't need the
file_start_write/file_end_write logic, since it's just a pipe).
There are a couple of other cases like that - accounting, core dumping,
and splice - but autofs stands out because it can be built as a module.
As a result, we need to export this internal __kernel_write() function
again.
We really don't want any other module to use this, but we don't have a
"EXPORT_SYMBOL_FOR_AUTOFS_ONLY()". But we can mark it GPL-only to at
least approximate that "internal use only" for licensing.
While in this area, make autofs pass in NULL for the file position
pointer, since it's always a pipe, and we now use a NULL file pointer
for streaming file descriptors (see file_ppos() and commit 438ab720c6:
"vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files")
This effectively reverts commits 9db9775224 ("fs: unexport
__kernel_write") and 13c164b1a1 ("autofs: switch to kernel_write").
Fixes: 13c164b1a1 ("autofs: switch to kernel_write")
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch reworks the current receive handling of dlm. As I tried to
change the send handling to fix reorder issues I took a look into the
receive handling and simplified it, it works as the following:
Each connection has a preallocated receive buffer with a minimum length of
4096. On receive, the upper layer protocol will process all dlm message
until there is not enough data anymore. If there exists "leftover" data at
the end of the receive buffer because the dlm message wasn't fully received
it will be copied to the begin of the preallocated receive buffer. Next
receive more data will be appended to the previous "leftover" data and
processing will begin again.
This will remove a lot of code of the current mechanism. Inside the
processing functionality we will ensure with a memmove() that the dlm
message should be memory aligned. To have a dlm message always started
at the beginning of the buffer will reduce some amount of memmove()
calls because src and dest pointers are the same.
The cluster attribute "buffer_size" becomes a new meaning, it's now the
size of application layer receive buffer size. If this is changed during
runtime the receive buffer will be reallocated. It's important that the
receive buffer size has at minimum the size of the maximum possible dlm
message size otherwise the received message cannot be placed inside
the receive buffer size.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
I observed that the upper layer will not send messages above this value.
As conclusion the application receive buffer should not below that
value, otherwise we are not capable to deliver the dlm message to the
upper layer. This patch forbids to set the receive buffer below the
maximum possible dlm message size.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch adds a callback to CLUSTER_ATTR macro to allow individual
callbacks for attributes which might have a more complex attribute range
checking just than non zero.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch fixes to set per nodeid mark configuration for accepted
sockets as well. Before this patch only the listen socket mark value was
used for all accepted connections. This patch will ensure that the
cluster mark attribute value will be always used for all sockets, if a
per nodeid mark value is specified dlm will use this value for the
specific node.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
During my experiments to make dlm robust against tcpkill application I
was able to run sometimes in a circular lock dependency warning between
clusters_root.subsys.su_mutex and con->sock_mutex. We don't need to
held the sock_mutex when getting the mark value which held the
clusters_root.subsys.su_mutex. This patch moves the specific handling
just before the sock_mutex will be held.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Compressed inode and normal inode has different layout, so we should
disallow enabling compress on non-empty file to avoid race condition
during inode .i_addr array parsing and updating.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: Fix missing condition]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add two slab caches: "f2fs_cic_entry" and "f2fs_dic_entry" for memory
allocation of compress_io_ctx and decompress_io_ctx structure.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Although UDF standard allows it, we don't support sparing table larger
than a single block. Check it during mount so that we don't try to
access memory beyond end of buffer.
Reported-by: syzbot+9991561e714f597095da@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
When we fail to read inode, some data accessed in udf_evict_inode() may
be uninitialized. Move the accesses to !is_bad_inode() branch.
Reported-by: syzbot+91f02b28f9bb5f5f1341@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
The async buffered reads feature is not working when readahead is
turned off. There are two things to concern:
- when doing retry in io_read, not only the IOCB_WAITQ flag but also
the IOCB_NOWAIT flag is still set, which makes it goes to would_block
phase in generic_file_buffered_read() and then return -EAGAIN. After
that, the io-wq thread work is queued, and later doing the async
reads in the old way.
- even if we remove IOCB_NOWAIT when doing retry, the feature is still
not running properly, since in generic_file_buffered_read() it goes to
lock_page_killable() after calling mapping->a_ops->readpage() to do
IO, and thus causing process to sleep.
Fixes: 1a0a7853b9 ("mm: support async buffered reads in generic_file_buffered_read()")
Fixes: 3b2a4439e0 ("io_uring: get rid of kiocb_wait_page_queue_init()")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As syzbot reported:
BUG: KASAN: slab-out-of-bounds in init_min_max_mtime fs/f2fs/segment.c:4710 [inline]
BUG: KASAN: slab-out-of-bounds in f2fs_build_segment_manager+0x9302/0xa6d0 fs/f2fs/segment.c:4792
Read of size 8 at addr ffff8880a1b934a8 by task syz-executor682/6878
CPU: 1 PID: 6878 Comm: syz-executor682 Not tainted 5.9.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
init_min_max_mtime fs/f2fs/segment.c:4710 [inline]
f2fs_build_segment_manager+0x9302/0xa6d0 fs/f2fs/segment.c:4792
f2fs_fill_super+0x381a/0x6e80 fs/f2fs/super.c:3633
mount_bdev+0x32e/0x3f0 fs/super.c:1417
legacy_get_tree+0x105/0x220 fs/fs_context.c:592
vfs_get_tree+0x89/0x2f0 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x1387/0x20a0 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount fs/namespace.c:3390 [inline]
__x64_sys_mount+0x27f/0x300 fs/namespace.c:3390
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The root cause is: if segs_per_sec is larger than one, and segment count
in last section is less than segs_per_sec, we will suffer out-of-boundary
memory access on sit_i->sentries[] in init_min_max_mtime().
Fix this by adding sanity check among segment count, section count and
segs_per_sec value in sanity_check_raw_super().
Reported-by: syzbot+481a3ffab50fed41dcc0@syzkaller.appspotmail.com
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As syzbot reported:
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x21c/0x280 lib/dump_stack.c:118
kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:122
__msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:219
f2fs_lookup+0xe05/0x1a80 fs/f2fs/namei.c:503
lookup_open fs/namei.c:3082 [inline]
open_last_lookups fs/namei.c:3177 [inline]
path_openat+0x2729/0x6a90 fs/namei.c:3365
do_filp_open+0x2b8/0x710 fs/namei.c:3395
do_sys_openat2+0xa88/0x1140 fs/open.c:1168
do_sys_open fs/open.c:1184 [inline]
__do_compat_sys_openat fs/open.c:1242 [inline]
__se_compat_sys_openat+0x2a4/0x310 fs/open.c:1240
__ia32_compat_sys_openat+0x56/0x70 fs/open.c:1240
do_syscall_32_irqs_on arch/x86/entry/common.c:80 [inline]
__do_fast_syscall_32+0x129/0x180 arch/x86/entry/common.c:139
do_fast_syscall_32+0x6a/0xc0 arch/x86/entry/common.c:162
do_SYSENTER_32+0x73/0x90 arch/x86/entry/common.c:205
entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
In f2fs_lookup(), @res_page could be used before being initialized,
because in __f2fs_find_entry(), once F2FS_I(dir)->i_current_depth was
been fuzzed to zero, then @res_page will never be initialized, causing
this kmsan warning, relocating @res_page initialization place to fix
this bug.
Reported-by: syzbot+0eac6f0bbd558fd866d7@syzkaller.appspotmail.com
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can relocate @res_page assignment in find_in_block() to
its caller, so unneeded parameter could be removed for cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Meta area is not included in section_count computation.
So the minimum number of total_sections is 1 meanwhile it cannot be
greater than segment_count_main.
The minimum number of meta segments is 8 (SB + 2 (CP + SIT + NAT) + SSA).
Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
A NULL will not be return by __bitmap_ptr here.
Remove the unused check.
Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Relocate blkzoned feature check into parse_options() like
other feature check.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The type of SM_I(sbi)->reserved_segments is unsigned int,
so change the return value to unsigned int.
The type cast can be removed in reserved_sections as a result.
Signed-off-by: Xiaojun Wang <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When removing the last reference of an inode the size of an auth node
is already part of write_len. So we must not call ubifs_add_auth_dirt().
Call it only when needed.
Cc: <stable@vger.kernel.org>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Kristof Havasi <havasiefr@gmail.com>
Fixes: 6a98bc4614 ("ubifs: Add authentication nodes to journal")
Reported-and-tested-by: Kristof Havasi <havasiefr@gmail.com>
Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
Dentries that represent no-key names must have a dentry_operations that
includes fscrypt_d_revalidate(). Currently, this is handled by
fscrypt_prepare_lookup() installing fscrypt_d_ops.
However, ceph support for encryption
(https://lore.kernel.org/r/20200914191707.380444-1-jlayton@kernel.org)
can't use fscrypt_d_ops, since ceph already has its own
dentry_operations.
Similarly, ext4 and f2fs support for directories that are both encrypted
and casefolded
(https://lore.kernel.org/r/20200923010151.69506-1-drosen@google.com)
can't use fscrypt_d_ops either, since casefolding requires some dentry
operations too.
To satisfy both users, we need to move the responsibility of installing
the dentry_operations to filesystems.
In preparation for this, export fscrypt_d_revalidate() and give it a
!CONFIG_FS_ENCRYPTION stub.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200924054721.187797-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Highlights include:
Bugfixes:
- NFSv4.2: copy_file_range needs to invalidate caches on success
- NFSv4.2: Fix security label length not being reset
- pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly on read
- pNFS/flexfiles: Fix signed/unsigned type issues with mirror indices
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl9yHBYACgkQZwvnipYK
APLKCA//Sppmzm+kFDmZ6iWplwdoIq7rnIMG7eKKGD754dDvOtYNIw9D9yOIY5G6
eVdvQ10m6vA8Dp8AxaWK9qacMXljmOX8szz+Bf1NcIe2F6X/waO3zMoud8Rd9Ja4
PigAbAW6Gs0gohL3wg+jh5N5JlaDcZ0Dri3QWdqGaHjhrKV9MW9h0BpBCx9YCPkL
FFgk+I+524rGQnkHvCWbclww4428e+MSYdeJE+c4wrIx/HCz3iJ60AFA0SIAw7FV
6qMtxN4/kqfdIrA074xcreMdkucxe3lNl7ujT1T6dum2OwERq+WyzkwoirqNguJM
X71CXU9IE8rw72ATWMoba961i4HITp05ZbVg7yXZrrRkAEljyHhr67R/1RRSlxQm
ZrPOICrCoXKHRFTbNL7Sb+xeTGbuZQkbcwGXnUYdTIO3JQ6PRIEFb/y8yuuT+EPG
KWk2vM+QM9036qfBWjbAZMOpwDB4oiVkBgzNM8FGcebiV1FANQ1by7oMaQsH1NLm
WY0M0KFY2wdv3ovGT7oUOEbtoxD993HuuLdIWxTRHFjRPgg8WKTFnf4BIeZtMjY8
oRvN83hEjWszuTEuuEukUdsqLTftv7rNhxrotoh9WfeSXvJDB6PF0y55UmZ6WuKE
wRQQLxC9Om+E3HidxgOolqKxD6d4OOY3XJWzH3As7sJEgQyE/5o=
=cNi/
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
- NFSv4.2: copy_file_range needs to invalidate caches on success
- NFSv4.2: Fix security label length not being reset
- pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly
on read
- pNFS/flexfiles: Fix signed/unsigned type issues with mirror
indices"
* tag 'nfs-for-5.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
pNFS/flexfiles: Be consistent about mirror index types
pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly on read
NFSv4.2: fix client's attribute cache management for copy_file_range
nfs: Fix security label length not being reset
iomap complete routine can deadlock with btrfs_fallocate because of the
call to generic_write_sync().
P0 P1
inode_lock() fallocate(FALLOC_FL_ZERO_RANGE)
__iomap_dio_rw() inode_lock()
<block>
<submits IO>
<completes IO>
inode_unlock()
<gets inode_lock()>
inode_dio_wait()
iomap_dio_complete()
generic_write_sync()
btrfs_file_fsync()
inode_lock()
<deadlock>
inode_dio_end() is used to notify the end of DIO data in order
to synchronize with truncate. Call inode_dio_end() before calling
generic_write_sync(), so filesystems can lock i_rwsem during a sync.
This matches the way it is done in fs/direct-io.c:dio_complete().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This is to avoid the deadlock caused in btrfs because of O_DIRECT |
O_DSYNC.
Filesystems such as btrfs require i_rwsem while performing sync on a
file. iomap_dio_rw() is called under i_rw_sem. This leads to a
deadlock because of:
iomap_dio_complete()
generic_write_sync()
btrfs_sync_file()
Separate out iomap_dio_complete() from iomap_dio_rw(), so filesystems
can call iomap_dio_complete() after unlocking i_rwsem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
For filesystems with block size < page size, we need to set all the
per-block uptodate bits if the page was already uptodate at the time
we create the per-block metadata. This can happen if the page is
invalidated (eg by a write to drop_caches) but ultimately not removed
from the page cache.
This is a data corruption issue as page writeback skips blocks which
are marked !uptodate.
Fixes: 9dc55f1389 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Qian Cai <cai@redhat.com>
Cc: Brian Foster <bfoster@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
KSTAT_QUERY_FLAGS expands to AT_STATX_SYNC_TYPE, which itself already
is a mask. Remove the double name, especially given that the prefix
is a little confusing vs the normal AT_* flags.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The function really obsfucates checking for valid flags and setting the
lookup flags. The fact that it returns -EINVAL through and unsigned
return value, which is then used as boolean really doesn't help either.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This allows to keep vfs_statx static in fs/stat.c to prepare for the following
changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
vfs_statx_fd is only used to implement vfs_fstat. Remove vfs_statx_fd
and just implement vfs_fstat directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9upV4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuPrD/9K1UQLv38K2nPYclLymOi+GIsukpjgwzdY
SM38GNXU5vYkFhylH/bXfckNQ/gja0/whNpcr/UVCgTWleMnss9UiZaCgysyuIOL
vnBxT4yDZIxtkOwF/790NiwV2FrsmrLFdNZU4LkmfbmmrAlNtjOElKyJsOyNNMzJ
UMzHH2Z1vvUwKz+Yq4fPyZCJbpNHN6ABwkSXY/Nz8oWsKfw728fZztLsP57gOtkl
yYVFO2z1n7VaWp5ZzVFYG51DFuMCIDXgN6mMlaKfnQ6auQZxjR+jg69HSRKLjIx7
ZEG1gl/DzwH1+751P7HnuI3U7BtBYolyErHW4j4a6Ko4XX8PPhG9ODKOmsEMPrEq
gCUGcGgWUsEyz+pyullTEt7ea/oLGJ5N86qtNOdviXETZZTghm47QlzxFWr1/GWy
wH++ctBf/Ekk0dbCBF6mJiqDl/PrVSDSClTVhsGJESEmk4BOoC9zd9zT/EfsiR9m
vA8wLE2g1/5oU+irQ0Dlc/EENVWISiigOFFvTPJJjma9iGXAW3kV2/aYW6DKZSwM
w/va7zTlzt89O+L0AT+Rg8btaiTiaZcs3op1AFa1z5Gut3b5YhWL/e95wlaOI6Nv
Tudm4GX06BaN1QdUDV9g0Pr5iNOaCvuOArjNOU3j7ySusJxiJ8GdA3WFqJ/XUlIV
pne8hC/+7A==
=mBw7
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Two fixes for regressions in this cycle, and one that goes to 5.8
stable:
- fix leak of getname() retrieved filename
- remove plug->nowait assignment, fixing a regression with btrfs
- fix for async buffered retry"
* tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block:
io_uring: ensure async buffered read-retry is setup properly
io_uring: don't unconditionally set plug->nowait = true
io_uring: ensure open/openat2 name is cleaned on cancelation
Since only the v4 code cares about it, maybe it's better to leave
rq_lease_breaker out of the common dispatch code?
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There are actually rare races where this is possible (e.g. if a new open
intervenes between the read of i_writecount and the fi_fds).
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The nfsd open code has always kept separate read-only, read-write, and
write-only opens as necessary to ensure that when a client closes or
downgrades, we don't retain more access than necessary.
Also, I didn't realize the cache behaved this way when I wrote
94415b06eb "nfsd4: a client's own opens needn't prevent delegations".
There I assumed fi_fds[O_WRONLY] and fi_fds[O_RDWR] would always be
distinct. The violation of that assumption is triggering a
WARN_ON_ONCE() and could also cause the server to give out a delegation
when it shouldn't.
Fixes: 94415b06eb ("nfsd4: a client's own opens needn't prevent delegations")
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
silence nfscache allocation warnings with kvzalloc
Currently nfsd_reply_cache_init attempts hash table allocation through
kmalloc, and manually falls back to vzalloc if that fails. This makes
the code a little larger than needed, and creates a significant amount
of serial console spam if you have enough systems.
Switching to kvzalloc gets rid of the allocation warnings, and makes
the code a little cleaner too as a side effect.
Freeing of nn->drc_hashtbl is already done using kvfree currently.
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fixes coccicheck warning:
fs/nfsd/nfs4proc.c:3234:5-29: WARNING: Comparison to bool
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Squelch some sparse warnings:
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:1860:16: warning: incorrect type in assignment (different base types)
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:1860:16: expected int status
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:1860:16: got restricted __be32
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:1862:24: warning: incorrect type in return expression (different base types)
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:1862:24: expected restricted __be32
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:1862:24: got int status
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Squelch some sparse warnings:
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:4692:24: warning: incorrect type in return expression (different base types)
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:4692:24: expected int
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:4692:24: got restricted __be32 [usertype]
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:4702:32: warning: incorrect type in return expression (different base types)
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:4702:32: expected int
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:4702:32: got restricted __be32 [usertype]
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:4739:13: warning: incorrect type in assignment (different base types)
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:4739:13: expected restricted __be32 [usertype] err
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:4739:13: got int
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:4891:15: warning: incorrect type in assignment (different base types)
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:4891:15: expected unsigned int [assigned] [usertype] count
/home/cel/src/linux/linux/fs/nfsd/nfs4xdr.c:4891:15: got restricted __be32 [usertype]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Squelch some sparse warnings:
/home/cel/src/linux/linux/fs/nfsd/vfs.c:2264:13: warning: incorrect type in assignment (different base types)
/home/cel/src/linux/linux/fs/nfsd/vfs.c:2264:13: expected int err
/home/cel/src/linux/linux/fs/nfsd/vfs.c:2264:13: got restricted __be32
/home/cel/src/linux/linux/fs/nfsd/vfs.c:2266:24: warning: incorrect type in return expression (different base types)
/home/cel/src/linux/linux/fs/nfsd/vfs.c:2266:24: expected restricted __be32
/home/cel/src/linux/linux/fs/nfsd/vfs.c:2266:24: got int err
/home/cel/src/linux/linux/fs/nfsd/vfs.c:2288:13: warning: incorrect type in assignment (different base types)
/home/cel/src/linux/linux/fs/nfsd/vfs.c:2288:13: expected int err
/home/cel/src/linux/linux/fs/nfsd/vfs.c:2288:13: got restricted __be32
/home/cel/src/linux/linux/fs/nfsd/vfs.c:2290:24: warning: incorrect type in return expression (different base types)
/home/cel/src/linux/linux/fs/nfsd/vfs.c:2290:24: expected restricted __be32
/home/cel/src/linux/linux/fs/nfsd/vfs.c:2290:24: got int err
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reserving space for a large READ payload requires special handling when
reserving space in the xdr buffer pages. One problem we can have is use
of the scratch buffer, which is used to get a pointer to a contiguous
region of data up to PAGE_SIZE. When using the scratch buffer, calls to
xdr_commit_encode() shift the data to it's proper alignment in the xdr
buffer. If we've reserved several pages in a vector, then this could
potentially invalidate earlier pointers and result in incorrect READ
data being sent to the client.
I get around this by looking at the amount of space left in the current
page, and never reserve more than that for each entry in the read
vector. This lets us place data directly where it needs to go in the
buffer pages.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Now when a read delegation is given, two delegation related traces
will be printed:
nfsd_deleg_open: client 5f45b854:e6058001 stateid 00000030:00000001
nfsd_deleg_none: client 5f45b854:e6058001 stateid 0000002f:00000001
Although the intention is to let developers know two stateid are
returned, the traces are confusing about whether or not a read delegation
is handled out. So renaming trace_nfsd_deleg_none() to trace_nfsd_open()
and trace_nfsd_deleg_open() to trace_nfsd_deleg_read() to make
the intension clearer.
The patched traces will be:
nfsd_deleg_read: client 5f48a967:b55b21cd stateid 00000003:00000001
nfsd_open: client 5f48a967:b55b21cd stateid 00000002:00000001
Suggested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In nfsd4_encode_listxattrs(), the variable p is assigned to at one point
but this value is never used before p is reassigned. Fix this.
Addresses-Coverity: ("Unused value")
Signed-off-by: Alex Dewar <alex.dewar90@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The delegation is no longer returnable, so I don't think there's much
point retrying the recall.
(I think it's worth asking why we even need separate CLOSED_DELEG and
REVOKED_DELEG states. But treating them the same would currently cause
nfsd4_free_stateid to call list_del_init(&dp->dl_recall_lru) on a
delegation that the laundromat had unhashed but not revoked, incorrectly
removing it from the laundromat's reaplist or a client's dl_recall_lru.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
It was an interesting idea but nobody seems to be using it, it's buggy
at this point, and nfs4state.c is already complicated enough without it.
The new nfsd/clients/ code provides some of the same functionality, and
could probably do more if desired.
This feature has been deprecated since 9d60d93198 ("Deprecate nfsd
fault injection").
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A previous commit for fixing up short reads botched the async retry
path, so we ended up going to worker threads more often than we should.
Fix this up, so retries work the way they originally were intended to.
Fixes: 227c0c9673 ("io_uring: internally retry short reads")
Reported-by: Hao_Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Without this patch efivarfs_alloc_dentry creates dentries with slashes in
their name if the respective EFI variable has slashes in its name. This in
turn causes EIO on getdents64, which prevents a complete directory listing
of /sys/firmware/efi/efivars/.
This patch replaces the invalid shlashes with exclamation marks like
kobject_set_name_vargs does for /sys/firmware/efi/vars/ to have consistently
named dentries under /sys/firmware/efi/vars/ and /sys/firmware/efi/efivars/.
Signed-off-by: Michael Schaller <misch@google.com>
Link: https://lore.kernel.org/r/20200925074502.150448-1-misch@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
These optionr were for Irix compatibility, probably for clustered XFS
clients in a heterogenous cluster which contained both Irix & Linux
machines, so that behavior would be consistent. That doesn't exist anymore
and it's no longer needed.
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: actually state when the sysctls go away]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
ikeep/noikeep was a workaround for old DMAPI code which is no longer
relevant.
attr2/noattr2 - is for controlling upgrade behaviour from fixed attribute
fork sizes in the inode (attr1) and dynamic attribute fork sizes (attr2).
mkfs has defaulted to setting attr2 since 2007, hence just about every
XFS filesystem out there in production right now uses attr2.
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix minor typos]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The current create and mkdir handlers both call the xfs_vn_mknod()
which is a wrapper routine around xfs_generic_create() function.
Actually the create and mkdir handlers can directly call
xfs_generic_create() function and reduce the call chain.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
During code review, I noticed that the rmap code uses the (slower)
shared mappings rmap functions for any extent of a reflinked file, even
if those extents are for the attr fork, which doesn't support sharing.
We can speed up rmap a tiny bit by optimizing out this case.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Since commit 1c1c6ebcf5 ("xfs: Replace per-ag array with a radix
tree"), there is no m_peraglock anymore, so it's hard to understand
the described situation since per-ag is no longer an array and no
need to reallocate, call xfs_filestream_flush() in growfs.
In addition, the race condition for shrink feature is quite confusing
to me currently as well. Get rid of it instead.
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cleanup the typedef usage, the unnecessary parentheses, the unnecessary
backslash and use the open-coded round_up call in
xfs_attr_leaf_entsize_{remote,local}.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We should do the assert for all the log intent-done items if they appear
here. This patch detect intent-done items by the fact that their item ops
don't have iop_unpin and iop_push methods and also move the helper
xlog_item_is_intent to xfs_trans.h.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Since we never use the second parameter id, so remove it from
xfs_qm_dqattach_one() function.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We already check whether the crc feature is enabled before calling
xfs_attr3_rmt_verify(), so remove the redundant feature check in that
function.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fix the comments to help people understand the code.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
[darrick: fix the indenting problems too]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Since the type prid_t and xfs_dqid_t both are uint32_t, seems the
type cast is unnecessary, so remove it.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We have already defined the project ID type prid_t, so maybe should
use it here.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
There are no callers of the SYNCHRONIZE() macro, so remove it.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This lets the compiler inline it into import_iovec() generating
much better code.
Signed-off-by: David Laight <david.laight@aculab.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This causes all the bios to be submitted with REQ_NOWAIT, which can be
problematic on either btrfs or on file systems that otherwise use a mix
of block devices where only some of them support it.
For now, just remove the setting of plug->nowait = true.
Reported-by: Dan Melnic <dmm@fb.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Fixes: b63534c41e ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to move the closing of the src_device out of all the device
replace locking, but we definitely want to zero out the superblock
before we commit the last time to make sure the device is properly
removed. Handle this by pushing btrfs_scratch_superblocks into
btrfs_dev_replace_finishing, and then later on we'll move the src_device
closing and freeing stuff where we need it to be.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a littler helper to make the somewhat arcane bd_contains checks a
little more obvious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we cancel these requests, we'll leak the memory associated with the
filename. Add them to the table of ops that need cleaning, if
REQ_F_NEED_CLEANUP is set.
Cc: stable@vger.kernel.org
Fixes: e62753e4e2 ("io_uring: call statx directly")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the two negative flags that are always used together with a
single positive flag that indicates the writeback capability instead
of two related non-capabilities. Also remove the pointless wrappers
to just check the flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to
make the checks more obvious. Also remove the pointless
bdi_cap_account_writeback wrapper that just obsfucates the check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it. This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.
One downside is that we an't support the stable_pages_required bdi
attribute in sysfs anymore. It is replaced with a queue attribute which
also is writable for easier testing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just checking SB_I_CGROUPWB for cgroup writeback support is enough.
Either the file system allocates its own bdi (e.g. btrfs), in which case
it is known to support cgroup writeback, or the bdi comes from the block
layer, which always supports cgroup writeback.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set up a readahead size by default, as very few users have a good
reason to change it. This means code, ecryptfs, and orangefs now
set up the values while they were previously missing it, while ubifs,
mtd and vboxsf manually set it to 0 to avoid readahead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The last user of SB_I_MULTIROOT is disappeared with commit f2aedb713c
("NFS: Add fs_context support.")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Client uses static bitmask for GETATTR on CLOSE/WRITE/DELEGRETURN
and ignores the fact that it might have some attributes marked
invalid in its cache. Compared to v3 where all attributes are
retrieved in postop attributes, v4's cache is frequently out of
sync and leads to standalone GETATTRs being sent to the server.
Instead, in addition to the minimum cache consistency attributes
also check cache_validity and adjust the GETATTR request accordingly.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Originally we used the term "encrypted name" or "ciphertext name" to
mean the encoded filename that is shown when an encrypted directory is
listed without its key. But these terms are ambiguous since they also
mean the filename stored on-disk. "Encrypted name" is especially
ambiguous since it could also be understood to mean "this filename is
encrypted on-disk", similar to "encrypted file".
So we've started calling these encoded names "no-key names" instead.
Therefore, rename DCACHE_ENCRYPTED_NAME to DCACHE_NOKEY_NAME to avoid
confusion about what this flag means.
Link: https://lore.kernel.org/r/20200924042624.98439-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Currently we're using the term "ciphertext name" ambiguously because it
can mean either the actual ciphertext filename, or the encoded filename
that is shown when an encrypted directory is listed without its key.
The latter we're now usually calling the "no-key name"; and while it's
derived from the ciphertext name, it's not the same thing.
To avoid this ambiguity, rename fscrypt_name::is_ciphertext_name to
fscrypt_name::is_nokey_name, and update comments that say "ciphertext
name" (or "encrypted name") to say "no-key name" instead when warranted.
Link: https://lore.kernel.org/r/20200924042624.98439-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9q8PUACgkQxWXV+ddt
WDsHZg//YF3Rfeo7/zaRsfUPvNoKDcM69TW+HROJXu4+rYlOukyuh5T+wboRU1Ft
7ymiR18idPYbtOczmH1Pqw+3wyOr39WafcvAnndoUguXJHsUrriBNqkthQICt0CG
hUUiofedaB+j+ti7AYGhF/tkqjd8LkCj8SGEz4cSUFCheIHR+ajFwFmx1Sw6NGJV
h9SdKfbBpIqIpoExFhprNFlxdaKN9rlhYY+zXZYeCBdU6r89CkuLqxZ79GzaU0N7
PG7FxuuJXvyHhta2a6p8hnEp7perOG22OTXJhzXd5JXiNCfZ/w4SfhH/aPO/3t5V
x42hO+FvloVSLS3woZqkBsCgCIe0a3QOT0YxZiM+1cwSgg8mVw4UBEB3PIgkfOVT
LawMbcgSh1evsSazru8gujm4f8RVxpSxxWfhhRwjXtyB8K89e22yBa9Lwfj04SH7
O5O7VrLDDnHsQWinsEf4Rl6byA13jUCgI5eUxZ5B7Au0Pm9uMexDh3lvgE0W0ucY
UvD8qAetu2NNZD68gZp597uHPrwu+Lr+VumIh4wF6doeShlIkbf/d+ntOgW9ey1S
WFSh7sUdKg5pVf6KJQ4yc3aBA6un5lv9LnvPJOwc9HyMUj/cYuxywxWf1YMr5umv
7/6CkufYjTAmEERQeqE1I6UIgUiWkS9nIisB8BLbYkrMR9Wi1bs=
=tFUE
-----END PGP SIGNATURE-----
Merge tag 'for-5.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"syzkaller started to hit us with reports, here's a fix for one type
(stack overflow when printing checksums on read error).
The other patch is a fix for sysfs object, we have a test for that and
it leads to a crash."
* tag 'for-5.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix put of uninitialized kobject after seed device delete
btrfs: fix overflow when copying corrupt csums for a message
Just check the dev_t to help simplifying the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use blkdev_get_by_dev instead of igrab (aka open coded bdgrab) +
blkdev_get.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can only scan for partitions on the whole disk, so move the flag
from struct block_device to struct gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Let's use DIV_ROUND_UP() to calculate log record header
blocks as what did in xlog_get_iclog_buffer_size() and
wrap up a common helper for log recovery.
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently, crafted h_len has been blocked for the log
header of the tail block in commit a70f9fe52d ("xfs:
detect and handle invalid iclog size set by mkfs").
However, each log record could still have crafted h_len
and cause log record buffer overrun. So let's check
h_len vs buffer size for each log record as well.
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Nowadays, log recovery will call ->release on the recovered intent items
if recovery fails. Therefore, it's redundant to release them from
inside the ->recover functions when they're about to return an error.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In the bmap intent item recovery code, we must be careful to attach the
inode to its dquots (if quotas are enabled) so that a change in the
shape of the bmap btree doesn't cause the quota counters to be
incorrect.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
During a code inspection, I found a serious bug in the log intent item
recovery code when an intent item cannot complete all the work and
decides to requeue itself to get that done. When this happens, the
item recovery creates a new incore deferred op representing the
remaining work and attaches it to the transaction that it allocated. At
the end of _item_recover, it moves the entire chain of deferred ops to
the dummy parent_tp that xlog_recover_process_intents passed to it, but
fail to log a new intent item for the remaining work before committing
the transaction for the single unit of work.
xlog_finish_defer_ops logs those new intent items once recovery has
finished dealing with the intent items that it recovered, but this isn't
sufficient. If the log is forced to disk after a recovered log item
decides to requeue itself and the system goes down before we call
xlog_finish_defer_ops, the second log recovery will never see the new
intent item and therefore has no idea that there was more work to do.
It will finish recovery leaving the filesystem in a corrupted state.
The same logic applies to /any/ deferred ops added during intent item
recovery, not just the one handling the remaining work.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When xchk_da_btree_block is loading a non-root dabtree block, we know
that the parent block had to have a (hashval, address) pointer to the
block that we just loaded. Check that the hashval in the parent matches
the block we just loaded.
This was found by fuzzing nbtree[3].hashval = ones in xfs/394.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When callers pass XFS_BMAPI_REMAP into xfs_bunmapi, they want the extent
to be unmapped from the given file fork without the extent being freed.
We do this for non-rt files, but we forgot to do this for realtime
files. So far this isn't a big deal since nobody makes a bunmapi call
to a rt file with the REMAP flag set, but don't leave a logic bomb.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In xfs_growfs_rt(), we enlarge bitmap and summary files by allocating
new blocks for both files. For each of the new blocks allocated, we
allocate an xfs_buf, zero the payload, log the contents and commit the
transaction. Hence these buffers will eventually find themselves
appended to list at xfs_ail->ail_buf_list.
Later, xfs_growfs_rt() loops across all of the new blocks belonging to
the bitmap inode to set the bitmap values to 1. In doing so, it
allocates a new transaction and invokes the following sequence of
functions,
- xfs_rtfree_range()
- xfs_rtmodify_range()
- xfs_rtbuf_get()
We pass '&xfs_rtbuf_ops' as the ops pointer to xfs_trans_read_buf().
- xfs_trans_read_buf()
We find the xfs_buf of interest in per-ag hash table, invoke
xfs_buf_reverify() which ends up assigning '&xfs_rtbuf_ops' to
xfs_buf->b_ops.
On the other hand, if xfs_growfs_rt_alloc() had allocated a few blocks
for the bitmap inode and returned with an error, all the xfs_bufs
corresponding to the new bitmap blocks that have been allocated would
continue to be on xfs_ail->ail_buf_list list without ever having a
non-NULL value assigned to their b_ops members. An AIL flush operation
would then trigger the following warning message to be printed on the
console,
XFS (loop0): _xfs_buf_ioapply: no buf ops on daddr 0x58 len 8
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
CPU: 3 PID: 449 Comm: xfsaild/loop0 Not tainted 5.8.0-rc4-chandan-00038-g4d8c2b9de9ab-dirty #37
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
dump_stack+0x57/0x70
_xfs_buf_ioapply+0x37c/0x3b0
? xfs_rw_bdev+0x1e0/0x1e0
? xfs_buf_delwri_submit_buffers+0xd4/0x210
__xfs_buf_submit+0x6d/0x1f0
xfs_buf_delwri_submit_buffers+0xd4/0x210
xfsaild+0x2c8/0x9e0
? __switch_to_asm+0x42/0x70
? xfs_trans_ail_cursor_first+0x80/0x80
kthread+0xfe/0x140
? kthread_park+0x90/0x90
ret_from_fork+0x22/0x30
This message indicates that the xfs_buf had its b_ops member set to
NULL.
This commit fixes the issue by assigning "&xfs_rtbuf_ops" to b_ops
member of each of the xfs_bufs logged by xfs_growfs_rt_alloc().
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
compat_sys_mount is identical to the regular sys_mount now, so remove it
and use the native version everywhere.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is no reason the generic fs code should bother with NFS specific
binary mount data - lift the conversion into nfs4_parse_monolithic
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove a level of indentation for the version 1 mount data parsing, and
simplify the NULL data case a little bit as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Issue identified with Coccinelle.
Signed-off-by: Alex Dewar <alex.dewar90@gmail.com>
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Two minor conflicts:
1) net/ipv4/route.c, adding a new local variable while
moving another local variable and removing it's
initial assignment.
2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
One pretty prints the port mode differently, whilst another
changes the driver to try and obtain the port mode from
the port node rather than the switch node.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull vfs fixes from Al Viro:
"No common topic, just assorted fixes"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fuse: fix the ->direct_IO() treatment of iov_iter
fs: fix cast in fsparam_u32hex() macro
vboxsf: Fix the check for the old binary mount-arguments struct
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9qLpQQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpk/qD/0dj9STzEMkUsbl2XA5oifF2NVn6VHMidJ3
Ukdhoy4ihh2UFBFO2VZv2UNZ7o4Zt53TA3ha+fB0EL7I23g86XTOItTWd+JHOGpI
M11JejYTxcSUzPVrPfd/2PJ/Tqx+ld4ojTxH8noS4hx7FgueSuRR80UU5gfLGAmr
e7A7vHD8tr9ZoqNcyVVCYa0/80gUbxh1wYOMvqaE6dSPITe96keGKmmk8hRA8kQo
SBfbZeEqf2oErlM0dTVOd34rZbQQyRuMpDmLuc/g6RNMFVPyBqEvQmGwqOtWNe4q
RFS9/imQA1Wi1OD15NoDx0C7BGovmT53xfXpnqI3lXzywxSDGhGVQd0E8Udp6zha
xszrFlQEqS4OFZrHK6B+tnJBFFBZ8jN0K3ZlHpO8QH83OGvyr2k/RokoHFWMTSYh
+5pHRd+6p7o8traQ6h0MJXmacIxZ0hQdJPuawRjAnziBgRhMV2FMLAXgYHtWl0AD
wUiBWUEIV9PP0phu78X2TxvB9L7CPjuv7orJ8Q5dBSkQc7i33ESYMe8Mix85CFm+
SQcazoQE7VLL175TN/FdDDKkBeyAsob9TjeEazb04Vywy0vHW+MGrSOescCBDLF7
RRDRE0E12Ur9BTVTBi/MJsXT2xtufxN2YU368ZX78RYwgI4r9lx4LZZDte3h9/gs
xEPXk5vuzg==
=ImBG
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.9-2020-09-22' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A few fixes - most of them regression fixes from this cycle, but also
a few stable heading fixes, and a build fix for the included demo tool
since some systems now actually have gettid() available"
* tag 'io_uring-5.9-2020-09-22' of git://git.kernel.dk/linux-block:
io_uring: fix openat/openat2 unified prep handling
io_uring: mark statx/files_update/epoll_ctl as non-SQPOLL
tools/io_uring: fix compile breakage
io_uring: don't use retry based buffered reads for non-async bdev
io_uring: don't re-setup vecs/iter in io_resumit_prep() is already there
io_uring: don't run task work on an exiting task
io_uring: drop 'ctx' ref on task work cancelation
io_uring: grab any needed state during defer prep
The following test case leads to NULL kobject free error:
mount seed /mnt
add sprout to /mnt
umount /mnt
mount sprout to /mnt
delete seed
kobject: '(null)' (00000000dd2b87e4): is not initialized, yet kobject_put() is being called.
WARNING: CPU: 1 PID: 15784 at lib/kobject.c:736 kobject_put+0x80/0x350
RIP: 0010:kobject_put+0x80/0x350
::
Call Trace:
btrfs_sysfs_remove_devices_dir+0x6e/0x160 [btrfs]
btrfs_rm_device.cold+0xa8/0x298 [btrfs]
btrfs_ioctl+0x206c/0x22a0 [btrfs]
ksys_ioctl+0xe2/0x140
__x64_sys_ioctl+0x1e/0x29
do_syscall_64+0x96/0x150
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4047c6288b
::
This is because, at the end of the seed device-delete, we try to remove
the seed's devid sysfs entry. But for the seed devices under the sprout
fs, we don't initialize the devid kobject yet. So add a kobject state
check, which takes care of the bug.
Fixes: 668e48af7a ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
CC: stable@vger.kernel.org # 5.6+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that there's a library function that calculates the SHA-256 digest
of a buffer in one step, use it instead of sha256_init() +
sha256_update() + sha256_final().
Link: https://lore.kernel.org/r/20200917045341.324996-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
fscrypt_set_test_dummy_encryption() requires that the optional argument
to the test_dummy_encryption mount option be specified as a substring_t.
That doesn't work well with filesystems that use the new mount API,
since the new way of parsing mount options doesn't use substring_t.
Make it take the argument as a 'const char *' instead.
Instead of moving the match_strdup() into the callers in ext4 and f2fs,
make them just use arg->from directly. Since the pattern is
"test_dummy_encryption=%s", the argument will be null-terminated.
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-14-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
The behavior of the test_dummy_encryption mount option is that when a
new file (or directory or symlink) is created in an unencrypted
directory, it's automatically encrypted using a dummy encryption policy.
That's it; in particular, the encryption (or lack thereof) of existing
files (or directories or symlinks) doesn't change.
Unfortunately the implementation of test_dummy_encryption is a bit weird
and confusing. When test_dummy_encryption is enabled and a file is
being created in an unencrypted directory, we set up an encryption key
(->i_crypt_info) for the directory. This isn't actually used to do any
encryption, however, since the directory is still unencrypted! Instead,
->i_crypt_info is only used for inheriting the encryption policy.
One consequence of this is that the filesystem ends up providing a
"dummy context" (policy + nonce) instead of a "dummy policy". In
commit ed318a6cc0 ("fscrypt: support test_dummy_encryption=v2"), I
mistakenly thought this was required. However, actually the nonce only
ends up being used to derive a key that is never used.
Another consequence of this implementation is that it allows for
'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
case that can be forgotten about. For example, currently
FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
dummy encryption policy when the filesystem is mounted with
test_dummy_encryption. That seems like the wrong thing to do, since
again, the directory itself is not actually encrypted.
Therefore, switch to a more logical and maintainable implementation
where the dummy encryption policy inheritance is done without setting up
keys for unencrypted directories. This involves:
- Adding a function fscrypt_policy_to_inherit() which returns the
encryption policy to inherit from a directory. This can be a real
policy, a dummy policy, or no policy.
- Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.
- Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
of an inode.
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
In preparation for moving the logic for "get the encryption policy
inherited by new files in this directory" to a single place, make
fscrypt_prepare_symlink() a regular function rather than an inline
function that wraps __fscrypt_prepare_symlink().
This way, the new function fscrypt_policy_to_inherit() won't need to be
exported to filesystems.
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-12-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
The fscrypt UAPI header defines fscrypt_policy to fscrypt_policy_v1,
for source compatibility with old userspace programs.
Internally, the kernel doesn't want that compatibility definition.
Instead, fscrypt_private.h #undefs it and re-defines it to a union.
That works for now. However, in order to add
fscrypt_operations::get_dummy_policy(), we'll need to forward declare
'union fscrypt_policy' in include/linux/fscrypt.h. That would cause
build errors because "fscrypt_policy" is used in ioctl numbers.
To avoid this, modify the UAPI header to make the fscrypt_policy
compatibility definition conditional on !__KERNEL__, and make the ioctls
use fscrypt_policy_v1 instead of fscrypt_policy.
Note that this doesn't change the actual ioctl numbers.
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-11-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
fscrypt_get_encryption_info() has never actually been safe to call in a
context that needs GFP_NOFS, since it calls crypto_alloc_skcipher().
crypto_alloc_skcipher() isn't GFP_NOFS-safe, even if called under
memalloc_nofs_save(). This is because it may load kernel modules, and
also because it internally takes crypto_alg_sem. Other tasks can do
GFP_KERNEL allocations while holding crypto_alg_sem for write.
The use of fscrypt_init_mutex isn't GFP_NOFS-safe either.
So, stop pretending that fscrypt_get_encryption_info() is nofs-safe.
I.e., when it allocates memory, just use GFP_KERNEL instead of GFP_NOFS.
Note, another reason to do this is that GFP_NOFS is deprecated in favor
of using memalloc_nofs_save() in the proper places.
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-10-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Now that all filesystems have been converted to use
fscrypt_prepare_new_inode(), the encryption key for new symlink inodes
is now already set up whenever we try to encrypt the symlink target.
Enforce this rather than try to set up the key again when it may be too
late to do so safely.
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-9-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Now that all filesystems have been converted to use
fscrypt_prepare_new_inode() and fscrypt_set_context(),
fscrypt_inherit_context() is no longer used. Remove it.
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Now that a fscrypt_info may be set up for inodes that are currently
being created and haven't yet had an inode number assigned, avoid
logging confusing messages about "inode 0".
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-7-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Convert ubifs to use the new functions fscrypt_prepare_new_inode() and
fscrypt_set_context().
Unlike ext4 and f2fs, this doesn't appear to fix any deadlock bug. But
it does shorten the code slightly and get all filesystems using the same
helper functions, so that fscrypt_inherit_context() can be removed.
It also fixes an incorrect error code where ubifs returned EPERM instead
of the expected ENOKEY.
Link: https://lore.kernel.org/r/20200917041136.178600-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Convert f2fs to use the new functions fscrypt_prepare_new_inode() and
fscrypt_set_context(). This avoids calling
fscrypt_get_encryption_info() from under f2fs_lock_op(), which can
deadlock because fscrypt_get_encryption_info() isn't GFP_NOFS-safe.
For more details about this problem, see the earlier patch
"fscrypt: add fscrypt_prepare_new_inode() and fscrypt_set_context()".
This also fixes a f2fs-specific deadlock when the filesystem is mounted
with '-o test_dummy_encryption' and a file is created in an unencrypted
directory other than the root directory:
INFO: task touch:207 blocked for more than 30 seconds.
Not tainted 5.9.0-rc4-00099-g729e3d0919844 #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:touch state:D stack: 0 pid: 207 ppid: 167 flags:0x00000000
Call Trace:
[...]
lock_page include/linux/pagemap.h:548 [inline]
pagecache_get_page+0x25e/0x310 mm/filemap.c:1682
find_or_create_page include/linux/pagemap.h:348 [inline]
grab_cache_page include/linux/pagemap.h:424 [inline]
f2fs_grab_cache_page fs/f2fs/f2fs.h:2395 [inline]
f2fs_grab_cache_page fs/f2fs/f2fs.h:2373 [inline]
__get_node_page.part.0+0x39/0x2d0 fs/f2fs/node.c:1350
__get_node_page fs/f2fs/node.c:35 [inline]
f2fs_get_node_page+0x2e/0x60 fs/f2fs/node.c:1399
read_inline_xattr+0x88/0x140 fs/f2fs/xattr.c:288
lookup_all_xattrs+0x1f9/0x2c0 fs/f2fs/xattr.c:344
f2fs_getxattr+0x9b/0x160 fs/f2fs/xattr.c:532
f2fs_get_context+0x1e/0x20 fs/f2fs/super.c:2460
fscrypt_get_encryption_info+0x9b/0x450 fs/crypto/keysetup.c:472
fscrypt_inherit_context+0x2f/0xb0 fs/crypto/policy.c:640
f2fs_init_inode_metadata+0xab/0x340 fs/f2fs/dir.c:540
f2fs_add_inline_entry+0x145/0x390 fs/f2fs/inline.c:621
f2fs_add_dentry+0x31/0x80 fs/f2fs/dir.c:757
f2fs_do_add_link+0xcd/0x130 fs/f2fs/dir.c:798
f2fs_add_link fs/f2fs/f2fs.h:3234 [inline]
f2fs_create+0x104/0x290 fs/f2fs/namei.c:344
lookup_open.isra.0+0x2de/0x500 fs/namei.c:3103
open_last_lookups+0xa9/0x340 fs/namei.c:3177
path_openat+0x8f/0x1b0 fs/namei.c:3365
do_filp_open+0x87/0x130 fs/namei.c:3395
do_sys_openat2+0x96/0x150 fs/open.c:1168
[...]
That happened because f2fs_add_inline_entry() locks the directory
inode's page in order to add the dentry, then f2fs_get_context() tries
to lock it recursively in order to read the encryption xattr. This
problem is specific to "test_dummy_encryption" because normally the
directory's fscrypt_info would be set up prior to
f2fs_add_inline_entry() in order to encrypt the new filename.
Regardless, the new design fixes this test_dummy_encryption deadlock as
well as potential deadlocks with fs reclaim, by setting up any needed
fscrypt_info structs prior to taking so many locks.
The test_dummy_encryption deadlock was reported by Daniel Rosenberg.
Reported-by: Daniel Rosenberg <drosen@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Convert ext4 to use the new functions fscrypt_prepare_new_inode() and
fscrypt_set_context(). This avoids calling
fscrypt_get_encryption_info() from within a transaction, which can
deadlock because fscrypt_get_encryption_info() isn't GFP_NOFS-safe.
For more details about this problem, see the earlier patch
"fscrypt: add fscrypt_prepare_new_inode() and fscrypt_set_context()".
Link: https://lore.kernel.org/r/20200917041136.178600-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
To compute a new inode's xattr credits, we need to know whether the
inode will be encrypted or not. When we switch to use the new helper
function fscrypt_prepare_new_inode(), we won't find out whether the
inode will be encrypted until slightly later than is currently the case.
That will require moving the code block that computes the xattr credits.
To make this easier and reduce the length of __ext4_new_inode(), move
this code block into a new function ext4_xattr_credits_for_new_inode().
Link: https://lore.kernel.org/r/20200917041136.178600-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
fscrypt_get_encryption_info() is intended to be GFP_NOFS-safe. But
actually it isn't, since it uses functions like crypto_alloc_skcipher()
which aren't GFP_NOFS-safe, even when called under memalloc_nofs_save().
Therefore it can deadlock when called from a context that needs
GFP_NOFS, e.g. during an ext4 transaction or between f2fs_lock_op() and
f2fs_unlock_op(). This happens when creating a new encrypted file.
We can't fix this by just not setting up the key for new inodes right
away, since new symlinks need their key to encrypt the symlink target.
So we need to set up the new inode's key before starting the
transaction. But just calling fscrypt_get_encryption_info() earlier
doesn't work, since it assumes the encryption context is already set,
and the encryption context can't be set until the transaction.
The recently proposed fscrypt support for the ceph filesystem
(https://lkml.kernel.org/linux-fscrypt/20200821182813.52570-1-jlayton@kernel.org/T/#u)
will have this same ordering problem too, since ceph will need to
encrypt new symlinks before setting their encryption context.
Finally, f2fs can deadlock when the filesystem is mounted with
'-o test_dummy_encryption' and a new file is created in an existing
unencrypted directory. Similarly, this is caused by holding too many
locks when calling fscrypt_get_encryption_info().
To solve all these problems, add new helper functions:
- fscrypt_prepare_new_inode() sets up a new inode's encryption key
(fscrypt_info), using the parent directory's encryption policy and a
new random nonce. It neither reads nor writes the encryption context.
- fscrypt_set_context() persists the encryption context of a new inode,
using the information from the fscrypt_info already in memory. This
replaces fscrypt_inherit_context().
Temporarily keep fscrypt_inherit_context() around until all filesystems
have been converted to use fscrypt_set_context().
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
udf_process_sequence() allocates temporary array for processing
partition descriptors on volume which it fails to free. Free the array
when it is not needed anymore.
Fixes: 7b78fd02fb ("udf: Fix handling of Partition Descriptors")
CC: stable@vger.kernel.org
Reported-by: syzbot+128f4dd6e796c98b3760@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
After commit 9293fcfbc1 ("udf: Remove struct ustr as non-needed
intermediate storage"), the variable ret is being initialized with
'-ENOMEM' that is meaningless. So remove it.
Link: https://lore.kernel.org/r/20200922081322.70535-1-jingxiangfeng@huawei.com
Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The following sequence of commands,
mkfs.xfs -f -m reflink=0 -r rtdev=/dev/loop1,size=10M /dev/loop0
mount -o rtdev=/dev/loop1 /dev/loop0 /mnt
xfs_growfs /mnt
... causes the following call trace to be printed on the console,
XFS: Assertion failed: (bip->bli_flags & XFS_BLI_STALE) || (xfs_blft_from_flags(&bip->__bli_format) > XFS_BLFT_UNKNOWN_BUF && xfs_blft_from_flags(&bip->__bli_format) < XFS_BLFT_MAX_BUF), file: fs/xfs/xfs_buf_item.c, line: 331
Call Trace:
xfs_buf_item_format+0x632/0x680
? kmem_alloc_large+0x29/0x90
? kmem_alloc+0x70/0x120
? xfs_log_commit_cil+0x132/0x940
xfs_log_commit_cil+0x26f/0x940
? xfs_buf_item_init+0x1ad/0x240
? xfs_growfs_rt_alloc+0x1fc/0x280
__xfs_trans_commit+0xac/0x370
xfs_growfs_rt_alloc+0x1fc/0x280
xfs_growfs_rt+0x1a0/0x5e0
xfs_file_ioctl+0x3fd/0xc70
? selinux_file_ioctl+0x174/0x220
ksys_ioctl+0x87/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x3e/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This occurs because the buffer being formatted has the value of
XFS_BLFT_UNKNOWN_BUF assigned to the 'type' subfield of
bip->bli_formats->blf_flags.
This commit fixes the issue by assigning one of XFS_BLFT_RTSUMMARY_BUF
and XFS_BLFT_RTBITMAP_BUF to the 'type' subfield of
bip->bli_formats->blf_flags before committing the corresponding
transaction.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The inode extent truncate path unmaps extents from the inode block
mapping, finishes deferred ops to free the associated extents and
then explicitly rolls the transaction before processing the next
extent. The latter extent roll is spurious as xfs_defer_finish()
always returns a clean transaction and automatically relogs inodes
attached to the transaction (with lock_flags == 0). This can
unnecessarily increase the number of log ticket regrants that occur
during a long running truncate operation. Remove the explicit
transaction roll.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
A mirror index is always of type u32.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Pass the full length to iomap_zero() and dax_iomap_zero(), and have
them return how many bytes they actually handled. This is preparatory
work for handling THP, although it looks like DAX could actually take
advantage of it if there's a larger contiguous area.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
iomap_write_end cannot return an error, so switch it to return
size_t instead of int and remove the error checking from the callers.
Also convert the arguments to size_t from unsigned int, in case anyone
ever wants to support a page size larger than 2GB.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of counting bio segments, count the number of bytes submitted.
This insulates us from the block layer's definition of what a 'same page'
is, which is not necessarily clear once THPs are involved.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of counting bio segments, count the number of bytes submitted.
This insulates us from the block layer's definition of what a 'same page'
is, which is not necessarily clear once THPs are involved.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Size the uptodate array dynamically to support larger pages in the
page cache. With a 64kB page, we're only saving 8 bytes per page today,
but with a 2MB maximum page size, we'd have to allocate more than 4kB
per page. Add a few debugging assertions.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that the bitmap is protected by a spinlock, we can use the
more efficient bitmap ops instead of individual test/set bit ops.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We can skip most of the initialisation, although spinlocks still
need explicit initialisation as architectures may use a non-zero
value to indicate unlocked. The comment is no longer useful as
attach_page_private() handles the refcount now.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This helper is useful for both THPs and for supporting block size larger
than page size. Convert all users that I could find (we have a few
different ways of writing this idiom, and I may have missed some).
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
If iomap_unshare_actor() unshares to an inline iomap, the page was
not being flushed. block_write_end() and __iomap_write_end() already
contain flushes, so adding it to iomap_write_end_inline() seems like
the best place. That means we can remove it from iomap_write_actor().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
While it is true that reading from an unmirrored source always uses
index 0, that is no longer true for mirrored sources when we fail over.
Fixes: 563c53e73b ("NFS: Fix flexfiles read failover")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The hash_lock field of the cache structure was a leftover
of a previous iteration of the code. It is now unused,
so remove it.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Convert the uses of fallthrough comments to fallthrough macro. Please see
commit 294f69e662 ("compiler_attributes.h: Add 'fallthrough' pseudo
keyword for switch/case use") for detail.
Signed-off-by: Hongxiang Lou <louhongxiang@huawei.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The variable error is ssize_t, which is signed and will
cast to unsigned when comapre with variable size, so add
a check to avoid unexpected result in case of negative
value of error.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The pointer clnt is being initialized with a value that is never
read and so this is assignment redundant and can be removed. The
pointer can removed because it is being used as a temporary
variable and it is clearer to make the direct assignment and remove
it completely.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
A previous commit unified how we handle prep for these two functions,
but this means that we check the allowed context (SQPOLL, specifically)
later than we should. Move the ring type checking into the two parent
functions, instead of doing it after we've done some setup work.
Fixes: ec65fea5a8 ("io_uring: deduplicate io_openat{,2}_prep()")
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These will naturally fail when attempted through SQPOLL, but either
with -EFAULT or -EBADF. Make it explicit that these are not workable
through SQPOLL and return -EINVAL, just like other ops that need to
use ->files.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some block devices, like dm, bubble back -EAGAIN through the completion
handler. We check for this in io_read(), but don't honor it for when
we have copied the iov. Return -EAGAIN for this case before retrying,
to force punt to io-wq.
Fixes: bcf5a06304 ("io_uring: support true async buffered reads, if file provides it")
Reported-by: Zorro Lang <zlang@redhat.com>
Tested-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we already have mapped the necessary data for retry, then don't set
it up again. It's a pointless operation, and we leak the iovec if it's
a large (non-stack) vec.
Fixes: b63534c41e ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
changed ctl_table.proc_handler to take a kernel pointer. Adjust the
definition of dirtytime_interval_handler to match its prototype in
linux/writeback.h which fixes the following sparse error/warning:
fs/fs-writeback.c:2189:50: warning: incorrect type in argument 3 (different address spaces)
fs/fs-writeback.c:2189:50: expected void *
fs/fs-writeback.c:2189:50: got void [noderef] __user *buffer
fs/fs-writeback.c:2184:5: error: symbol 'dirtytime_interval_handler' redeclared with different type (incompatible argument 3 (different address spaces)):
fs/fs-writeback.c:2184:5: int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... )
fs/fs-writeback.c: note: in included file:
./include/linux/writeback.h:374:5: note: previously declared as:
./include/linux/writeback.h:374:5: int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... )
Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lkml.kernel.org/r/20200907093140.13434-1-tklauser@distanz.ch
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit 0615090c50 ("erofs: convert compressed files from
readpages to readahead"), add_to_page_cache_lru() was moved to mm
code, so that in below call path, no page will be cached into
@pagepool list or grabbed from @pagepool list:
- z_erofs_readpage
- z_erofs_do_read_page
- preload_compressed_pages
- erofs_allocpage
Let's get rid of this unneeded @pagepool parameter.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200917011821.22767-1-yuchao0@huawei.com
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Don't recheck it since xattr_permission() already
checks CAP_SYS_ADMIN capability.
Just follow 5d3ce4f701 ("f2fs: avoid duplicated permission check for "trusted." xattrs")
Reported-by: Hongyu Jin <hongyu.jin@unisoc.com>
[ Gao Xiang: since it could cause some complex Android overlay
permission issue as well on android-5.4+, it'd be better to
backport to 5.4+ rather than pure cleanup on mainline. ]
Cc: <stable@vger.kernel.org> # 5.4+
Link: https://lore.kernel.org/r/20200811070020.6339-1-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
While it is true that reading from an unmirrored source always uses
index 0, that is no longer true for mirrored sources when we fail over.
Fixes: 563c53e73b ("NFS: Fix flexfiles read failover")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Submounts have their own superblock, which needs to be initialized.
However, they do not have a fuse_fs_context associated with them, and
the root node's attributes should be taken from the mountpoint's node.
Extend fuse_fill_super_common() to work for submounts by making the @ctx
parameter optional, and by adding a @submount_finode parameter.
(There is a plain "unsigned" in an existing code block that is being
indented by this commit. Extend it to "unsigned int" so checkpatch does
not complain.)
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We want to allow submounts for the same fuse_conn, but with different
superblocks so that each of the submounts has its own device ID. To do
so, we need to split all mount-specific information off of fuse_conn
into a new fuse_mount structure, so that multiple mounts can share a
single fuse_conn.
We need to take care only to perform connection-level actions once (i.e.
when the fuse_conn and thus the first fuse_mount are established, or
when the last fuse_mount and thus the fuse_conn are destroyed). For
example, fuse_sb_destroy() must invoke fuse_send_destroy() until the
last superblock is released.
To do so, we keep track of which fuse_mount is the root mount and
perform all fuse_conn-level actions only when this fuse_mount is
involved.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With the last commit, all functions that handle some existing fuse_req
no longer need to be given the associated fuse_conn, because they can
get it from the fuse_req object.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Every fuse_req belongs to a fuse_conn. Right now, we always know which
fuse_conn that is based on the respective device, but we want to allow
multiple (sub)mounts per single connection, and then the corresponding
filesystem is not going to be so trivial to obtain.
Storing a pointer to the associated fuse_conn in every fuse_req will
allow us to trivially find any request's superblock (and thus
filesystem) even then.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
After unlock_request() pages from the ap->pages[] array may be put (e.g. by
aborting the connection) and the pages can be freed.
Prevent use after free by grabbing a reference to the page before calling
unlock_request().
The original patch was created by Pradeep P V K.
Reported-by: Pradeep P V K <ppvk@codeaurora.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
the callers rely upon having any iov_iter_truncate() done inside
->direct_IO() countered by iov_iter_reexpand().
Reported-by: Qian Cai <cai@redhat.com>
Tested-by: Qian Cai <cai@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes the following W=1 kernel build warning(s):
fs/ubifs/tnc.c:3479: warning: Excess function parameter 'inum' description in 'dbg_check_inode_size'
fs/ubifs/tnc.c:366: warning: Excess function parameter 'node' description in 'lnc_free'
@inum in 'dbg_check_inode_size' should be @inode, fix it.
@node in 'lnc_free' is not in use, Remove it.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Fixes the following W=1 kernel build warning(s):
fs/ubifs/replay.c:942: warning: Excess function parameter 'ref_lnum' description in 'validate_ref'
fs/ubifs/replay.c:942: warning: Excess function parameter 'ref_offs' description in 'validate_ref'
They're not in use. Remove them.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Fixes the following W=1 kernel build warning(s):
fs/ubifs/gc.c:70: warning: Excess function parameter 'buf' description in 'switch_gc_head'
fs/ubifs/gc.c:70: warning: Excess function parameter 'len' description in 'switch_gc_head'
fs/ubifs/gc.c:70: warning: Excess function parameter 'lnum' description in 'switch_gc_head'
fs/ubifs/gc.c:70: warning: Excess function parameter 'offs' description in 'switch_gc_head'
They're not in use. Remove them.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Fixes the following W=1 kernel build warning(s):
fs/ubifs/auth.c:66: warning: Excess function parameter 'hash' description in 'ubifs_prepare_auth_node'
Rename hash to inhash.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Following process will trigger ubifs_err:
1. useradd -m freg (Under root)
2. cd /home/freg && mkdir mp (Under freg)
3. mount -t ubifs /dev/ubi0_0 /home/freg/mp (Under root)
4. cd /home/freg && echo 123 > mp/a (Under root)
5. cd mp && chown freg a && chgrp freg a && chmod 777 a (Under root)
6. chattr +i a (Under freg)
UBIFS error (ubi0:0 pid 1723): ubifs_ioctl [ubifs]: can't modify inode
65 attributes
chattr: Operation not permitted while setting flags on a
This is not an UBIFS problem, it was caused by task priviliage checking
on file operations. Remove error message printing from kernel just like
other filesystems (eg. ext4), since we already have enough information
from userspace tools.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Fix some potential memory leaks in error handling branches while
iterating dent entries. For example, function dbg_check_dir()
forgets to free pdent if it exists.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: <stable@vger.kernel.org>
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Signed-off-by: Richard Weinberger <richard@nod.at>
Fix some potential memory leaks in error handling branches while
iterating xattr entries. For example, function ubifs_tnc_remove_ino()
forgets to free pxent if it exists. Similar problems also exist in
ubifs_purge_xattrs(), ubifs_add_orphan() and ubifs_jnl_write_inode().
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: <stable@vger.kernel.org>
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Signed-off-by: Richard Weinberger <richard@nod.at>
Fold the misaligned u64 workarounds into the main quotactl flow instead
of implementing a separate compat syscall handler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
After client is done with the COPY operation, it needs to invalidate
its pagecache (as it did no reading or writing of the data locally)
and it needs to invalidate it's attributes just like it would have
for a read on the source file and write on the destination file.
Once the linux server started giving out read delegations to
read+write opens, the destination file of the copy_file range
started having delegations and not doing syncup on close of the
file leading to xfstest failures for generic/430,431,432,433,565.
v2: changing cache_validity needs to be protected by the i_lock.
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Fixes: 2e72448b07 ("NFS: Add COPY nfs operation")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
nfs_readdir_page_filler() iterates over entries in a directory, reusing
the same security label buffer, but does not reset the buffer's length.
This causes decode_attr_security_label() to return -ERANGE if an entry's
security label is longer than the previous one's. This error, in
nfs4_decode_dirent(), only gets passed up as -EAGAIN, which causes another
failed attempt to copy into the buffer. The second error is ignored and
the remaining entries do not show up in ls, specifically the getdents64()
syscall.
Reproduce by creating multiple files in NFS and giving one of the later
files a longer security label. ls will not see that file nor any that are
added afterwards, though they will exist on the backend.
In nfs_readdir_page_filler(), reset security label buffer length before
every reuse
Signed-off-by: Jeffrey Mitchell <jeffrey.mitchell@starlab.io>
Fixes: b4487b9354 ("nfs: Fix getxattr kernel panic and memory overflow")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The V4 filesystem format contains known weaknesses in the on-disk format
that make metadata verification diffiult. In addition, the format does
not support dates past 2038 and will not be upgraded to do so. We
should start the process of retiring the old format to close off attack
surfaces and to encourage users to migrate onto V5.
Therefore, make XFS V4 support a configurable option. For the first
period it will be default Y in case some distributors want to withdraw
support early; for the second period it will be default N so that anyone
who wishes to continue support can do so; and after that, support will
be removed from the kernel. Dates for these events have been added to
the upstream kernel.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
While running generic/042 with -drtinherit=1 set in MKFS_OPTIONS, I
observed that the kernel will gladly set the realtime flag on any file
created on the loopback filesystem even though that filesystem doesn't
actually have a realtime device attached. This leads to verifier
failures and doesn't make any sense, so be smarter about this.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Make sure that any fallocate operation that requires the range to be
block-aligned also checks that the range is aligned to the realtime
extent size.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Hoist the code that propagates di_flags and di_flags2 from a parent to a
new child into separate functions. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
There's an overflow bug in the realtime allocator. If the rt volume is
large enough to handle a single allocation request that is larger than
the maximum bmap extent length and the rt bitmap ends exactly on a
bitmap block boundary, it's possible that the near allocator will try to
check the freeness of a range that extends past the end of the bitmap.
This fails with a corruption error and shuts down the fs.
Therefore, constrain maxlen so that the range scan cannot run off the
end of the rt bitmap.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes coccicheck warning:
fs/xfs/xfs_icache.c:1214:2-3: Unneeded semicolon
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Commit 5833112df7 tried to make it so that a remap operation would
force the log out to disk if the filesystem is mounted with mandatory
synchronous writes. Unfortunately, that commit failed to handle the
case where the inode or the file descriptor require mandatory
synchronous writes.
Refactor the check into into a helper that will look for all three
conditions, and now we can treat reflink just like any other synchronous
write.
Fixes: 5833112df7 ("xfs: reflink should force the log out if mounted with wsync")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_attr_sf_totsize() requires access to xfs_inode structure, so, once
xfs_attr_shortform_addname() is its only user, move it to xfs_attr.c
instead of playing with more #includes.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
nameval is a variable-size array, so, define it as it, and remove all
the -1 magic number subtractions
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This patch aims to replace kmem_zalloc_large() with global kernel memory
API. So, all its callers are now using kvzalloc() directly, so kmalloc()
fallsback to vmalloc() automatically.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Enable the big timestamp feature.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a couple of tracepoints so that we can check the timestamp limits
being set on inodes and quotas.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Enable the bigtime feature for quota timers. We decrease the accuracy
of the timers to ~4s in exchange for being able to set timers up to the
bigtime maximum.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Redesign the ondisk inode timestamps to be a simple unsigned 64-bit
counter of nanoseconds since 14 Dec 1901 (i.e. the minimum time in the
32-bit unix time epoch). This enables us to handle dates up to 2486,
which solves the y2038 problem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Redefine xfs_ictimestamp_t as a uint64_t typedef in preparation for the
bigtime functionality. Preserve the legacy structure format so that we
can let the compiler take care of the masking and shifting.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Redefine xfs_timestamp_t as a __be64 typedef in preparation for the
bigtime functionality. Preserve the legacy structure format so that we
can let the compiler take care of masking and shifting.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Move this function to xfs_inode_item_recover.c since there's only one
caller of it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Refactor quota timestamp encoding and decoding into helper functions so
that we can add extra behavior in the next patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Refactor the code that sets the default quota grace period into a helper
function so that we can override the ondisk behavior later.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Define explicit limits on the range of quota grace period expiration
timeouts and refactor the code that modifies the timeouts into helpers
that clamp the values appropriately. Note that we'll refactor the
default grace period timer separately.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Formally define the inode timestamp ranges that existing filesystems
support, and switch the vfs timetamp ranges to use it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add the necessary bits to the online repair code to support logging the
inode btree counters when rebuilding the btrees, and to support fixing
the counters when rebuilding the AGI.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add the necessary bits to the online scrub code to check the inode btree
counters when enabled.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Now that we have reliable finobt block counts, use them to speed up the
per-AG block reservation calculations at mount time.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add a btree block usage counters for both inode btrees to the AGI header
so that we don't have to walk the entire finobt at mount time to create
the per-AG reservations.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Instead of poking deeply into buffer cache internals when re-reading the
superblock during log recovery just generalize _xfs_buf_read and use it
there. Note that we don't have to explicitly set up the ops as they
must be set from the initial read.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Merge xfs_getsb into its only caller, and clean that one up a little bit
as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove the mp argument as this function is only called in transaction
context, and open code xfs_getsb given that the function already accesses
the buffer pointer in the mount point directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The log recovery I/O completion handler does not substancially differ from
the normal one except for the fact that it:
a) never retries failed writes
b) can have log items that aren't on the AIL
c) never has inode/dquot log items attached and thus don't need to
handle them
Add conditionals for (a) and (b) to the ioend code, while (c) doesn't
need special handling anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Clear the flags at the end of xfs_buf_ioend so that they can be used
during the completion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reuse xfs_buf_item_relse instead of duplicating it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that all the actual error handling is in a single place,
xfs_buf_ioend_disposition just needs to return true if took ownership of
the buffer, or false if not instead of the tristate. Also move the
error check back in the caller to optimize for the fast path, and give
the function a better fitting name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Keep all the error handling code together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Merge xfs_buf_ioerror_retry into its only caller to make the resubmission
flow a little easier to follow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_buf_ioerror_fail_without_retry is a somewhat weird function in
that it has two trivial checks that decide the return value, while
the rest implements a ratelimited warning. Just lift the two checks
into the caller, and give the remainder a suitable name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
No need to keep a separate helper for this logic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move the buffer retry state machine logic to xfs_buf.c and call it once
from xfs_ioend instead of duplicating it three times for the three kinds
of buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move the log recovery I/O completion handling entirely into the log
recovery code, and re-arrange the normal I/O completion handler flow
to prepare to lifting more logic into common code in the next commits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Handle the no-error case in xfs_buf_iodone_error as well, and to clarify
the code rename the function, use the actual enum type as return value
and then switch on it in the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.
This is similar to commit 38cf307c1f ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:
call_usermodehelper()
kernel_execve()
old_mm = current->mm;
active_mm = current->active_mm;
*** preempt *** --------------------> schedule()
prev->active_mm = NULL;
mmdrop(prev active_mm);
...
<-------------------- schedule()
current->mm = mm;
current->active_mm = mm;
if (!old_mm)
mmdrop(active_mm);
If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
NVMe Zoned Namespace introduced the concept of active zones, which are
zones in the implicit open, explicit open or closed condition. Drives may
have a limit on the number of zones that can be simultaneously active.
This potential limitation translate into a risk for applications to see
write IO errors due to this limit if the zone of a file being written to is
not already active when a write request is issued.
To avoid these potential errors, the zone of a file can explicitly be made
active using an open zone command when the file is open for the first
time. If the zone open command succeeds, the application is then
guaranteed that write requests can be processed. This indirect management
of active zones relies on the maximum number of open zones of a drive,
which is always lower or equal to the maximum number of active zones.
On the first open of a sequential zone file, send a REQ_OP_ZONE_OPEN
command to the block device. Conversely, on the last release of a zone
file and send a REQ_OP_ZONE_CLOSE to the device if the zone is not full or
empty.
As truncating a zone file to 0 or max can deactivate a zone as well, we
need to serialize against truncates and also be careful not to close a
zone as the file may still be open for writing, e.g. the user called
ftruncate(). If the zone file is not open and a process does a truncate(),
then no close operation is needed.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Subsequent patches need to call zonefs_io_error() with the i_truncate_mutex
already held, so factor out the body of zonefs_io_error() into
__zonefs_io_error() which can be called from with the i_truncate_mutex
held.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Introduce a helper function for sending zone management commands to the
block device.
As zone management commands can change a zone write pointer position
reflected in the size of the zone file, this function expects the truncate
mutex to be held.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9fju4ACgkQxWXV+ddt
WDukWw//d3D33s1SnHI2ooub6VllM5xC/VCB5hfpHdeuLfhIR4iFrXlKCwFNuyuF
A1CgKu7W4eMkQ9E6dKIvYhRQj9ELcHWxS1LRUkuXVGojvPVQAbGhUibjcmTy6xoi
t4KkvdI0t/puEwRTrF2ooInWzdbiJvzCuCHLWkX4wK0IAkaVt+IK2ZJuSTCZhUEv
Gx5CSLqP6zq/XD/03yWQvILE3MoM9BOc28EQ345dhEu+SWWpdhzy4W825A8xUe3Q
bThcK8vJgeLUo/KVtUJNLoFTl+2BpxzaiHWqh+3plGjQ1qtZyDgFhZOoP0JEJb/f
qEnyN2XbY26p6nj2J+34Euz5OX65ot6bePx6v/h5jY+UjNa8mpZD6d4Y+OII0XW0
kCLPyaNc8P6IkmAYzSLQPmrjCoRgR/2we8HRBylTWaeEcvpduZs/5YlJNO9q59Et
voclNx6kk4AdFjSxXHwERlSA3v4+pkq41bjnbpv7a9FGRBUJGVfJFCNukuhCx5Oi
nUKg7jBJEZNhE2DOA0MKnrZTBnufkLCiH+f/7pn+ypPdw1dna/ix9Suwbz4+UElx
TtwGLF99CQCX09ieJBVWB90a3xgQBIZLaqu1HtZ1u7dzKkutl+uKMcnfEEtLSfJy
qMLHHsRQj/iwytQHSdWboL6xQCUSkSefPTmmNfstaKSxeq6w0Q0=
=pdp0
-----END PGP SIGNATURE-----
Merge tag 'for-5.9-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One of the recent lockdep fixes introduced a bug that breaks the
search ioctl, which is used by some applications (bees, compsize). The
patch made it to stable trees so we need this fixup to make it work
again"
* tag 'for-5.9-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix wrong address when faulting in pages in the search ioctl
After commit 0b6d4ca04a ("f2fs: don't return vmalloc() memory from
f2fs_kmalloc()"), f2fs_k{m,z}alloc() will not return vmalloc()'ed
memory, so clean up to use kfree() instead of kvfree() to free
vmalloc()'ed memory.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This isn't safe, and isn't needed either. We are guaranteed that any
work we queue is on a live task (and will be run), or it goes to
our backup io-wq threads if the task is exiting.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If task_work ends up being marked for cancelation, we go through a
cancelation helper instead of the queue path. In converting task_work to
always hold a ctx reference, this path was missed. Make sure that
io_req_task_cancel() puts the reference that is being held against the
ctx.
Fixes: 6d816e088c ("io_uring: hold 'ctx' reference around task_work queue + execute")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When faulting in the pages for the user supplied buffer for the search
ioctl, we are passing only the base address of the buffer to the function
fault_in_pages_writeable(). This means that after the first iteration of
the while loop that searches for leaves, when we have a non-zero offset,
stored in 'sk_offset', we try to fault in a wrong page range.
So fix this by adding the offset in 'sk_offset' to the base address of the
user supplied buffer when calling fault_in_pages_writeable().
Several users have reported that the applications compsize and bees have
started to operate incorrectly since commit a48b73eca4 ("btrfs: fix
potential deadlock in the search ioctl") was added to stable trees, and
these applications make heavy use of the search ioctls. This fixes their
issues.
Link: https://lore.kernel.org/linux-btrfs/632b888d-a3c3-b085-cdf5-f9bb61017d92@lechevalier.se/
Link: https://github.com/kilobyte/compsize/issues/34
Fixes: a48b73eca4 ("btrfs: fix potential deadlock in the search ioctl")
CC: stable@vger.kernel.org # 4.4+
Tested-by: A L <mail@lechevalier.se>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fixes the following W=1 kernel build warning(s):
fs/ext2/balloc.c:203: warning: Excess function parameter 'rb_root' description in '__rsv_window_dump'
fs/ext2/balloc.c:294: warning: Excess function parameter 'rb_root' description in 'search_reserve_window'
fs/ext2/balloc.c:878: warning: Excess function parameter 'rsv' description in 'alloc_new_reservation'
Link: https://lore.kernel.org/r/20200911114036.60616-1-wanghai38@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Always grab work environment for deferred links. The assumption that we
will be running it always from the task in question is false, as exiting
tasks may mean that we're deferring this one to a thread helper. And at
that point it's too late to grab the work environment.
Fixes: debb85f496 ("io_uring: factor out grab_env() from defer_prep()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Here are some small driver core and debugfs fixes for 5.9-rc5
Included in here are:
- firmware loader memory leak fix
- firmware loader testing fixes for non-EFI systems
- device link locking fixes found by lockdep
- kobject_del() bugfix that has been affecting some callers
- debugfs minor fix
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCX13Zhw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylfNQCfX4Bx9J1aLGr0/MOBwlXXEycChE0AmwQ9rXa7
u5Przdz+fMr1mLyNBaY5
=4kmo
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are some small driver core and debugfs fixes for 5.9-rc5
Included in here are:
- firmware loader memory leak fix
- firmware loader testing fixes for non-EFI systems
- device link locking fixes found by lockdep
- kobject_del() bugfix that has been affecting some callers
- debugfs minor fix
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
test_firmware: Test platform fw loading on non-EFI systems
PM: <linux/device.h>: fix @em_pd kernel-doc warning
kobject: Drop unneeded conditional in __kobject_del()
driver core: Fix device_pm_lock() locking for device links
MAINTAINERS: Add the security document to SECURITY CONTACT
driver code: print symbolic error code
debugfs: Fix module state check condition
kobject: Restore old behaviour of kobject_del(NULL)
firmware_loader: fix memory leak for paged buffer
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9bj/UACgkQxWXV+ddt
WDs1HhAAgAvJVM2WJuMCjQhQiKljFjRT1a0Kbsp+9ayw5Q225t5S5kCMWrsA6mXF
/9bGRmmELm/Nr5pSH9hp5Bhbke0vNV+Y9XiRQXpegla4LLMF4MulVgADRIL3WoxO
ZAtNmZUokkjvB0CkzDuI7PqrF67TXLqV2hlctZo0p5SAFFgLaELyIYC6uAaO9Qo/
+EAAK+7oJyzWcUp44APu90wBbF79umwNVKEEkDfc6bwiA2Cut1JGzvPWgGvvQnta
fAd114LFViKg05GXcbnx4NxHYtf9tKHjDk9yYWssR+uV6vo/pWwAkDwYxXm/LzA4
Zv8QK5uvng1fW4eq9QkN3KflIDn+YhaH1jgwNcgyS+ZCdqZR1Mi949f+6Nj1fXt2
NeXOx3nhtqgNthKQNvHSMVJZrPjV3bdzOz+bULA+hMvTkr5gJy+ToAs30SLxGF5Y
BCJEE6b5M5Jnb+UHEBMuoxubBfmPHkY8LxfDzVWDLESsKcW2eYyeJyJXx4DNe/v9
O7Z5pcku+7R9LOlYQEzKeSuiYMqYLtmQtcNXyFBysksikjFJBWNgENna1LmgvmRH
j6fC5S9h4sIxzyKQkJgihIDt/a3f9WnhsoHw8EIn62tfdOIvMcT/xWq9YYgWaOjZ
H9040WXvEAFVcDn4cQ22DNgV+toJMpe0pLg6UXe7VtESUtbwMFM=
=JTfF
-----END PGP SIGNATURE-----
Merge tag 'for-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes:
- regression fix for a crash after failed snapshot creation
- one more lockep fix: use nofs allocation when allocating missing
device
- fix reloc tree leak on degraded mount
- make some extent buffer alignment checks less strict to mount
filesystems created by btrfs-convert"
* tag 'for-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix NULL pointer dereference after failure to create snapshot
btrfs: free data reloc tree on failed mount
btrfs: require only sector size alignment for parent eb bytenr
btrfs: fix lockdep splat in add_missing_dev
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl9c/IoACgkQiiy9cAdy
T1GB8Av/XCQ1NqifphSf9/AJ9GJRIJ9fl5pdoPDDPPpY88XGUBJ21frXJCm5Uv/5
MXmNltZnIPMqBa+K6QvMR2WjmGlE28zVaapiIXJK+ckaMClukvn09FOCKWmhVkgi
0m1LwLOCL92TslRxkmW7wI1IFdNFCDdcehQ9xvjkaCCkpkiZ6ec4cV68fvF9nQ0P
CD5I7E4m5KZiMDIyF1DGZvPEp/fVv5Wu8Fmvi/4DdUCMF8EXmBYrbCax6+SfqMEs
3ut3pJEiUPdaojZQKb2s0u4vf5zqLYI7azdplqeyEqACDvSmDDhiXyIddbLrT/It
OkfiffaRLXJADLEQrH1E1uVbABa7pY44LhorxV1J0+4T/v33kItbRcwLfi9LeOpn
jIIBXYzRP7Pmj2ymH9irf1F18z8O97gcQf8HiR8e6WaSC1Mf8GJLTcwZRsiR4J5H
4no9vDkL+T+lXLUGwnW8aRe10w2gRoIZeQxhkkWUZJZnf2x0NgecljckFOZaMQff
ycRefbVh
=JaQw
-----END PGP SIGNATURE-----
Merge tag '5.9-rc4-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fix from Steve French:
"A fix for lookup on DFS link when cifsacl or modefromsid is used"
* tag '5.9-rc4-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix DFS mount with cifsacl/modefromsid
The returned integer is not required anywhere. So we need to change
the return value to bool type.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
writepages() can be concurrently invoked for the same file by different
threads such as a thread fsyncing the file and a kworker kernel thread.
So, changing i_compr_blocks without protection is racy and we need to
protect it by changing it with atomic type value. Plus, we don't need
a 64bit value for i_compr_blocks, so just we will use a atomic value,
not atomic64.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
to keep consistent with behavior when passing compress mount option
to kernel w/o compression feature, so that mount may not fail on
such condition.
Reported-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As 5kft <5kft@5kft.org> reported:
kworker/u9:3: page allocation failure: order:9, mode:0x40c40(GFP_NOFS|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 3 PID: 8168 Comm: kworker/u9:3 Tainted: G C 5.8.3-sunxi #trunk
Hardware name: Allwinner sun8i Family
Workqueue: f2fs_post_read_wq f2fs_post_read_work
[<c010d6d5>] (unwind_backtrace) from [<c0109a55>] (show_stack+0x11/0x14)
[<c0109a55>] (show_stack) from [<c056d489>] (dump_stack+0x75/0x84)
[<c056d489>] (dump_stack) from [<c0243b53>] (warn_alloc+0xa3/0x104)
[<c0243b53>] (warn_alloc) from [<c024473b>] (__alloc_pages_nodemask+0xb87/0xc40)
[<c024473b>] (__alloc_pages_nodemask) from [<c02267c5>] (kmalloc_order+0x19/0x38)
[<c02267c5>] (kmalloc_order) from [<c02267fd>] (kmalloc_order_trace+0x19/0x90)
[<c02267fd>] (kmalloc_order_trace) from [<c047c665>] (zstd_init_decompress_ctx+0x21/0x88)
[<c047c665>] (zstd_init_decompress_ctx) from [<c047e9cf>] (f2fs_decompress_pages+0x97/0x228)
[<c047e9cf>] (f2fs_decompress_pages) from [<c045d0ab>] (__read_end_io+0xfb/0x130)
[<c045d0ab>] (__read_end_io) from [<c045d141>] (f2fs_post_read_work+0x61/0x84)
[<c045d141>] (f2fs_post_read_work) from [<c0130b2f>] (process_one_work+0x15f/0x3b0)
[<c0130b2f>] (process_one_work) from [<c0130e7b>] (worker_thread+0xfb/0x3e0)
[<c0130e7b>] (worker_thread) from [<c0135c3b>] (kthread+0xeb/0x10c)
[<c0135c3b>] (kthread) from [<c0100159>]
zstd may allocate large size memory for {,de}compression, it may cause
file copy failure on low-end device which has very few memory.
For decompression, let's just allocate proper size memory based on current
file's cluster size instead of max cluster size.
Reported-by: 5kft <5kft@5kft.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Current compr_blocks of superblock info is not 64bit value. We are
accumulating each i_compr_blocks count of inodes to this value and
those are 64bit values. So, need to change this to 64bit value.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Need to add block address range check to compressed file case and
avoid calling get_data_block_bmap() for compressed file.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When the move range ioctl is used, check the input and output position and
ensure that it is a non-negative value. Without this check
f2fs_get_dnode_of_data may hit a memmory bug.
Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Miss to update APP_DIRECT_IO/APP_DIRECT_READ_IO when receiving async DIO.
For example: fio -filename=/data/test.0 -bs=1m -ioengine=libaio -direct=1
-name=fill -size=10m -numjobs=1 -iodepth=32 -rw=write
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Instead of finding the first dirty page and then seeing if it matches
the index of a block that is NEW_ADDR, delay the lookup of the dirty
bit until we've actually found a block that's NEW_ADDR.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are several issues in current background GC algorithm:
- valid blocks is one of key factors during cost overhead calculation,
so if segment has less valid block, however even its age is young or
it locates hot segment, CB algorithm will still choose the segment as
victim, it's not appropriate.
- GCed data/node will go to existing logs, no matter in-there datas'
update frequency is the same or not, it may mix hot and cold data
again.
- GC alloctor mainly use LFS type segment, it will cost free segment
more quickly.
This patch introduces a new algorithm named age threshold based
garbage collection to solve above issues, there are three steps
mainly:
1. select a source victim:
- set an age threshold, and select candidates beased threshold:
e.g.
0 means youngest, 100 means oldest, if we set age threshold to 80
then select dirty segments which has age in range of [80, 100] as
candiddates;
- set candidate_ratio threshold, and select candidates based the
ratio, so that we can shrink candidates to those oldest segments;
- select target segment with fewest valid blocks in order to
migrate blocks with minimum cost;
2. select a target victim:
- select candidates beased age threshold;
- set candidate_radius threshold, search candidates whose age is
around source victims, searching radius should less than the
radius threshold.
- select target segment with most valid blocks in order to avoid
migrating current target segment.
3. merge valid blocks from source victim into target victim with
SSR alloctor.
Test steps:
- create 160 dirty segments:
* half of them have 128 valid blocks per segment
* left of them have 384 valid blocks per segment
- run background GC
Benefit: GC count and block movement count both decrease obviously:
- Before:
- Valid: 86
- Dirty: 1
- Prefree: 11
- Free: 6001 (6001)
GC calls: 162 (BG: 220)
- data segments : 160 (160)
- node segments : 2 (2)
Try to move 41454 blocks (BG: 41454)
- data blocks : 40960 (40960)
- node blocks : 494 (494)
IPU: 0 blocks
SSR: 0 blocks in 0 segments
LFS: 41364 blocks in 81 segments
- After:
- Valid: 87
- Dirty: 0
- Prefree: 4
- Free: 6008 (6008)
GC calls: 75 (BG: 76)
- data segments : 74 (74)
- node segments : 1 (1)
Try to move 12813 blocks (BG: 12813)
- data blocks : 12544 (12544)
- node blocks : 269 (269)
IPU: 0 blocks
SSR: 12032 blocks in 77 segments
LFS: 855 blocks in 2 segments
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Checking for the lack of epitems refering to the epoll we want to insert into
is not enough; we might have an insertion of that epoll into another one that
has already collected the set of files to recheck for excessive reverse paths,
but hasn't gotten to creating/inserting the epitem for it.
However, any such insertion in progress can be detected - it will update the
generation count in our epoll when it's done looking through it for files
to check. That gets done under ->mtx of our epoll and that allows us to
detect that safely.
We are *not* holding epmutex here, so the generation count is not stable.
However, since both the update of ep->gen by loop check and (later)
insertion into ->f_ep_link are done with ep->mtx held, we are fine -
the sequence is
grab epmutex
bump loop_check_gen
...
grab tep->mtx // 1
tep->gen = loop_check_gen
...
drop tep->mtx // 2
...
grab tep->mtx // 3
...
insert into ->f_ep_link
...
drop tep->mtx // 4
bump loop_check_gen
drop epmutex
and if the fastpath check in another thread happens for that
eventpoll, it can come
* before (1) - in that case fastpath is just fine
* after (4) - we'll see non-empty ->f_ep_link, slow path
taken
* between (2) and (3) - loop_check_gen is stable,
with ->mtx providing barriers and we end up taking slow path.
Note that ->f_ep_link emptiness check is slightly racy - we are protected
against insertions into that list, but removals can happen right under us.
Not a problem - in the worst case we'll end up taking a slow path for
no good reason.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This switches f2fs over to the generic support provided in
the previous patch.
Since casefolded dentries behave the same in ext4 and f2fs, we decrease
the maintenance burden by unifying them, and any optimizations will
immediately apply to both.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This adds general supporting functions for filesystems that use
utf8 casefolding. It provides standard dentry_operations and adds the
necessary structures in struct super_block to allow this standardization.
The new dentry operations are functionally equivalent to the existing
operations in ext4 and f2fs, apart from the use of utf8_casefold_hash to
avoid an allocation.
By providing a common implementation, all users can benefit from any
optimizations without needing to port over improvements.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This adds a case insensitive hash function to allow taking the hash
without needing to allocate a casefolded copy of the string.
The existing d_hash implementations for casefolding allocate memory
within rcu-walk, by avoiding it we can be more efficient and avoid
worrying about a failed allocation.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
refcount_t type variable should never be less than one, so it's a
little bit hard to understand when we use it to indicate pending
compressed page count, let's change to use atomic_t for better
readability.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes below compile warning reported by LKP
(kernel test robot)
cppcheck warnings: (new ones prefixed by >>)
>> fs/f2fs/file.c:761:9: warning: Identical condition 'err', second condition is always false [identicalConditionAfterEarlyExit]
return err;
^
fs/f2fs/file.c:753:6: note: first condition
if (err)
^
fs/f2fs/file.c:761:9: note: second condition
return err;
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
then, we can add specified entry into rb-tree with 64-bits segment time
as key.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't let f2fs inner GC ruins original aging degree of segment.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, once we update one block in segment, we will update mtime of
segment to last time, making aged segment becoming freshest, result in
that GC with cost benefit algorithm missing such segment, So this patch
changes to record mtime as average block updating time instead of last
updating time.
It's not needed to reset mtime for prefree segment, as se->valid_blocks
is zero, then old se->mtime won't take any weight with below calculation:
se->mtime = div_u64(se->mtime * se->valid_blocks + mtime,
se->valid_blocks + 1);
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previous implementation of aligned pinfile allocation will:
- allocate new segment on cold data log no matter whether last used
segment is partially used or not, it makes IOs more random;
- force concurrent cold data/GCed IO going into warm data area, it
can make a bad effect on hot/cold data separation;
In this patch, we introduce a new type of log named 'inmem curseg',
the differents from normal curseg is:
- it reuses existed segment type (CURSEG_XXX_NODE/DATA);
- it only exists in memory, its segno, blkofs, summary will not b
persisted into checkpoint area;
With this new feature, we can enhance scalability of log, special
allocators can be created for purposes:
- pure lfs allocator for aligned pinfile allocation or file
defragmentation
- pure ssr allocator for later feature
So that, let's update aligned pinfile allocation to use this new
inmem curseg fwk.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since DUMMY_WRITTEN_PAGE and ATOMIC_WRITTEN_PAGE have already been
converted as unsigned long type, we don't need do type casting again.
Signed-off-by: Xiaojun Wang <wangxiaojun11@huawei.com>
Reported-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
NVMe Zoned Namespace devices can have zone-capacity less than zone-size.
Zone-capacity indicates the maximum number of sectors that are usable in
a zone beginning from the first sector of the zone. This makes the sectors
sectors after the zone-capacity till zone-size to be unusable.
This patch set tracks zone-size and zone-capacity in zoned devices and
calculate the usable blocks per segment and usable segments per section.
If zone-capacity is less than zone-size mark only those segments which
start before zone-capacity as free segments. All segments at and beyond
zone-capacity are treated as permanently used segments. In cases where
zone-capacity does not align with segment size the last segment will start
before zone-capacity and end beyond the zone-capacity of the zone. For
such spanning segments only sectors within the zone-capacity are used.
During writes and GC manage the usable segments in a section and usable
blocks per segment. Segments which are beyond zone-capacity are never
allocated, and do not need to be garbage collected, only the segments
which are before zone-capacity needs to garbage collected.
For spanning segments based on the number of usable blocks in that
segment, write to blocks only up to zone-capacity.
Zone-capacity is device specific and cannot be configured by the user.
Since NVMe ZNS device zones are sequentially write only, a block device
with conventional zones or any normal block device is needed along with
the ZNS device for the metadata operations of F2fs.
A typical nvme-cli output of a zoned device shows zone start and capacity
and write pointer as below:
SLBA: 0x0 WP: 0x0 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x20000 WP: 0x20000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x40000 WP: 0x40000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
Here zone size is 64MB, capacity is 49MB, WP is at zone start as the zones
are in EMPTY state. For each zone, only zone start + 49MB is usable area,
any lba/sector after 49MB cannot be read or written to, the drive will fail
any attempts to read/write. So, the second zone starts at 64MB and is
usable till 113MB (64 + 49) and the range between 113 and 128MB is
again unusable. The next zone starts at 128MB, and so on.
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This introduces some bug fixes including 1) SMR drive fix, 2) infinite loop
when building free node ids, 3) EOF at DIO read.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl9Y0skACgkQQBSofoJI
UNKggA//a8cCkIx0wp1BEabPQ8x8TrHkiOVZX9zTbiaLkI9MLYweF0YFjbCx/i/t
41r6JbTxWLiM+GRI+u9kOrkBOUgTHAZThIBcWbfYKR5Jz7XdijhsuNYqRf8n33ya
u4PtT2AiMW3gt8gJML08eI/U+/rBBpI6fkc8Sza03iTCLjJiWwvZ99h2rPSfgmIw
UdyRuof3mPgUz12fQ8BoO/oQ5KhKgA7kxMsCc4UJE9VX/hBYamTfIEF1AmnxlyKn
HYhkXyuAmhEFEenBA+ts9YZXMubWHwcj4GlTkWTP6Ib+A3cIinHIqpzEaJnPQdWY
JPG72L2TNg0Mtl8w+46zJa1mr6KvXTA5GNowX4OTTf2oHxeHJ4sxax3cmvrDOIfe
7Ga8kGyi5OON1Hymwuyyz4KuKM9zvyUouX0i4wsiamm9kFQoxzNSbfsM5L0weCwM
MWVOFhtOUhrQdJ+zpbtJpF8Ijj3nuOoUpw3/hUhl3GM+J4bzE8MdzRmzB3JjkVCr
4mfYuXCITPOH31Fu3obXEFI/9kgyZdxk6OHsXLZCCfRCvgRTDfk5KFQx1OIY3jk0
I/yRjmvnOyD37GVXHyGZxjRplDCgCpTOL6HFk1GOLFUvuoQeOOqwwL0oa/J94ED0
nE80dqs4e8c3/YmUqcgiOm4d4EkdwVzhE9RxdUq7s2SJgh5Eu/o=
=gbal
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fixes from Jaegeuk Kim:
"Small bug fixes for:
- SMR drive fix
- infinite loop when building free node ids
- EOF at DIO read"
* tag 'f2fs-for-5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: Return EOF on unaligned end of file DIO read
f2fs: fix indefinite loop scanning for free nid
f2fs: Fix type of section block count variables
Remove the now unused check_disk_change helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Like check_disk_changed, except that it does not call ->revalidate_disk
but leaves that to the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When bringing (portions of) a page uptodate, we were marking blocks that
were zeroed as being uptodate, but not blocks that were read from storage.
Like the previous commit, this problem was found with generic/127 and
a kernel which failed readahead I/Os. This bug causes writes to be
silently lost when working with flaky storage.
Fixes: 9dc55f1389 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If we find a page in write_begin which is !Uptodate, we need
to clear any error on the page before starting to read data
into it. This matches how filemap_fault(), do_read_cache_page()
and generic_file_buffered_read() handle PageError on !Uptodate pages.
When calling iomap_set_range_uptodate() in __iomap_write_begin(), blocks
were not being marked as uptodate.
This was found with generic/127 and a specially modified kernel which
would fail (some) readahead I/Os. The test read some bytes in a prior
page which caused readahead to extend into page 0x34. There was
a subsequent write to page 0x34, followed by a read to page 0x34.
Because the blocks were still marked as !Uptodate, the read caused all
blocks to be re-read, overwriting the write. With this change, and the
next one, the bytes which were written are marked as being Uptodate, so
even though the page is still marked as !Uptodate, the blocks containing
the written data are not re-read from storage.
Fixes: 9dc55f1389 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When a direct I/O write falls back to buffered I/O entirely, dio->size
will be 0 in iomap_dio_complete. Function invalidate_inode_pages2_range
will try to invalidate the rest of the address space. If there are any
dirty pages in that range, the write will fail and a "Page cache
invalidation failure on direct I/O" error will be logged.
On gfs2, this can be reproduced as follows:
xfs_io \
-c "open -ft foo" -c "pwrite 4k 4k" -c "close" \
-c "open -d foo" -c "pwrite 0 4k"
Fix this by recognizing 0-length writes.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
It is trivial to trigger a WARN_ON_ONCE(1) in iomap_dio_actor() by
unprivileged users which would taint the kernel, or worse - panic if
panic_on_warn or panic_on_taint is set. Hence, just convert it to
pr_warn_ratelimited() to let users know their workloads are racing.
Thank Dave Chinner for the initial analysis of the racing reproducers.
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Add logic to free up a busy memory range. Freed memory range will be
returned to free pool. Add a worker which can be started to select
and free some busy memory ranges.
Process can also steal one of its busy dax ranges if free range is not
available. I will refer it to as direct reclaim.
If free range is not available and nothing can't be stolen from same
inode, caller waits on a waitq for free range to become available.
For reclaiming a range, as of now we need to hold following locks in
specified order.
down_write(&fi->i_mmap_sem);
down_write(&fi->dax->sem);
We look for a free range in following order.
A. Try to get a free range.
B. If not, try direct reclaim.
C. If not, wait for a memory range to become free
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This list will be used selecting fuse_dax_mapping to free when number of
free mappings drops below a threshold.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently in fuse we don't seem have any lock which can serialize fault
path with truncate/punch_hole path. With dax support I need one for
following reasons.
1. Dax requirement
DAX fault code relies on inode size being stable for the duration of
fault and want to serialize with truncate/punch_hole and they explicitly
mention it.
static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
const struct iomap_ops *ops)
/*
* Check whether offset isn't beyond end of file now. Caller is
* supposed to hold locks serializing us with truncate / punch hole so
* this is a reliable test.
*/
max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
2. Make sure there are no users of pages being truncated/punch_hole
get_user_pages() might take references to page and then do some DMA
to said pages. Filesystem might truncate those pages without knowing
that a DMA is in progress or some I/O is in progress. So use
dax_layout_busy_page() to make sure there are no such references
and I/O is not in progress on said pages before moving ahead with
truncation.
3. Limitation of kvm page fault error reporting
If we are truncating file on host first and then removing mappings in
guest lateter (truncate page cache etc), then this could lead to a
problem with KVM. Say a mapping is in place in guest and truncation
happens on host. Now if guest accesses that mapping, then host will
take a fault and kvm will either exit to qemu or spin infinitely.
IOW, before we do truncation on host, we need to make sure that guest
inode does not have any mapping in that region or whole file.
4. virtiofs memory range reclaim
Soon I will introduce the notion of being able to reclaim dax memory
ranges from a fuse dax inode. There also I need to make sure that
no I/O or fault is going on in the reclaimed range and nobody is using
it so that range can be reclaimed without issues.
Currently if we take inode lock, that serializes read/write. But it does
not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem
for this purpose. It can be used to serialize with faults.
As of now, I am adding taking this semaphore only in dax fault path and
not regular fault path because existing code does not have one. May
be existing code can benefit from it as well to take care of some
races, but that we can fix later if need be. For now, I am just focussing
only on DAX path which is new path.
Also added logic to take fuse_inode->i_mmap_sem in
truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
fuse dax fault are mutually exlusive and avoid all the above problems.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is done along the lines of ext4 and xfs. I primarily wanted
->writepages hook at this time so that I could call into
dax_writeback_mapping_range(). This in turn will decide which pfns need to
be written back.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch looks into implemeting
read/write.
We make use of interval tree to keep track of per inode dax mappings.
Do not use dax for file extending writes, instead just send WRITE message
to daemon (like we do for direct I/O path). This will keep write and
i_size change atomic w.r.t crash.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The device communicates FUSE_SETUPMAPPING/FUSE_REMOVMAPPING alignment
constraints via the FUST_INIT map_alignment field. Parse this field and
ensure our DAX mappings meet the alignment constraints.
We don't actually align anything differently since our mappings are
already 2MB aligned. Just check the value when the connection is
established. If it becomes necessary to honor arbitrary alignments in
the future we'll have to adjust how mappings are sized.
The upshot of this commit is that we can be confident that mappings will
work even when emulating x86 on Power and similar combinations where the
host page sizes are different.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Divide the dax memory range into fixed size ranges (2MB for now) and put
them in a list. This will track free ranges. Once an inode requires a
free range, we will take one from here and put it in interval-tree
of ranges assigned to inode.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add a mount option to allow using dax with virtio_fs.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Setup a dax device.
Use the shm capability to find the cache entry and map it.
The DAX window is accessed by the fs/dax.c infrastructure and must have
struct pages (at least on x86). Use devm_memremap_pages() to map the
DAX window PCI BAR and allocate struct page.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This option was introduced so that for virtio_fs we don't show any mounts
options fuse_show_options(). Because we don't offer any of these options
to be controlled by mounter.
Very soon we are planning to introduce option "dax" which mounter should
be able to specify. And no_mount_options does not work anymore.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This reduces code duplication and make it little easier to read code.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
virtiofs device has a range of memory which is mapped into file inodes
using dax. This memory is mapped in qemu on host and maps different
sections of real file on host. Size of this memory is limited
(determined by administrator) and depending on filesystem size, we will
soon reach a situation where all the memory is in use and we need to
reclaim some.
As part of reclaim process, we will need to make sure that there are
no active references to pages (taken by get_user_pages()) on the memory
range we are trying to reclaim. I am planning to use
dax_layout_busy_page() for this. But in current form this is per inode
and scans through all the pages of the inode.
We want to reclaim only a portion of memory (say 2MB page). So we want
to make sure that only that 2MB range of pages do not have any
references (and don't want to unmap all the pages of inode).
Hence, create a range version of this function named
dax_layout_busy_page_range() which can be used to pass a range which
needs to be unmapped.
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-nvdimm@lists.01.org
Cc: Jan Kara <jack@suse.cz>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: "Weiny, Ira" <ira.weiny@intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Soon, XFS will support quota grace period expiration timestamps beyond
the year 2038, widen the timestamp fields to handle the extra time bits.
Internally, XFS now stores unsigned 34-bit quantities, so the extra 8
bits here should work fine. (Note that XFS is the only user of this
structure.)
Link: https://lore.kernel.org/r/20200909163413.GJ7955@magnolia
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Highlights include:
Bugfixes:
- Fix an NFS/RDMA resource leak
- Fix the error handling during delegation recall
- NFSv4.0 needs to return the delegation on a zero-stateid SETATTR
- Stop printk reading past end of string
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl9ZFYAACgkQZwvnipYK
APLg+RAArQ0J54M4vTg7avKhUEwIrAlPCFjHvZ5jtlXiY8JDT7Cy2lEo9W/pC9x2
BiV02H6seKXq6vKUHIBgzVq0BdZBKeWQcOpoO/dfvWSPs9u+lxKlOEwcdsaXwdXz
31u5HS4xHYg2SlYj+BcKGfVexcWVEVyPqqPvflGBZIlKfzQLHo9YY390deUHMC6o
HrRXWADvpYXC1sJb3mtNtCojqr9a5A8Ty4clT19YvdwQL7cUt3HjjsOvJfbmB9S+
fW5/u3sdWJ1nYoz8AxC+utIMNmtXFBUhW0Sg+TPWMJj8yG9rclAgTxbobhXyzGph
j2ZamPhUtpcSYXBlwiQCm7GbUIItnzHgU6MSCs/nq8AeDc3WEx4qVONVqNvNr/sY
1T3znylZpXCHvxLmDWzDGsW8XvZT1r86Lm6zrJCmjWm+eoSKBzeoENcXGsGGYuJu
6NGz7pgQbYMb9t7VfOEFSxxt5w0wt7nRyhV1R7taBhm5B9XjF+BOmJBI0epQ1S7i
XRIr7WqxT00wijWyunNCQZxi1aDMHVYZXPwaqkEHTwJqeDzCtmir+ajAnZQUgUId
1MNiv8BDoN5YlPmj/gt+E3kbyj0Pu7M+09NvVEKqG7j8W80ltf6eb85XGrq+vp1E
Y0lmDXElBdNo3AA+dBOmk+peoVv4bfoog5PymElaRiwRM25VCOM=
=3fw2
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Fix an NFS/RDMA resource leak
- Fix the error handling during delegation recall
- NFSv4.0 needs to return the delegation on a zero-stateid SETATTR
- Stop printk reading past end of string
* tag 'nfs-for-5.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: stop printk reading past end of string
NFS: Zero-stateid SETATTR should first return delegation
NFSv4.1 handle ERR_DELAY error reclaiming locking state on delegation recall
xprtrdma: Release in-flight MRs on disconnect
Reading past end of file returns EOF for aligned reads but -EINVAL for
unaligned reads on f2fs. While documentation is not strict about this
corner case, most filesystem returns EOF on this case, like iomap
filesystems. This patch consolidates the behavior for f2fs, by making
it return EOF(0).
it can be verified by a read loop on a file that does a partial read
before EOF (A file that doesn't end at an aligned address). The
following code fails on an unaligned file on f2fs, but not on
btrfs, ext4, and xfs.
while (done < total) {
ssize_t delta = pread(fd, buf + done, total - done, off + done);
if (!delta)
break;
...
}
It is arguable whether filesystems should actually return EOF or
-EINVAL, but since iomap filesystems support it, and so does the
original DIO code, it seems reasonable to consolidate on that.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If the sbi->ckpt->next_free_nid is not NAT block aligned and if there
are free nids in that NAT block between the start of the block and
next_free_nid, then those free nids will not be scanned in scan_nat_page().
This results into mismatch between nm_i->available_nids and the sum of
nm_i->free_nid_count of all NAT blocks scanned. And nm_i->available_nids
will always be greater than the sum of free nids in all the blocks.
Under this condition, if we use all the currently scanned free nids,
then it will loop forever in f2fs_alloc_nid() as nm_i->available_nids
is still not zero but nm_i->free_nid_count of that partially scanned
NAT block is zero.
Fix this to align the nm_i->next_scan_nid to the first nid of the
corresponding NAT block.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit da52f8ade4 ("f2fs: get the right gc victim section when section
has several segments") added code to count blocks of each section using
variables with type 'unsigned short', which has 2 bytes size in many
systems. However, the counts can be larger than the 2 bytes range and
type conversion results in wrong values. Especially when the f2fs
sections have blocks as many as USHRT_MAX + 1, the count is handled as 0.
This triggers eternal loop in init_dirty_segmap() at mount system call.
Fix this by changing the type of the variables to block_t.
Fixes: da52f8ade4 ("f2fs: get the right gc victim section when section has several segments")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
default_file_splice_write is the last piece of generic code that uses
set_fs to make the uaccess routines operate on kernel pointers. It
implements a "fallback loop" for splicing from files that do not actually
provide a proper splice_read method. The usual file systems and other
high bandwidth instances all provide a ->splice_read, so this just removes
support for various device drivers and procfs/debugfs files. If splice
support for any of those turns out to be important it can be added back
by switching them to the iter ops and using generic_file_splice_read.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Don't allow calling ->read or ->write with set_fs as a preparation for
killing off set_fs. All the instances that we use kernel_read/write on
are using the iter ops already.
If a file has both the regular ->read/->write methods and the iter
variants those could have different semantics for messed up enough
drivers. Also fails the kernel access to them in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Using the read_iter/write_iter interfaces allows for in-kernel users
to set sysctls without using set_fs(). Also, the buffer is a string,
so give it the real type of 'char *', not void *.
[AV: Christoph's fixup folded in]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Discarding blocks and buffers under a mounted filesystem is hardly
anything admin wants to do. Usually it will confuse the filesystem and
sometimes the loss of buffer_head state (including b_private field) can
even cause crashes like:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O --------- - - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
...
Call Trace:
__jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
kjournald2+0xbd/0x270 [jbd2]
So if we don't have block device open with O_EXCL already, claim the
block device while we truncate buffer cache. This makes sure any
exclusive block device user (such as filesystem) cannot operate on the
device while we are discarding buffer cache.
Reported-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: fix !CONFIG_BLOCK error in truncate_bdev_range()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When an encryption policy has the IV_INO_LBLK_32 flag set, the IV
generation method involves hashing the inode number. This is different
from fscrypt's other IV generation methods, where the inode number is
either not used at all or is included directly in the IVs.
Therefore, in principle IV_INO_LBLK_32 can work with any length inode
number. However, currently fscrypt gets the inode number from
inode::i_ino, which is 'unsigned long'. So currently the implementation
limit is actually 32 bits (like IV_INO_LBLK_64), since longer inode
numbers will have been truncated by the VFS on 32-bit platforms.
Fix fscrypt_supported_v2_policy() to enforce the correct limit.
This doesn't actually matter currently, since only ext4 and f2fs support
IV_INO_LBLK_32, and they both only support 32-bit inode numbers. But we
might as well fix it in case it matters in the future.
Ideally inode::i_ino would instead be made 64-bit, but for now it's not
needed. (Note, this limit does *not* prevent filesystems with 64-bit
inode numbers from adding fscrypt support, since IV_INO_LBLK_* support
is optional and is useful only on certain hardware.)
Fixes: e3b1078bed ("fscrypt: add support for IV_INO_LBLK_32 policies")
Reported-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200824203841.1707847-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
If block_write_full_page() is called for a page that is beyond current
inode size, it will truncate page buffers for the page and return 0.
This logic has been added in 2.5.62 in commit 81eb69062588 ("fix ext3
BUG due to race with truncate") in history.git tree to fix a problem
with ext3 in data=ordered mode. This particular problem doesn't exist
anymore because ext3 is long gone and ext4 handles ordered data
differently. Also normally buffers are invalidated by truncate code and
there's no need to specially handle this in ->writepage() code.
This invalidation of page buffers in block_write_full_page() is causing
issues to filesystems (e.g. ext4 or ocfs2) when block device is shrunk
under filesystem's hands and metadata buffers get discarded while being
tracked by the journalling layer. Although it is obviously "not
supported" it can cause kernel crashes like:
[ 7986.689400] BUG: unable to handle kernel NULL pointer dereference at
+0000000000000008
[ 7986.697197] PGD 0 P4D 0
[ 7986.699724] Oops: 0002 [#1] SMP PTI
[ 7986.703200] CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G
+O --------- - - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
[ 7986.716438] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
[ 7986.723462] RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
...
[ 7986.810150] Call Trace:
[ 7986.812595] __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
[ 7986.818408] jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
[ 7986.836467] kjournald2+0xbd/0x270 [jbd2]
which is not great. The crash happens because bh->b_private is suddently
NULL although BH_JBD flag is still set (this is because
block_invalidatepage() cleared BH_Mapped flag and subsequent bh lookup
found buffer without BH_Mapped set, called init_page_buffers() which has
rewritten bh->b_private). So just remove the invalidation in
block_write_full_page().
Note that the buffer cache invalidation when block device changes size
is already careful to avoid similar problems by using
invalidate_mapping_pages() which skips busy buffers so it was only this
odd block_write_full_page() behavior that could tear down bdev buffers
under filesystem's hands.
Reported-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While testing a weird problem with -o degraded, I noticed I was getting
leaked root errors
BTRFS warning (device loop0): writable mount is not allowed due to too many missing devices
BTRFS error (device loop0): open_ctree failed
BTRFS error (device loop0): leaked root -9-0 refcount 1
This is the DATA_RELOC root, which gets read before the other fs roots,
but is included in the fs roots radix tree. Handle this by adding a
btrfs_drop_and_free_fs_root() on the data reloc root if it exists. This
is ok to do here if we fail further up because we will only drop the ref
if we delete the root from the radix tree, and all other cleanup won't
be duplicated.
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
A completely sane converted fs will cause kernel warning at balance
time:
[ 1557.188633] BTRFS info (device sda7): relocating block group 8162107392 flags data
[ 1563.358078] BTRFS info (device sda7): found 11722 extents
[ 1563.358277] BTRFS info (device sda7): leaf 7989321728 gen 95 total ptrs 213 free space 3458 owner 2
[ 1563.358280] item 0 key (7984947200 169 0) itemoff 16250 itemsize 33
[ 1563.358281] extent refs 1 gen 90 flags 2
[ 1563.358282] ref#0: tree block backref root 4
[ 1563.358285] item 1 key (7985602560 169 0) itemoff 16217 itemsize 33
[ 1563.358286] extent refs 1 gen 93 flags 258
[ 1563.358287] ref#0: shared block backref parent 7985602560
[ 1563.358288] (parent 7985602560 is NOT ALIGNED to nodesize 16384)
[ 1563.358290] item 2 key (7985635328 169 0) itemoff 16184 itemsize 33
...
[ 1563.358995] BTRFS error (device sda7): eb 7989321728 invalid extent inline ref type 182
[ 1563.358996] ------------[ cut here ]------------
[ 1563.359005] WARNING: CPU: 14 PID: 2930 at 0xffffffff9f231766
Then with transaction abort, and obviously failed to balance the fs.
[CAUSE]
That mentioned inline ref type 182 is completely sane, it's
BTRFS_SHARED_BLOCK_REF_KEY, it's some extra check making kernel to
believe it's invalid.
Commit 64ecdb647d ("Btrfs: add one more sanity check for shared ref
type") introduced extra checks for backref type.
One of the requirement is, parent bytenr must be aligned to node size,
which is not correct.
One example is like this:
0 1G 1G+4K 2G 2G+4K
| |///////////////////|//| <- A chunk starts at 1G+4K
| | <- A tree block get reserved at bytenr 1G+4K
Then we have a valid tree block at bytenr 1G+4K, but not aligned to
nodesize (16K).
Such chunk is not ideal, but current kernel can handle it pretty well.
We may warn about such tree block in the future, but should not reject
them.
[FIX]
Change the alignment requirement from node size alignment to sector size
alignment.
Also, to make our lives a little easier, also output @iref when
btrfs_get_extent_inline_ref_type() failed, so we can locate the item
easier.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205475
Fixes: 64ecdb647d ("Btrfs: add one more sanity check for shared ref type")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update comments and messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay reported a lockdep splat in generic/476 that I could reproduce
with btrfs/187.
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc2+ #1 Tainted: G W
------------------------------------------------------
kswapd0/100 is trying to acquire lock:
ffff9e8ef38b6268 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
but task is already holding lock:
ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x65/0x80
slab_pre_alloc_hook.constprop.0+0x20/0x200
kmem_cache_alloc_trace+0x3a/0x1a0
btrfs_alloc_device+0x43/0x210
add_missing_dev+0x20/0x90
read_one_chunk+0x301/0x430
btrfs_read_sys_array+0x17b/0x1b0
open_ctree+0xa62/0x1896
btrfs_mount_root.cold+0x12/0xea
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x10d/0x379
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
path_mount+0x434/0xc00
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
btrfs_chunk_alloc+0x125/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_lookup_inode+0x2a/0x8f
__btrfs_update_delayed_inode+0x80/0x240
btrfs_commit_inode_delayed_inode+0x119/0x120
btrfs_evict_inode+0x357/0x500
evict+0xcf/0x1f0
vfs_rmdir.part.0+0x149/0x160
do_rmdir+0x136/0x1a0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
__lock_acquire+0x1184/0x1fa0
lock_acquire+0xa4/0x3d0
__mutex_lock+0x7e/0x7e0
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
kthread+0x138/0x160
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(&fs_info->chunk_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/100:
#0: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffffa9d65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
#2: ffff9e8e9da260e0 (&type->s_umount_key#48){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
stack backtrace:
CPU: 1 PID: 100 Comm: kswapd0 Tainted: G W 5.9.0-rc2+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x12d/0x150
__lock_acquire+0x1184/0x1fa0
lock_acquire+0xa4/0x3d0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
__mutex_lock+0x7e/0x7e0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? lock_acquire+0xa4/0x3d0
? btrfs_evict_inode+0x11e/0x500
? find_held_lock+0x2b/0x80
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
? _raw_spin_unlock_irqrestore+0x46/0x60
? add_wait_queue_exclusive+0x70/0x70
? balance_pgdat+0x670/0x670
kthread+0x138/0x160
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
This is because we are holding the chunk_mutex when we call
btrfs_alloc_device, which does a GFP_KERNEL allocation. We don't want
to switch that to a GFP_NOFS lock because this is the only place where
it matters. So instead use memalloc_nofs_save() around the allocation
in order to avoid the lockdep splat.
Reported-by: Nikolay Borisov <nborisov@suse.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
RHBZ: 1871246
If during cifs_lookup()/get_inode_info() we encounter a DFS link
and we use the cifsacl or modefromsid mount options we must suppress
any -EREMOTE errors that triggers or else we will not be able to follow
the DFS link and automount the target.
This fixes an issue with modefromsid/cifsacl where these mountoptions
would break DFS and we would no longer be able to access the share.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
With the recent rework of the inode cluster flushing, we no longer
ever wait on the the inode flush "lock". It was never a lock in the
first place, just a completion to allow callers to wait for inode IO
to complete. We now never wait for flush completion as all inode
flushing is non-blocking. Hence we can get rid of all the iflock
infrastructure and instead just set and check a state flag.
Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
xfs_iflock_nowait() test-and-set operations on that flag, and
replace all the xfs_ifunlock() calls to clear operations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove kmem_realloc() function and convert its users to use MM API
directly (krealloc())
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9U/MMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpg4BEAC6TQ6ctc1yNTTWwiz4UrdIKMAWP7S7wepu
k00A9+JjLLJBVdkz9rZ2Y/SyGe12qBM+riiRSn/gNTbkd4qq2rCY2d8U1vXXKyP5
VDXPo12zsUD5WpqdEMfXUB4+DOQs0a3DAyDLT0+K/Qw0rpOQVZA26Ovn6GOh9+Kq
KJCllynYkwUQ/7CXsdI13ktMI1HADFOx3149SlGkbuPOggwNQrGiLJAxyUOn/i+E
uiHy8b8o3B2nun61+Y98q+MDISLf0xXYEbHeAsvETEy52ya2iadnMo1lwS5zHzM+
p6jOBybHaM/wz5t1V44VTBvfog1KAUtp8K0gsxcB6Ezf7LhYfTVjtvfZqpwVz1sl
txkxhnGEBURHpdr0aCtC15cIpbDGM85ymjP4RD0YlV7oyT0+Ufx7r9jJjLf7IhZO
FMyEFmVwSPD/NdJE9ZNx0I2v/qdIzYwfeAix4Z4bXoe51BtPrf1uBgsGWVuqNVz/
dKVf1vK1tPF8PgIiIW/o8GI4iF3RRVQcLwJGiAWMzBS11iniJLf9mUPRVl5Bxpeb
YRd2chm+ppHC/IgtK44x7Ce6415hnbKehE2KUr43PHB7nIMNWcRsurxLizM7ZGei
Gv2/K9PM3+4O/b1k20xLakz4Vw3Isk4W6/Flj9LoxheBo+CUG4Rsx5UyzFEB6apA
uXxiF6ORLg==
=v5QQ
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.9-2020-09-06' of git://git.kernel.dk/linux-block
Pull more io_uring fixes from Jens Axboe:
"Two followup fixes. One is fixing a regression from this merge window,
the other is two commits fixing cancelation of deferred requests.
Both have gone through full testing, and both spawned a few new
regression test additions to liburing.
- Don't play games with const, properly store the output iovec and
assign it as needed.
- Deferred request cancelation fix (Pavel)"
* tag 'io_uring-5.9-2020-09-06' of git://git.kernel.dk/linux-block:
io_uring: fix linked deferred ->files cancellation
io_uring: fix cancel of deferred reqs with ->files
io_uring: fix explicit async read/write mapping for large segments
While looking for ->files in ->defer_list, consider that requests there
may actually be links.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While trying to cancel requests with ->files, it also should look for
requests in ->defer_list, otherwise it might end up hanging a thread.
Cancel all requests in ->defer_list up to the last request there with
matching ->files, that's needed to follow drain ordering semantics.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Fix a broken metadata verifier that would incorrectly validate attr
fork extents of a realtime file against the realtime volume.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl9RDTMACgkQ+H93GTRK
tOseqA/8DTnwKRKuTXc6UboIXbgMwjGTJBv7xTD3KIDF3ls+Gmp2/RWHs7lfGRVY
LVdvlgAwDLEe1al8aLaibhrsRCYX8MLvMp8dX6xoqMWCY0KF80VCpXXy+FXbwBmZ
GQRqrThyjepseERJj15UC249p9ztBj4DAHQD0r2XD7JMKHBHo6cQ3eYY2NTzgqc3
q0kYcP5xPZZ4R6UFUNkJpkeV8PKpkzWTXyMod0+e8h4njZoRAmHimj7Id9DMQ6HB
ciHjZjNCd04Nu6JIpNBwlQE4epCPh79zX8hTQZf5nGNAh13CB9Wc2nHDvwCFBLxw
HUdU5BpxMNWGujJ+T5X+RVbzA4VXXaL3iypag9EjXadg+vOV6XmPQv6Fa3WqkfBT
GjEXB9+TG9ZuyxAObjP6yn1PRvob0l0iccAbtfnX5bE/mwqx1aDxN1QTfJrjG2Fv
2EiUnqI+jxro66HdS5QU+W8ko7dG0tiQPMF3DCv7nfb3ZxEgfXiDrgbHyYbzdLJq
pi5LqXBAfRHkDwRUzRU6G6pT7mplW31iTwh2AuiQogXnpPTWylb0XsUEVfYKpK9q
Z0HzwRfHmwQSkIEDZ0rZxP3wH5ssniHyLdXh5FtfFcPSBTyDvflMqFpLmYmvDV/6
MPfwCQAnMTrGv2GRJ8hbFiWd987bYTTfOKwJBEsFKDH01ZKYcV8=
=5U8o
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fix from Darrick Wong:
"Fix a broken metadata verifier that would incorrectly validate attr
fork extents of a realtime file against the realtime volume"
* tag 'xfs-5.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix xfs_bmap_validate_extent_raw when checking attr fork of rt files
When running in a dax mode, if the user maps a page with MAP_PRIVATE and
PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime
when the user hits a COW fault.
This breaks building of the Linux kernel. How to reproduce:
1. extract the Linux kernel tree on dax-mounted xfs filesystem
2. run make clean
3. run make -j12
4. run make -j12
at step 4, make would incorrectly rebuild the whole kernel (although it
was already built in step 3).
The reason for the breakage is that almost all object files depend on
objtool. When we run objtool, it takes COW page fault on its .data
section, and these faults will incorrectly update the timestamp of the
objtool binary. The updated timestamp causes make to rebuild the whole
tree.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When running in a dax mode, if the user maps a page with MAP_PRIVATE and
PROT_WRITE, the ext2 filesystem would incorrectly update ctime and mtime
when the user hits a COW fault.
This breaks building of the Linux kernel. How to reproduce:
1. extract the Linux kernel tree on dax-mounted ext2 filesystem
2. run make clean
3. run make -j12
4. run make -j12
at step 4, make would incorrectly rebuild the whole kernel (although it
was already built in step 3).
The reason for the breakage is that almost all object files depend on
objtool. When we run objtool, it takes COW page fault on its .data
section, and these faults will incorrectly update the timestamp of the
objtool binary. The updated timestamp causes make to rebuild the whole
tree.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we exceed UIO_FASTIOV, we don't handle the transition correctly
between an allocated vec for requests that are queued with IOSQE_ASYNC.
Store the iovec appropriately and re-set it in the iter iov in case
it changed.
Fixes: ff6165b2d7 ("io_uring: retain iov_iter state over io_read/io_write calls")
Reported-by: Nick Hill <nick@nickhill.org>
Tested-by: Norman Maurer <norman.maurer@googlemail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a write delegation isn't available, the Linux NFS client uses
a zero-stateid when performing a SETATTR.
NFSv4.0 provides no mechanism for an NFS server to match such a
request to a particular client. It recalls all delegations for that
file, even delegations held by the client issuing the request. If
that client happens to hold a read delegation, the server will
recall it immediately, resulting in an NFS4ERR_DELAY/CB_RECALL/
DELEGRETURN sequence.
Optimize out this pipeline bubble by having the client return any
delegations it may hold on a file before it issues a
SETATTR(zero-stateid) on that file.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.
Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9SWN8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvtbD/4yI5dkopv6E2RHVuupFWmGlGoxLhPecnAZ
UHbKU+LA/tzWWMA7gZuwzzDEK1QWT/KmctpGTI22SXUKQCtjGzO/qnRMfyJ34TdQ
l4leYdw/QzUOHZG7dKVYUACHiaSxzQSallNuX1I9eM084KSXH3DgUkrwMLoew/8n
WJHKN+oRhcppnLVDekaLXbZEI9idTnY+gs/Dg8TNsxNSeO6y51OOlKltaNfL+npQ
dwlgMoolBYWHFozqgVyzIV7sU7fQ9QGppwBIfqBb1jEe9JU2ZymtlcDgfxUVpKcg
W8/PCoVT60AGiMdjV0EBoQO09r+nvwAcRQUSlWJU7Dn/pcZmFoaJkyse+SnD0Dac
cLTKhnhgMJSI4Zt3yQidFSNhz0Ouw15J8k7OTftn81zhtkHzPBgGnA7R6b7UUQsZ
5lJvlZh5aFPNBFp9A0do5+f5/lUMhHkxDpFVmZo+ywPtoNHJeDL2+jzzFawJ8kqv
IoFvVL8hl4DzqN+vShsJ40jH93+oITF/Jlq6kY8ILKtu42i5qAxpP0wUwycrN6Pz
/YNTKPveCoPU7zaFDvMfbc7U56Ke6ma+lmtTn6q6JOWFvUAYh7SUY4JGzEMpxfxK
QVyFMwXnCKhB66ZypJIFdbT4zqkTXmhxvu/Oz5txDv/uoytqT1o+zLHb3USi4Lw8
89NyvBc0aQ==
=NLOn
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.9-2020-09-04' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- EAGAIN with O_NONBLOCK retry fix
- Two small fixes for registered files (Jiufei)
* tag 'io_uring-5.9-2020-09-04' of git://git.kernel.dk/linux-block:
io_uring: no read/write-retry on -EAGAIN error and O_NONBLOCK marked file
io_uring: set table->files[i] to NULL when io_sqe_file_register failed
io_uring: fix removing the wrong file in __io_sqe_files_update()
The '#ifdef MODULE' check in the original commit does not work as intended.
The code under the check is not built at all if CONFIG_DEBUG_FS=y. Fix this
by using a correct check.
Fixes: 275678e7a9 ("debugfs: Check module state before warning in {full/open}_proxy_open()")
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200811150129.53343-1-vdronov@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The copy_mount_options() function takes a user pointer argument but no
size and it tries to read up to a PAGE_SIZE. However, copy_from_user()
is not guaranteed to return all the accessible bytes if, for example,
the access crosses a page boundary and gets a fault on the second page.
To work around this, the current copy_mount_options() implementation
performs two copy_from_user() passes, first to the end of the current
page and the second to what's left in the subsequent page.
On arm64 with MTE enabled, access to a user page may trigger a fault
after part of the buffer in a page has been copied (when the user
pointer tag, bits 56-59, no longer matches the allocation tag stored in
memory). Allow copy_mount_options() to handle such intra-page faults by
resorting to byte at a time copy in case of copy_from_user() failure.
Note that copy_from_user() handles the zeroing of the kernel buffer in
case of error.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
To enable tagging on a memory range, the user must explicitly opt in via
a new PROT_MTE flag passed to mmap() or mprotect(). Since this is a new
memory type in the AttrIndx field of a pte, simplify the or'ing of these
bits over the protection_map[] attributes by making MT_NORMAL index 0.
There are two conditions for arch_vm_get_page_prot() to return the
MT_NORMAL_TAGGED memory type: (1) the user requested it via PROT_MTE,
registered as VM_MTE in the vm_flags, and (2) the vma supports MTE,
decided during the mmap() call (only) and registered as VM_MTE_ALLOWED.
arch_calc_vm_prot_bits() is responsible for registering the user request
as VM_MTE. The newly introduced arch_calc_vm_flag_bits() sets
VM_MTE_ALLOWED if the mapping is MAP_ANONYMOUS. An MTE-capable
filesystem (RAM-based) may be able to set VM_MTE_ALLOWED during its
mmap() file ops call.
In addition, update VM_DATA_DEFAULT_FLAGS to allow mprotect(PROT_MTE) on
stack or brk area.
The Linux mmap() syscall currently ignores unknown PROT_* flags. In the
presence of MTE, an mmap(PROT_MTE) on a file which does not support MTE
will not report an error and the memory will not be mapped as Normal
Tagged. For consistency, mprotect(PROT_MTE) will not report an error
either if the memory range does not support MTE. Two subsequent patches
in the series will propose tightening of this behaviour.
Co-developed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
For arm64 MTE support it is necessary to be able to mark pages that
contain user space visible tags that will need to be saved/restored e.g.
when swapped out.
To support this add a new arch specific flag (PG_arch_2). This flag is
only available on 64-bit architectures due to the limited number of
spare page flags on the 32-bit ones.
Signed-off-by: Steven Price <steven.price@arm.com>
[catalin.marinas@arm.com: use CONFIG_64BIT for guarding this new flag]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
As stated in https://sourceforge.net/projects/fuse/, "the FUSE project has
moved to https://github.com/libfuse/" in 22-Dec-2015. Update URLs to
reflect this.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull networking fixes from David Miller:
1) Use netif_rx_ni() when necessary in batman-adv stack, from Jussi
Kivilinna.
2) Fix loss of RTT samples in rxrpc, from David Howells.
3) Memory leak in hns_nic_dev_probe(), from Dignhao Liu.
4) ravb module cannot be unloaded, fix from Yuusuke Ashizuka.
5) We disable BH for too lokng in sctp_get_port_local(), add a
cond_resched() here as well, from Xin Long.
6) Fix memory leak in st95hf_in_send_cmd, from Dinghao Liu.
7) Out of bound access in bpf_raw_tp_link_fill_link_info(), from
Yonghong Song.
8) Missing of_node_put() in mt7530 DSA driver, from Sumera
Priyadarsini.
9) Fix crash in bnxt_fw_reset_task(), from Michael Chan.
10) Fix geneve tunnel checksumming bug in hns3, from Yi Li.
11) Memory leak in rxkad_verify_response, from Dinghao Liu.
12) In tipc, don't use smp_processor_id() in preemptible context. From
Tuong Lien.
13) Fix signedness issue in mlx4 memory allocation, from Shung-Hsi Yu.
14) Missing clk_disable_prepare() in gemini driver, from Dan Carpenter.
15) Fix ABI mismatch between driver and firmware in nfp, from Louis
Peens.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (110 commits)
net/smc: fix sock refcounting in case of termination
net/smc: reset sndbuf_desc if freed
net/smc: set rx_off for SMCR explicitly
net/smc: fix toleration of fake add_link messages
tg3: Fix soft lockup when tg3_reset_task() fails.
doc: net: dsa: Fix typo in config code sample
net: dp83867: Fix WoL SecureOn password
nfp: flower: fix ABI mismatch between driver and firmware
tipc: fix shutdown() of connectionless socket
ipv6: Fix sysctl max for fib_multipath_hash_policy
drivers/net/wan/hdlc: Change the default of hard_header_len to 0
net: gemini: Fix another missing clk_disable_unprepare() in probe
net: bcmgenet: fix mask check in bcmgenet_validate_flow()
amd-xgbe: Add support for new port mode
net: usb: dm9601: Add USB ID of Keenetic Plus DSL
vhost: fix typo in error message
net: ethernet: mlx4: Fix memory allocation in mlx4_buddy_init()
pktgen: fix error message with wrong function name
net: ethernet: ti: am65-cpsw: fix rmii 100Mbit link mode
cxgb4: fix thermal zone device registration
...
This will allow proc files to implement iter read semantics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of providing a special no-compat version provide a special
compat version for operations with ->compat_ioctl.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9Qx/MACgkQxWXV+ddt
WDvT8g//bO9+xB3dRYLVl5FWur826Zz/dmzpp+A3QCI0er1+zh8/3FHD6dvGyBQu
CHQE1GFTdewry1usuGu08WopGiywnCa1svmp+qCJDugiDAG/ciTq2zQWemV2gRXU
1/9yZ62B3eBKuiL+lFNPYFzqWhsX0zP21nxHEkTJi2d8enJDNcmrrWjBrVHUVa1o
D0al0zPvaXpcknadhsNSkJnvgMeTcSz89AEIEC5OVfitbaIX09UAadG9kjPLlGJN
KNL+/vmAARNe1ze5qjTf3++E1LH28CdJ7rrlhAqyR4L4rApCQw8ZQv3h8EbnidSy
VJgDQCk35Z6IxVwbUKYEGiperV2dnsZHukzPT/sqjGrHO0Ct+Fy+YeK+4aw9yaof
mWG7RGeczj9t/FbKEGmELlEOHKmJw+PHi5byTokPfsA8smrfaPKNJKlb5bbjALni
hOThGiYr18n3RrSidpZ0GVxRcl8j3OwHSVDS/9U3eAF7Z31QtTDoX0MHbK6PrbvW
KvGkqIwyLknshS868lyAzXSdg6K4wYqvuh2Z/TuFFPV2MGbsXxHEFgM7M1wueV6x
iLhqUDYn2ki79FqqCYsUqWzDMc8m/4v//T4KPTj+BN9H8ij9bdzTXpaHrZAqLX0Y
ielkB51hdAlCML/Ctd5xojn5YOXgo5TkSC32QOHvxovSI1qJCIo=
=lVnf
-----END PGP SIGNATURE-----
Merge tag 'affs-for-5.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull affs fix from David Sterba:
"One fix to make permissions work the same way as on AmigaOS"
* tag 'affs-for-5.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
affs: fix basic permission bits to actually work
The realtime flag only applies to the data fork, so don't use the
realtime block number checks on the attr fork of a realtime file.
Fixes: 30b0984d91 ("xfs: refactor bmap record validation")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
- Avoid a log recovery failure for an insert range operation by rolling
deferred ops incrementally instead of at the end.
- Fix an off-by-one error when calculating log space reservations for
anything involving an inode allocation or free.
- Fix a broken shortform xattr verifier.
- Ensure that the shortform xattr header padding is always initialized
to zero.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl9H1PkACgkQ+H93GTRK
tOsfXA//QYUtx6BccgiXGJCJ7dkOJwM+2aKS1kvadfJlN6hZh7x6jzi424qsBHbH
jfNGcGjlrvFp3MAex3w+LeQFEK9gX47mKXGVR8MakWjo/khuxNoqpx4rzFI+wdMS
Sh51f9ovbsZIRrKHPv9YrxBUhO3yzbWbz2bbWYAV1/kN4jxdzKNByVrhIn97yFsg
NOj9uh1R3UsD1IKez9FeMKwYt/6jXZKs/LzroBKGxcBvdEu99X4vpD9WBqghe0Ye
e57rf+zb5b5M0ydOp9mEme8wogQcLKmMPgNaZXzduPuhx8GtsqZ4gTKoIpRSBCWA
z5oVerIr2jMN4+OMOaBWrEI34kRFUh5KG0Z0U4uOLavm04//KfQ3zUWAyea20Gmg
rkq4me7DJk0cVmWlJkB9X4LbR4OjeS7ANqvM94AmcYVHzxw76O/UhgMYXBgBin2J
4egsiXVi+Pii5Syw7kU9dTxWETP9mlN+2yO1CQdW0HgFnCJ3ARng0hdJBftQEIVo
Rec/YmpE9Mi7QSRjfp+I+30ZPNMl3lHMIfAVEj2Nhn32do5l6h9ZN4XP5OSQyqa6
yoNAi2oB1lqPaiV1/N2hRZwbxIexzeAZZSkGKGpiF/g9u5nwDTGRxVOmc/vTSGVN
7swYctIl9HuzIAuLGU3d8yCj6FnWvzNM6JGM0GDPi8ZFBqezhY4=
=9IZj
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.9-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Various small corruption fixes that have come in during the past
month:
- Avoid a log recovery failure for an insert range operation by
rolling deferred ops incrementally instead of at the end.
- Fix an off-by-one error when calculating log space reservations for
anything involving an inode allocation or free.
- Fix a broken shortform xattr verifier.
- Ensure that the shortform xattr header padding is always
initialized to zero"
* tag 'xfs-5.9-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: initialize the shortform attr header padding entry
xfs: fix boundary test in xfs_attr_shortform_verify
xfs: fix off-by-one in inode alloc block reservation calculation
xfs: finish dfops on every insert range shift iteration
Pull epoll fixup from Al Viro:
"Fixup for epoll regression; there's a better solution longer term, but
this is the least intrusive fix"
* 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix regression in "epoll: Keep a reference on files added to the check list"
Actually two things that need fixing up here:
- The io_rw_reissue() -EAGAIN retry is explicit to block devices and
regular files, so don't ever attempt to do that on other types of
files.
- If we hit -EAGAIN on a nonblock marked file, don't arm poll handler for
it. It should just complete with -EAGAIN.
Cc: stable@vger.kernel.org
Reported-by: Norman Maurer <norman.maurer@googlemail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
epoll_loop_check_proc() can run into a file already committed to destruction;
we can't grab a reference on those and don't need to add them to the set for
reverse path check anyway.
Tested-by: Marc Zyngier <maz@kernel.org>
Fixes: a9ed4a6560 ("epoll: Keep a reference on files added to the check list")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
While io_sqe_file_register() failed in __io_sqe_files_update(),
table->files[i] still point to the original file which may freed
soon, and that will trigger use-after-free problems.
Cc: stable@vger.kernel.org
Fixes: f3bd9dae37 ("io_uring: fix memleak in __io_sqe_files_update()")
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the now unused helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
revalidate_disk is a relative awkward helper for driver use, as it first
calls an optional driver method and then updates the block device size,
while most callers either don't need the method call at all, or want to
keep state between the caller and the called method.
Add a revalidate_disk_size helper that just performs the update of the
block device size from the gendisk one, and switch all drivers that do
not implement ->revalidate_disk to use the new helper instead of
revalidate_disk()
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bd_invalidate with a new BDEV_NEED_PART_SCAN flag in a bd_flags
variable to better describe the condition.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bd_invalidated is set by check_disk_change or in add_disk to initiate a
partition scan. Move it from check_disk_size_change which is called
from both revalidate_disk() and bdev_disk_changed() to only the latter,
as that is what is called from the block device open code (and nbd) to
deal with the bd_invalidated event. revalidate_disk() on the other hand
is mostly used to propagate a size update from the gendisk to the block
device, which is entirely unrelated.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ovl_can_list() should return false for overlay private xattrs. Since
currently these use the "trusted.overlay." prefix, they will always match
the "trusted." prefix as well, hence the test for being non-trusted will
not trigger.
Prepare for using the "user.overlay." namespace by moving the test for
private xattr before the test for non-trusted.
This patch doesn't change behavior.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Instead of passing the xattr name down to the ovl_do_*xattr() accessor
functions, pass an enumerated value. The enum can use the same names as
the the previous #define for each xattr name.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Call ovl_do_*xattr() when accessing an overlay private xattr, vfs_*xattr()
otherwise.
This has an effect on debug output, which is made more consistent by this
patch.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use the convention of calling ovl_do_foo() for operations which are overlay
specific.
This patch is a no-op, and will have significance for supporting
"user.overlay." xattr namespace.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is a partial revert (with some cleanups) of commit 993a0b2aec ("ovl:
Do not lose security.capability xattr over metadata file copy-up"), which
introduced ovl_getxattr() in the first place.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Lose the padding and the failure message (in line with other parts of the
copy up process). Return zero for both nonexistent or empty xattr.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_getattr() returns the value of an xattr in a kmalloced buffer. There
are two callers:
ovl_copy_up_meta_inode_data() (copy_up.c)
ovl_get_redirect_xattr() (util.c)
This patch just copies ovl_getxattr() to copy_up.c, the following patches
will deal with the differences in idividual callers.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Container folks are complaining that dnf/yum issues too many sync while
installing packages and this slows down the image build. Build requirement
is such that they don't care if a node goes down while build was still
going on. In that case, they will simply throw away unfinished layer and
start new build. So they don't care about syncing intermediate state to the
disk and hence don't want to pay the price associated with sync.
So they are asking for mount options where they can disable sync on overlay
mount point.
They primarily seem to have two use cases.
- For building images, they will mount overlay with nosync and then sync
upper layer after unmounting overlay and reuse upper as lower for next
layer.
- For running containers, they don't seem to care about syncing upper layer
because if node goes down, they will simply throw away upper layer and
create a fresh one.
So this patch provides a mount option "volatile" which disables all forms
of sync. Now it is caller's responsibility to throw away upper if system
crashes or shuts down and start fresh.
With "volatile", I am seeing roughly 20% speed up in my VM where I am just
installing emacs in an image. Installation time drops from 31 seconds to 25
seconds when nosync option is used. This is for the case of building on top
of an image where all packages are already cached. That way I take out the
network operations latency out of the measurement.
Giuseppe is also looking to cut down on number of iops done on the disk. He
is complaining that often in cloud their VMs are throttled if they cross
the limit. This option can help them where they reduce number of iops (by
cutting down on frequent sync and writebacks).
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
An incompatible feature is marked by a non-empty directory nested
2 levels deep under "work" dir, e.g.:
workdir/work/incompat/volatile.
This commit checks for marked incompat features, warns about them
and fails to mount the overlay, for example:
overlayfs: overlay with incompat feature 'volatile' cannot be mounted
Very old kernels (i.e. v3.18) will fail to remove a non-empty "work"
dir and fail the mount. Newer kernels will fail to remove a "work"
dir with entries nested 3 levels and fall back to read-only mount.
User mounting with old kernel will see a warning like these in dmesg:
overlayfs: cleanup of 'incompat/...' failed (-39)
overlayfs: cleanup of 'work/incompat' failed (-39)
overlayfs: cleanup of 'ovl-work/work' failed (-39)
overlayfs: failed to create directory /vdf/ovl-work/work (errno: 17);
mounting read-only
These warnings should give the hint to the user that:
1. mount failure is caused by backward incompatible features
2. mount failure can be resolved by manually removing the "work" directory
There is nothing preventing users on old kernels from manually removing
workdir entirely or mounting overlay with a new workdir, so this is in
no way a full proof backward compatibility enforcement, but only a best
effort.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9NWskACgkQxWXV+ddt
WDsO5A//Xcjo45Th0T0mTiPXWpCPexOmQd3yvEezME5TVavEKM9hkfxYrv5d5i/k
dfFhPg8FYSasW7Sie1PdcP0Hu9mlla5G/4L1pQYMewCdJCSx0mt/dDtEh4GPwcmI
1MgyQ0pnKufNBD9SquO4yGZHR+lP+wY/FTZ6yCePJRvNEvtOWtcxH3vhTrbKC0sc
wSO2xL8Ht3VQBDA6YR+A/OMY8t6k2aN0qOk1BJWcMil0HmNw/rZBH/bWvPOn0dMQ
vaXQIUNRIUUD5qS7vcv5rgVDDWMnyM41eC6idckg1A2tpLGYZyFO3fv2c/VNKFaZ
dWQGeSbg0wRMnQhHhDmAZZwRfTsOhGuSPADnh4qQ07klDto6s5EbmrwkWmGNC6QY
B9yGF09mNwcHvgjz6RVv3mA/dINP32Qg3s90TtO2b7Rrolg1sQLRYDewuTUukmvH
jIg8q1oZBppHS8y5B4Vxl+18Swz4j66Kurq9jhU669n0CUqYBiqg5dEQzDHm54Ca
uEujNWfUGzfiFbj3uVtMG0i9XSHlwEtiGOCkXErlEqAFRSzPQsoFJQ0ae3Hl9FiI
PJuTVmS1UekhCB/ubewf5dOHbEk/nRKbIHnXbuJTQcWWzGtzzyBv4uJH3arQcMrC
QdiBpbqR1BSQ6ki4GbnG9f/4mvnLffsfPDAh+dt9VGxEru/9Q/c=
=cdvW
-----END PGP SIGNATURE-----
Merge tag 'for-5.9-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two small fixes and a bunch of lockdep fixes for warnings that show up
with an upcoming tree locking update but are valid with current locks
as well"
* tag 'for-5.9-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: fix the error message for transid error
btrfs: set the lockdep class for log tree extent buffers
btrfs: set the correct lockdep class for new nodes
btrfs: allocate scrub workqueues outside of locks
btrfs: fix potential deadlock in the search ioctl
btrfs: drop path before adding new uuid tree entry
btrfs: block-group: fix free-space bitmap threshold
devcgroup_inode_permission is never called for the recusive case, so
move it out into blkdev_get.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Two different callers use two different mutexes for updating the
block device size, which obviously doesn't help to actually protect
against concurrent updates from the different callers. In addition
one of the locks, bd_mutex is rather prone to deadlocks with other
parts of the block stack that use it for high level synchronization.
Switch to using a new spinlock protecting just the size updates, as
that is all we need, and make sure everyone does the update through
the proper helper.
This fixes a bug reported with the nvme revalidating disks during a
hot removal operation, which can currently deadlock on bd_mutex.
Reported-by: Xianting Tian <xianting_tian@126.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bd_set_size with a version that takes the number of sectors
instead, as that fits most of the current and future callers much better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Index here is already the position of the file in fixed_file_table, we
should not use io_file_from_index() again to get it. Otherwise, the
wrong file which still in use may be released unexpectedly.
Cc: stable@vger.kernel.org # v5.6
Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The basic permission bits (protection bits in AmigaOS) have been broken
in Linux' AFFS - it would only set bits, but never delete them.
Also, contrary to the documentation, the Archived bit was not handled.
Let's fix this for good, and set the bits such that Linux and classic
AmigaOS can coexist in the most peaceful manner.
Also, update the documentation to represent the current state of things.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Max Staudt <max@enpas.org>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl9Lw74ACgkQiiy9cAdy
T1HOMwv/WwqctX4SN4kA97C4HQFJwan5kPf1bBYdp3zEm45umxkZRKI7i8NN+4Z7
a7m3n9Kwm5CP0pHICJ6PLhYNs5J9ZSEx89J2GOmyl1SIbjNUHKGftrf75BCMceGT
6dcEoMLAFw8Z9D39n1mkLa09IOI7XAlHt48VUis2qnLIZc1WDA5wzZ8dW+EqSFWX
itg/P8I/4QUWf+IzXw3Hj9WiiIJMVkaIkz7lccrS8VzQD2KYNyDPl9+xNr8Q54Uu
n0sTiHXQenPrH+tubrKrdQ1b9OwxL41kfCeE0PfC/BatSJ5rBk4x+zw5EWvfM6Mz
y/llDLqtShfycKNGOChfrA2Dv3VvH7P0TDYu/Nl0x09KbZLRiswJ1iGn0WAMI8gG
0POaEWKHLkBGesItE9vMi7RZEb1wB8z6pFgwr6xadHx1RIWztv80rIcUF+qywJvZ
paVIPCFWyuahbbzWltxCmCLGLLn3j+Qm57md8PtLdSu5vOJu8kF35F6xP+DxHF1n
E/WgF59O
=0PxR
-----END PGP SIGNATURE-----
Merge tag '5.9-rc2-smb-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull cfis fix from Steve French:
"DFS fix for referral problem when using SMB1"
* tag '5.9-rc2-smb-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix check of tcon dfs in smb1
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9JYEgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpo5zD/4wcNe7gZZyRZWetSWBrdCIoWxN2yu1AsQu
DYTvmQLHxnqKKMvFX5PbKKfCi0W6igcqUE+j4U8AabL9xitd+t42v9XhYz8gsozF
4JjDr7Xvmk+eQLTpC1TZde2A823MeI+qXxjCzfEIw/SRcTIBE+VUJ4eSrRs+X2SE
QSpAObb4oqxOZC3Ja0JPbr5tuj31NP4zyJZTnysn5j8P26QR+WAca9fJoSUkt5UC
Kyew9IhsCad1s2v72GyFu6c7WLQ1BAi/x4P3QPZ6uG7ExJ/gNPJhe/onkkuJw4to
ggLxsY1cTwEHkHPYZ1Oh+C6JlE9rYBnWLMfPMdVjzYCTDOUONfty6Pjh+Bzz40xB
MPkx0IdngaL0xOtJOAf7Gk2YqMziJuc/yENJ6H8O4xEUKmB5vbhuSecejfHZ67wj
p7j4lih/sAtirORLFir0FyVZPqRpYbHnC7F9PgC75yB+UyTlGjpoOLlBaqK/iViK
agqZAWlw2EmsvCSt1mAWADItzjTuolbL9aORSSvX/dvMVCPJMDFs8eUi+aYKYvxA
e/P14PvCYTfp5TDDBg6H1RqCu9r1PDL1RsOIggmAfW0X+mM88vz7fTtzz21cHPMV
5p8x89JrCuufeLp5PiRSI/d+UJJgkWNr05B2Rf8bcSbkugC9zoYGrZfenZDQzAI0
Uuj5l1pHlw==
=nSwl
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.9-2020-08-28' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A few fixes in here, all based on reports and test cases from folks
using it. Most of it is stable material as well:
- Hashed work cancelation fix (Pavel)
- poll wakeup signalfd fix
- memlock accounting fix
- nonblocking poll retry fix
- ensure we never return -ERESTARTSYS for reads
- ensure offset == -1 is consistent with preadv2() as documented
- IOPOLL -EAGAIN handling fixes
- remove useless task_work bounce for block based -EAGAIN retry"
* tag 'io_uring-5.9-2020-08-28' of git://git.kernel.dk/linux-block:
io_uring: don't bounce block based -EAGAIN retry off task_work
io_uring: fix IOPOLL -EAGAIN retries
io_uring: clear req->result on IOPOLL re-issue
io_uring: make offset == -1 consistent with preadv2/pwritev2
io_uring: ensure read requests go through -ERESTART* transformation
io_uring: don't use poll handler if file can't be nonblocking read/written
io_uring: fix imbalanced sqo_mm accounting
io_uring: revert consumed iov_iter bytes on error
io-wq: fix hang after cancelling pending hashed work
io_uring: don't recurse on tsk->sighand->siglock with signalfd
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl9JG9wACgkQnJ2qBz9k
QNlp3ggA3B/Xopb2X3cCpf2fFw63YGJU4i0XJxi+3fC/v6m8U+D4XbqJUjaM5TZz
+4XABQf7OHvSwDezc3n6KXXD/zbkZCeVm9aohEXvfMYLyKbs+S7QNQALHEtpfBUU
3IY2pQ90K7JT9cD9pJls/Y/EaA1ObWP7+3F1zpw8OutGchKcE8SvVjzL3SSJaj7k
d8OTtMosAFuTe4saFWfsf9CmZzbx4sZw3VAzXEXAArrxsmqFKIcY8dI8TQ0WaYNh
C3wQFvW+n9wHapylyi7RhGl2QH9Tj8POfnCTahNFFJbsmJBx0Z3r42mCBAk4janG
FW+uDdH5V780bTNNVUKz0v4C/YDiKg==
=jQnW
-----END PGP SIGNATURE-----
Merge tag 'writeback_for_v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull writeback fixes from Jan Kara:
"Fixes for writeback code occasionally skipping writeback of some
inodes or livelocking sync(2)"
* tag 'writeback_for_v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
writeback: Drop I_DIRTY_TIME_EXPIRE
writeback: Fix sync livelock due to b_dirty_time processing
writeback: Avoid skipping inode writeback
writeback: Protect inode->i_io_list with inode->i_lock
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl9JBOAUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqs3w//TpQfBrdl64/W2J6D3fJf4yWfGDYr
NBIeSV4/Tlf896p5lMGqwvvDLBKjV9KV2mJcepdfeCQDzCnt6oa70gJ3nQl27LWo
OEyD9vbT0hmjFddhVTljMXPSrIOGaEA4pg4ikHi0lQT8t/Mzip9FaOq/NwPH7ZJa
ARsjiWUB/qq3DwxAHOwrrSnlYQoqJT1mTeLUYVsPswX4e0DxJiioII+e2oMKSScY
xUvK4TYo0PDz8176n2sK1vRw+l4zqBTYLGi72wWk3U7awzDJHIRGW80sEYvcRX4n
1TUDWPpwumqBhl0a8o4VBUxGTCeGAyLiIMs9TofVHGpO+yBlwa3V4Ubzrf3mqWe5
0s7sOfwpqjgor+mVzuFAXtm11kM66pEbrrNK6BB+yVmfZbfz7FY4bNy22HDG7Fyv
29R2R3iY19NehEE78wwy3zxzRBWNLj2zNEDeqwSkmgjwytSBMm8eAfSpCxpMwaBt
nk27qIPp9pQb4u+cfby3qVephMaziBtdw2rX8UMdmuFVA1gEsUkL9SqA/ti93XBM
gth1MsV5ys/vsCRdND6UDRV3jeg8+0exDKloHJwQ6cgg8NkzezF9sV3fWgEXelA2
yC87E3I7ewhlS4XJNrFK6jph6mtqcoYvXgHSSSkciSANlhroFoWOSLczq4WeOgRd
BJOrWYlYL6VyfUw=
=Wou+
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fix from Andreas Gruenbacher:
"Fix a memory leak on filesystem withdraw.
We didn't detect this bug because we have slab merging on by default
(CONFIG_SLAB_MERGE_DEFAULT). Adding 'slub_nomerge' to the kernel
command line exposed the problem"
* tag 'gfs2-v5.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: add some much needed cleanup for log flushes that fail
a 64-bit architecture with a 32-bit ino_t, a patch to disallow leases
to avoid potential data integrity issues when CephFS is re-exported
via NFS or CIFS and a fix for the bulk of W=1 compilation warnings.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl9I9UITHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi5HcB/0XKCF9pJmHM7vyMpZvALaNn97B7jgD
o4Lq0RBnIyum0UltnxvrROO1LGyurFH0hWofRykEWXy8Qt/9fZTddClOmY+6W+7P
A1MVInSxCXpNHj5y1uMuwhWkLmuNWnW1aKJXBn8tXKsmrYQM3SmbCMRx8a2GxJk9
mxl9zAtTRsih0AovRbde93i4FMpeXsyDk9EcyiJFcgnDjJpCAXgN9smu9xjPskOI
rrFQBjsrYB04FVuv5lEB/xZI/2QLM2FqzlpIRsa6udtYRsDnOtgIbLq8p0sNJnC6
QFxDaWFUJAXN/b1UWZz8yw2Y7nF7r47Hg6fDM4PBQ09PQEbIQKY+qn3R
=6+bm
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.9-rc3' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"We have an inode number handling change, prompted by s390x which is a
64-bit architecture with a 32-bit ino_t, a patch to disallow leases to
avoid potential data integrity issues when CephFS is re-exported via
NFS or CIFS and a fix for the bulk of W=1 compilation warnings"
* tag 'ceph-for-5.9-rc3' of git://github.com/ceph/ceph-client:
ceph: don't allow setlease on cephfs
ceph: fix inode number handling on arches with 32-bit ino_t
libceph: add __maybe_unused to DEFINE_CEPH_FEATURE
For SMB1, the DFS flag should be checked against tcon->Flags rather
than tcon->share_flags. While at it, add an is_tcon_dfs() helper to
check for DFS capability in a more generic way.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
When a usrjquota or grpjquota mount option is used multiple times, we
will leak memory allocated for the file name. Make sure the last setting
is used and all the previous ones are properly freed.
Reported-by: syzbot+c9e294bbe0333a6b7640@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Use kvzalloc() in udf_sb_alloc_bitmap() instead of open-coding it.
Size computation wrapped in struct_size() macro to prevent potential
integer overflows.
Link: https://lore.kernel.org/r/20200827221652.64660-1-efremov@linux.com
Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Remove linux/fiemap.h which is included more than once
Link: https://lore.kernel.org/r/20200819025434.65763-1-wanghai38@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
These events happen inline from submission, so there's no need to
bounce them through the original task. Just set them up for retry
and issue retry directly instead of going over task_work.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This normally isn't hit, as polling is mostly done on NVMe with deep
queue depths. But if we do run into request starvation, we need to
ensure that retries are properly serialized.
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch use free_con() functionality to free the listen connection if
listen fails. It also fixes an issue that a freed resource is still part
of the connection_hash as hlist_del() is not called in this case. The
only difference is that free_con() handles othercon as well, but this is
never been set for the listen connection.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch adds free of possible other writequeue entries in othercon
member of struct connection.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch just move the free of struct connection member writequeue
into the functionality when struct connection will be freed instead of
doing two iterations.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch fixes the following memory detected by kmemleak and umount
gfs2 filesystem which removed the last lockspace:
unreferenced object 0xffff9264f482f600 (size 192):
comm "dlm_controld", pid 325, jiffies 4294690276 (age 48.136s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 6e 6f 64 65 73 00 00 00 ........nodes...
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000060481d7>] make_space+0x41/0x130
[<000000008d905d46>] configfs_mkdir+0x1a2/0x5f0
[<00000000729502cf>] vfs_mkdir+0x155/0x210
[<000000000369bcf1>] do_mkdirat+0x6d/0x110
[<00000000cc478a33>] do_syscall_64+0x33/0x40
[<00000000ce9ccf01>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
The patch just remembers the "nodes" entry pointer in space as I think
it's created as subdirectory when parent "spaces" is created. In
function drop_space() we will lost the pointer reference to nds because
configfs_remove_default_groups(). However as this subdirectory is always
available when "spaces" exists it will just be freed when "spaces" will be
freed.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
There are some problems with the connections_lock. During my
experiements I saw sometimes circular dependencies with sock_lock.
The reason here might be code parts which runs nodeid2con() before
or after sock_lock is acquired.
Another issue are missing locks in for_conn() iteration. Maybe this
works fine because for_conn() is running in a context where
connection_hash cannot be manipulated by others anymore.
However this patch changes the connection_hash to be protected by
sleepable rcu. The hotpath function __find_con() is implemented
lockless as it is only a reader of connection_hash and this hopefully
fixes the circular locking dependencies. The iteration for_conn() will
still call some sleepable functionality, that's why we use sleepable rcu
in this case.
This patch removes the kmemcache functionality as I think I need to
make some free() functionality via call_rcu(). However allocation time
isn't here an issue. The dlm_allow_con will not be protected by a lock
anymore as I think it's enough to just set and flush workqueues
afterwards.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch moves the dlm workqueue dlm synchronization before shutdown
handling. The patch just flushes all pending work before starting to
shutdown the connection. At least for the send_workqeue we should flush
the workqueue to make sure there is no new connection handling going on
as dlm_allow_conn switch is turned to false before.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
For mounts that have the new "nosymfollow" option, don't follow symlinks
when resolving paths. The new option is similar in spirit to the
existing "nodev", "noexec", and "nosuid" options, as well as to the
LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD
variants have been supporting the "nosymfollow" mount option for a long
time with equivalent implementations.
Note that symlinks may still be created on file systems mounted with
the "nosymfollow" option present. readlink() remains functional, so
user space code that is aware of symlinks can still choose to follow
them explicitly.
Setting the "nosymfollow" mount option helps prevent privileged
writers from modifying files unintentionally in case there is an
unexpected link along the accessed path. The "nosymfollow" option is
thus useful as a defensive measure for systems that need to deal with
untrusted file systems in privileged contexts.
More information on the history and motivation for this patch can be
found here:
https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-docs/hardening-against-malicious-stateful-data#TOC-Restricting-symlink-traversal
Signed-off-by: Mattias Nissler <mnissler@chromium.org>
Signed-off-by: Ross Zwisler <zwisler@google.com>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl9HwPIACgkQ+7dXa6fL
C2vHdw//S5s+nPNuJUpz3aeYDC1Yl1YP+r2+XBfcCNem504GKfsvTBbLo7FYP6Oc
TCLJouPxoNSWAtlrJ86YJu8EcFdYNI4w6mCHncIrLXiiIZhsAeUMAw1GiASL5VUr
QhLjJJU4zUBg3/RWLRC1pUXBXAa3sYx5r1p8j3KdBAkAmAzXdFWSRFeffVeXIk46
FZ52YoQkplJYqDL+1oQCjJLetVJGGvc68AIOjJOLh6CAjrZbx1aiY79XA+flsoE9
B+KVjhEn0f9dVybgtF4rEqS88y1FJtSQgul6FOsys+Rx2I0G7ei8PB5TQ2wNCQhy
37gGfg1AV6apcqXKcHJdovVnApoQzzdCJESCgbMvsczqM88dP7pLWrPz4wpxs655
3t6qUyXI6yMOJhUWPOoFi30+4NM+MmsqrbYLZt2f/aXRhKnyaeaLd+VAIlkgz3Lj
sZuuHQsTH0xiR2uCsINWEc7d7UV6WjeVUJ77LzYiiRzEC2pX80tGu5EKHN8W9oBk
xRAuExXEGtyOR0p3/S3StkT490Tt4bIxUwnKYAaQZEydMCnTlWVbtIXwzmlCbCSB
p+P/7twR7LiQlHCTJU94jH3Jfpm3jpFpgjQZoZcx6ZLvtSH4+QP11nMY9+sRxEz1
hpB12AY7Wp2N7P+GhBPuXUCQpzW751aNZz4X9Etu6kRgwjuyaJM=
=FWHJ
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-fixes-20200820' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc, afs: Fix probing issues
Here are some fixes for rxrpc and afs to fix issues in the RTT measuring in
rxrpc and thence the Volume Location server probing in afs:
(1) Move the serial number of a received ACK into a local variable to
simplify the next patch.
(2) Fix the loss of RTT samples due to extra interposed ACKs causing
baseline information to be discarded too early. This is a particular
problem for afs when it sends a single very short call to probe a
server it hasn't talked to recently.
(3) Fix rxrpc_kernel_get_srtt() to indicate whether it actually has seen
any valid samples or not.
(4) Remove a field that's set/woken, but never read/waited on.
(5) Expose the RTT and other probe information through procfs to make
debugging of this stuff easier.
(6) Fix VL rotation in afs to only use summary information from VL probing
and not the probe running state (which gets clobbered when next a
probe is issued).
(7) Fix VL rotation to actually return the error aggregated from the probe
errors.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The fall through annotation comes after a return statement so it's not
reachable.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Don't leak kernel memory contents into the shortform attr fork.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The error message for inode transid is the same as for inode generation,
which makes us unable to detect the real problem.
Reported-by: Tyler Richmond <t.d.richmond@gmail.com>
Fixes: 496245cac5 ("btrfs: tree-checker: Verify inode item")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These are special extent buffers that get rewound in order to lookup
the state of the tree at a specific point in time. As such they do not
go through the normal initialization paths that set their lockdep class,
so handle them appropriately when they are created and before they are
locked.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When flipping over to the rw_semaphore I noticed I'd get a lockdep splat
in replace_path(), which is weird because we're swapping the reloc root
with the actual target root. Turns out this is because we're using the
root->root_key.objectid as the root id for the newly allocated tree
block when setting the lockdep class, however we need to be using the
actual owner of this new block, which is saved in owner.
The affected path is through btrfs_copy_root as all other callers of
btrfs_alloc_tree_block (which calls init_new_buffer) have root_objectid
== root->root_key.objectid .
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the conversion of the tree locks to rwsem I got the following
lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
------------------------------------------------------
compsize/11122 is trying to acquire lock:
ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90
but task is already holding lock:
ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-fs-00){++++}-{3:3}:
down_write_nested+0x3b/0x70
__btrfs_tree_lock+0x24/0x120
btrfs_search_slot+0x756/0x990
btrfs_lookup_inode+0x3a/0xb4
__btrfs_update_delayed_inode+0x93/0x270
btrfs_async_run_delayed_root+0x168/0x230
btrfs_work_helper+0xd4/0x570
process_one_work+0x2ad/0x5f0
worker_thread+0x3a/0x3d0
kthread+0x133/0x150
ret_from_fork+0x1f/0x30
-> #1 (&delayed_node->mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_delayed_update_inode+0x50/0x440
btrfs_update_inode+0x8a/0xf0
btrfs_dirty_inode+0x5b/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&mm->mmap_lock#2){++++}-{3:3}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
copy_to_sk.isra.32+0x121/0x300
search_ioctl+0x106/0x200
btrfs_ioctl_tree_search_v2+0x7b/0xf0
btrfs_ioctl+0x106f/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
&mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-fs-00);
lock(&delayed_node->mutex);
lock(btrfs-fs-00);
lock(&mm->mmap_lock#2);
*** DEADLOCK ***
1 lock held by compsize/11122:
#0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
stack backtrace:
CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? __might_fault+0x3e/0x90
? find_held_lock+0x72/0x90
__might_fault+0x68/0x90
? __might_fault+0x3e/0x90
_copy_to_user+0x1e/0x80
copy_to_sk.isra.32+0x121/0x300
? btrfs_search_forward+0x2a6/0x360
search_ioctl+0x106/0x200
btrfs_ioctl_tree_search_v2+0x7b/0xf0
btrfs_ioctl+0x106f/0x30a0
? __do_sys_newfstat+0x5a/0x70
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The problem is we're doing a copy_to_user() while holding tree locks,
which can deadlock if we have to do a page fault for the copy_to_user().
This exists even without my locking changes, so it needs to be fixed.
Rework the search ioctl to do the pre-fault and then
copy_to_user_nofault for the copying.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the conversion of the tree locks to rwsem I got the following
lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
------------------------------------------------------
btrfs-uuid/7955 is trying to acquire lock:
ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
but task is already holding lock:
ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (btrfs-uuid-00){++++}-{3:3}:
down_read_nested+0x3e/0x140
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_uuid_tree_add+0x89/0x2d0
btrfs_uuid_scan_kthread+0x330/0x390
kthread+0x133/0x150
ret_from_fork+0x1f/0x30
-> #0 (btrfs-root-00){++++}-{3:3}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
down_read_nested+0x3e/0x140
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_find_root+0x45/0x1b0
btrfs_read_tree_root+0x61/0x100
btrfs_get_root_ref.part.50+0x143/0x630
btrfs_uuid_tree_iterate+0x207/0x314
btrfs_uuid_rescan_kthread+0x12/0x50
kthread+0x133/0x150
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-uuid-00);
lock(btrfs-root-00);
lock(btrfs-uuid-00);
lock(btrfs-root-00);
*** DEADLOCK ***
1 lock held by btrfs-uuid/7955:
#0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
stack backtrace:
CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? __btrfs_tree_read_lock+0x39/0x180
? btrfs_root_node+0x1c/0x1d0
down_read_nested+0x3e/0x140
? __btrfs_tree_read_lock+0x39/0x180
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_find_root+0x45/0x1b0
btrfs_read_tree_root+0x61/0x100
btrfs_get_root_ref.part.50+0x143/0x630
btrfs_uuid_tree_iterate+0x207/0x314
? btree_readpage+0x20/0x20
btrfs_uuid_rescan_kthread+0x12/0x50
kthread+0x133/0x150
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x1f/0x30
This problem exists because we have two different rescan threads,
btrfs_uuid_scan_kthread which creates the uuid tree, and
btrfs_uuid_tree_iterate that goes through and updates or deletes any out
of date roots. The problem is they both do things in different order.
btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries
into the uuid_root. btrfs_uuid_tree_iterate() scans the uuid_root, but
then does a btrfs_get_fs_root() which can read from the tree_root.
It's actually easy enough to not be holding the path in
btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
it further down and re-start the search when we loop. So simply move
the path release before we add our entry to the uuid tree.
This also fixes a problem where we're holding a path open after we do
btrfs_end_transaction(), which has it's own problems.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
After commit 9afc66498a ("btrfs: block-group: refactor how we read one
block group item"), cache->length is being assigned after calling
btrfs_create_block_group_cache. This causes a problem since
set_free_space_tree_thresholds calculates the free-space threshold to
decide if the free-space tree should convert from extents to bitmaps.
The current code calls set_free_space_tree_thresholds with cache->length
being 0, which then makes cache->bitmap_high_thresh zero. This implies
the system will always use bitmap instead of extents, which is not
desired if the block group is not fragmented.
This behavior can be seen by a test that expects to repair systems
with FREE_SPACE_EXTENT and FREE_SPACE_BITMAP, but the current code only
created FREE_SPACE_BITMAP.
[FIX]
Call set_free_space_tree_thresholds after setting cache->length. There
is now a WARN_ON in set_free_space_tree_thresholds to help preventing
the same mistake to happen again in the future.
Link: https://github.com/kdave/btrfs-progs/issues/251
Fixes: 9afc66498a ("btrfs: block-group: refactor how we read one block group item")
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make sure we clear req->result, which was set to -EAGAIN for retry
purposes, when moving it to the reissue list. Otherwise we can end up
retrying a request more than once, which leads to weird results in
the io-wq handling (and other spots).
Cc: stable@vger.kernel.org
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A client should be able to handle getting an ERR_DELAY error
while doing a LOCK call to reclaim state due to delegation being
recalled. This is a transient error that can happen due to server
moving its volumes and invalidating its file location cache and
upon reference to it during the LOCK call needing to do an
expensive lookup (leading to an ERR_DELAY error on a PUTFH).
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The boundary test for the fixed-offset parts of xfs_attr_sf_entry in
xfs_attr_shortform_verify is off by one, because the variable array
at the end is defined as nameval[1] not nameval[].
Hence we need to subtract 1 from the calculation.
This can be shown by:
# touch file
# setfattr -n root.a file
and verifications will fail when it's written to disk.
This only matters for a last attribute which has a single-byte name
and no value, otherwise the combination of namelen & valuelen will
push endp further out and this test won't fail.
Fixes: 1e1bbd8e7e ("xfs: create structure verifier function for shortform xattrs")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The inode chunk allocation transaction reserves inobt_maxlevels-1
blocks to accommodate a full split of the inode btree. A full split
requires an allocation for every existing level and a new root
block, which means inobt_maxlevels is the worst case block
requirement for a transaction that inserts to the inobt. This can
lead to a transaction block reservation overrun when tmpfile
creation allocates an inode chunk and expands the inobt to its
maximum depth. This problem has been observed in conjunction with
overlayfs, which makes frequent use of tmpfiles internally.
The existing reservation code goes back as far as the Linux git repo
history (v2.6.12). It was likely never observed as a problem because
the traditional file/directory creation transactions also include
worst case block reservation for directory modifications, which most
likely is able to make up for a single block deficiency in the inode
allocation portion of the calculation. tmpfile support is relatively
more recent (v3.15), less heavily used, and only includes the inode
allocation block reservation as tmpfiles aren't linked into the
directory tree on creation.
Fix up the inode alloc block reservation macro and a couple of the
block allocator minleft parameters that enforce an allocation to
leave enough free blocks in the AG for a full inobt split.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The recent change to make insert range an atomic operation used the
incorrect transaction rolling mechanism. The explicit transaction
roll does not finish deferred operations. This means that intents
for rmapbt updates caused by extent shifts are not logged until the
final transaction commits. Thus if a crash occurs during an insert
range, log recovery might leave the rmapbt in an inconsistent state.
This was discovered by repeated runs of generic/455.
Update insert range to finish dfops on every shift iteration. This
is similar to collapse range and ensures that intents are logged
with the transactions that make associated changes.
Fixes: dd87f87d87 ("xfs: rework insert range into an atomic operation")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The man page for io_uring generally claims were consistent with what
preadv2 and pwritev2 accept, but turns out there's a slight discrepancy
in how offset == -1 is handled for pipes/streams. preadv doesn't allow
it, but preadv2 does. This currently causes io_uring to return -EINVAL
if that is attempted, but we should allow that as documented.
This change makes us consistent with preadv2/pwritev2 for just passing
in a NULL ppos for streams if the offset is -1.
Cc: stable@vger.kernel.org # v5.7+
Reported-by: Benedikt Ames <wisp3rwind@posteo.eu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Eliminate an oops introduced in v5.8
- Remove a duplicate #include added by nfsd-5.9
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAl8/4gUACgkQM2qzM29m
f5cxKBAAp7UjD3YNlLhSviowuOfYpWNjyk1cEQ6hWFA9oVeSfZfU3/axW8uYTHPm
QZ6ams6gjorP4CXwVkFGFHpTRg4CfVN9g5lKxrcjvELqNllWBhE9UupRgbX3+XBE
qselRI22M64o2tfDE+tPrDB8w8PwHmqrHwRydXfgiFlHk7nt6xD7NitaJBnPlYPM
21OBl6mrjLwtRwvX9n5wpy/+bfOTHbGV5VNez0fAfKXggNmRdt/UNROC4doLg4M0
2khAV3vgx49FRpCPL6SZPcBYd6zfrYOcj3iSf6wpxS5nTb2MifXFqz1MvKRTj863
gzvSmh7vuf0+EaOAXuLjCD9dURZpuG/k0vJGijOgaSt0+vNQHjIgZ1XRFHQtQCp4
zPJ/Qyk5k7uajHzcBFuNPUFAkOovH6LRoOzpqGvXhwaxrMPWti0LyyVKidVJrt/d
EtOKQR+HCN0zAwjadXSPK8Nw1PjMzplkF7TaxXvF2LdO/4vpEZZNoz+if59gRcFY
65h2++7y+0MCX8l83uUZfs+jQU2aR1w5a0DjVzi86xzJtyhr6gEyTj3Z6L9HIHwW
dnSpUmoiaCoN0eqxvEBjw0VEPqB806CuiUER0Jdd8k7mPk04fsQ/9+UsYyliSLEG
N56LFSWLXLHsySa2WkuB/ghzT2/Q0vFoZKXW0KNSD7W4C5XMxi4=
=czB3
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.9-1' of git://git.linux-nfs.org/projects/cel/cel-2.6
Pull nfs server fixes from Chuck Lever:
- Eliminate an oops introduced in v5.8
- Remove a duplicate #include added by nfsd-5.9
* tag 'nfsd-5.9-1' of git://git.linux-nfs.org/projects/cel/cel-2.6:
SUNRPC: remove duplicate include
nfsd: fix oops on mixed NFSv4/NFSv3 client access
Fixes include:
. revert binfmt_flat data offset removal
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEmsfM6tQwfNjBOxr3TiQVqaG9L4AFAl9FEmEACgkQTiQVqaG9
L4CzmBAAqqvgz04SYakQ1CLaeG2syWW3JM9XmPdYgoca+YJwm1SRaSpAxXJRQxfG
/4iQucbjFW7lD+KiemepHkYQapesNW8UaHdwN2YEemRMfjCj69WJ9wVh2jCyaBN8
t7Qg1CuK5L9TfjqXYvAFW5dyeO2/hNQuFGYpy6sSrE2MW1m1vC7pG8Ku+yhhRKn6
W6xt3+R9PVHtYszBf8aMMcc+G394C4K7B8cbZbyllayFMX59ZKyArZJx6A43FT1I
Lb/yaEh4uujnL1XnlT2LOy2RA7heJpfCFH8fwz2oA5tMhzqHIpEZ82iCaaMyRvXe
fbx2IDXD/NqwxgrbU1PuUP9lyUuWTiqczwwv5FQDFoEBH9My04QR6Do/fUojWoIz
ah8SzS2H96o20CJZXQaUpwdLXyj9E5OCME0J2EBIu6RkM2kbaDoRHwKIKEyFx0qx
vE/vC/KY2b81xwQgmKQa28HhCmQRlQvLBZTm8pO41ut3y6zgGzFs9/MsAmcQrrAk
0dsIlOR6ApKDPM2QCF8WXB9cSY2D7ZdEPd1Sujam65n0rXfESPBpQmr5Izx0j6KZ
9p1FVphJT1UYa1SZiCZq8BYz4vIqgKvJgMTFW2p0WtNC9T7Dc9/uBmWxnrLdbcqe
Y2+IyialY2YAvnIEkQV7dBrSaAkslxwe9a723dbhx4vkq2HBU0Y=
=ENzl
-----END PGP SIGNATURE-----
Merge tag 'm68knommu-for-v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
Pull m68knommu fix from Greg Ungerer:
"Only a single fix for the binfmt_flat loader (reverting a recent
change)"
* tag 'm68knommu-for-v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
binfmt_flat: revert "binfmt_flat: don't offset the data start"
We need to call kiocb_done() for any ret < 0 to ensure that we always
get the proper -ERESTARTSYS (and friends) transformation done.
At some point this should be tied into general error handling, so we
can get rid of the various (mostly network) related commands that check
and perform this substitution.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's no point in using the poll handler if we can't do a nonblocking
IO attempt of the operation, since we'll need to go async anyway. In
fact this is actively harmful, as reading from eg pipes won't return 0
to indicate EOF.
Cc: stable@vger.kernel.org # v5.7+
Reported-by: Benedikt Ames <wisp3rwind@posteo.eu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We do the initial accounting of locked_vm and pinned_vm before we have
setup ctx->sqo_mm, which means we can end up having not accounted the
memory at setup time, but still decrement it when we exit. This causes
an imbalance in the accounting.
Setup ctx->sqo_mm earlier in io_uring_create(), before we do the first
accounting of mm->{locked,pinned}_vm. This also unifies the state
grabbing for the ctx, and eliminates a failure case in
io_sq_offload_start().
Fixes: f74441e631 ("io_uring: account locked memory before potential error case")
Reported-by: Robert M. Muncrief <rmuncrief@humanavance.com>
Reported-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Robert M. Muncrief <rmuncrief@humanavance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some consumers of the iov_iter will return an error, but still have
bytes consumed in the iterator. This is an issue for -EAGAIN, since we
rely on a sane iov_iter state across retries.
Fix this by ensuring that we revert consumed bytes, if any, if the file
operations have consumed any bytes from iterator. This is similar to what
generic_file_read_iter() does, and is always safe as we have the previous
bytes count handy already.
Fixes: ff6165b2d7 ("io_uring: retain iov_iter state over io_read/io_write calls")
Reported-by: Dmitry Shulyak <yashulyak@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix the check for the mainline vboxsf code being used with the old
mount.vboxsf mount binary from the out-of-tree vboxsf version doing
a comparison between signed and unsigned data types.
This fixes the following smatch warnings:
fs/vboxsf/super.c:390 vboxsf_parse_monolithic() warn: impossible condition '(options[1] == (255)) => ((-128)-127 == 255)'
fs/vboxsf/super.c:391 vboxsf_parse_monolithic() warn: impossible condition '(options[2] == (254)) => ((-128)-127 == 254)'
fs/vboxsf/super.c:392 vboxsf_parse_monolithic() warn: impossible condition '(options[3] == (253)) => ((-128)-127 == 253)'
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently, io_uring's recvmsg subscribes to both POLLERR and POLLIN. In
the context of TCP tx zero-copy, this is inefficient since we are only
reading the error queue and not using recvmsg to read POLLIN responses.
This patch was tested by using a simple sending program to call recvmsg
using io_uring with MSG_ERRQUEUE set and verifying with printks that the
POLLIN is correctly unset when the msg flags are MSG_ERRQUEUE.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Luke Hsiao <lukehsiao@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9D5EkACgkQxWXV+ddt
WDto+g/6A/2QzxhgOmqqHTiDvn3DkL60XfjB6lmq3NEvinrST+VH20EoX/EuX2Kn
u2+gMiWrgBUwlvERkoSasxdJf/6dCCc+9zYDjjKkAxCckENT85Np71o3iEc7Z5z+
LFgS26mt6aYlCCHyIsHutzHtK2MKiUz7/oaUYZMJBHHkKS/5hL1mzIbwiWAqfU2H
q0iMz9L2mjp1kZnpwa/yhg/NJ/oGZsKm3UPGDhdc0RlCWHBbDXHFk1wvNRo/yKQW
l+yy0dh6PAZ45pRL0/WZwvOzcAglb+uSmwa64UOvwio4Na9P7oAcBzTFmtbBtvP4
WBrOUPCTzkvgQcmoAsWFpD4nrzgW4oS71EICTOIRlPx7A86TP3wYpFEygUlLCoZC
Pd4e9mPClmW78hcRT12eJeGcJIzgoKWhR8597jNUEYz3R5T2wKHOcNnq9a1E1PLv
zR+5MFShsylUHd7HbMC1O86XnfXe5esegNQMvx36kTS+cR9Dyt5EWMNIAYK4BPM3
/tXWZRqlZPOh3T7DZ4QR5oSSDDNq7ROTdv9jmsleno+woG0MNDYsA7jCbeJnGTmI
CtTUP+p41otyM2lFZjV8PG/XyXDKb3UfU5gcsDOZdGP5S0tkyBiKSA6eqhz6DaTi
fQOLGZdkNpNN/burbMq7d7YEHr3F6LC17U3L4k5V4MTAm2lp7ZQ=
=ONgI
-----END PGP SIGNATURE-----
Merge tag 'for-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix swapfile activation on subvolumes with deleted snapshots
- error value mixup when removing directory entries from tree log
- fix lzo compression level reset after previous level setting
- fix space cache memory leak after transaction abort
- fix const function attribute
- more error handling improvements
* tag 'for-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: detect nocow for swap after snapshot delete
btrfs: check the right error variable in btrfs_del_dir_entries_in_log
btrfs: fix space cache memory leak after transaction abort
btrfs: use the correct const function attribute for btrfs_get_num_csums
btrfs: reset compression level for lzo on remount
btrfs: handle errors from async submission
Leases don't currently work correctly on kcephfs, as they are not broken
when caps are revoked. They could eventually be implemented similarly to
how we did them in libcephfs, but for now don't allow them.
[ idryomov: no need for simple_nosetlease() in ceph_dir_fops and
ceph_snapdir_fops ]
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tuan and Ulrich mentioned that they were hitting a problem on s390x,
which has a 32-bit ino_t value, even though it's a 64-bit arch (for
historical reasons).
I think the current handling of inode numbers in the ceph driver is
wrong. It tries to use 32-bit inode numbers on 32-bit arches, but that's
actually not a problem. 32-bit arches can deal with 64-bit inode numbers
just fine when userland code is compiled with LFS support (the common
case these days).
What we really want to do is just use 64-bit numbers everywhere, unless
someone has mounted with the ino32 mount option. In that case, we want
to ensure that we hash the inode number down to something that will fit
in 32 bits before presenting the value to userland.
Add new helper functions that do this, and only do the conversion before
presenting these values to userland in getattr and readdir.
The inode table hashvalue is changed to just cast the inode number to
unsigned long, as low-order bits are the most likely to vary anyway.
While it's not strictly required, we do want to put something in
inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on
the size of the ino_t type.
NOTE: This is a user-visible change on 32-bit arches:
1/ inode numbers will be seen to have changed between kernel versions.
32-bit arches will see large inode numbers now instead of the hashed
ones they saw before.
2/ any really old software not built with LFS support may start failing
stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we
can do about these, but hopefully the intersection of people running
such code on ceph will be very small.
The workaround for both problems is to mount with "-o ino32".
[ idryomov: changelog tweak ]
URL: https://tracker.ceph.com/issues/46828
Reported-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Reported-and-Tested-by: Tuan Hoang1 <Tuan.Hoang1@ibm.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When a log flush fails due to io errors, it signals the failure but does
not clean up after itself very well. This is because buffers are added to
the transaction tr_buf and tr_databuf queue, but the io error causes
gfs2_log_flush to bypass the "after_commit" functions responsible for
dequeueing the bd elements. If the bd elements are added to the ail list
before the error, function ail_drain takes care of dequeueing them.
But if they haven't gotten that far, the elements are forgotten and
make the transactions unable to be freed.
This patch introduces new function trans_drain which drains the bd
elements from the transaction so they can be freed properly.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
binfmt_flat loader uses the gap between text and data to store data
segment pointers for the libraries. Even in the absence of shared
libraries it stores at least one pointer to the executable's own data
segment. Text and data can go back to back in the flat binary image and
without offsetting data segment last few instructions in the text
segment may get corrupted by the data segment pointer.
Fix it by reverting commit a2357223c5 ("binfmt_flat: don't offset the
data start").
Cc: stable@vger.kernel.org
Fixes: a2357223c5 ("binfmt_flat: don't offset the data start")
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>
Don't forget to update wqe->hash_tail after cancelling a pending work
item, if it was hashed.
Cc: stable@vger.kernel.org # 5.7+
Reported-by: Dmitry Shulyak <yashulyak@gmail.com>
Fixes: 86f3cd1b58 ("io-wq: handle hashed writes in chains")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If an application is doing reads on signalfd, and we arm the poll handler
because there's no data available, then the wakeup can recurse on the
tasks sighand->siglock as the signal delivery from task_work_add() will
use TWA_SIGNAL and that attempts to lock it again.
We can detect the signalfd case pretty easily by comparing the poll->head
wait_queue_head_t with the target task signalfd wait queue. Just use
normal task wakeup for this case.
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull epoll fixes from Al Viro:
"Fix reference counting and clean up exit paths"
* 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
do_epoll_ctl(): clean the failure exits up a bit
epoll: Keep a reference on files added to the check list
When adding a new fd to an epoll, and that this new fd is an
epoll fd itself, we recursively scan the fds attached to it
to detect cycles, and add non-epool files to a "check list"
that gets subsequently parsed.
However, this check list isn't completely safe when deletions
can happen concurrently. To sidestep the issue, make sure that
a struct file placed on the check list sees its f_count increased,
ensuring that a concurrent deletion won't result in the file
disapearing from under our feet.
Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9AMgoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsqjEAC3hNnlu7BwVRFMeJzOyxZUqvvtT2ktYTZs
9duV7qezounq022nKihl0/wy0KWOxfg4HDT492la54nKYAPf6xhszrbuAUx9pW8Z
pwcUccaci7nB29V4a4wedtxz+jegCN2LXbRNk4DOpchlVKULfrOIcfW5/rL/7gkp
15n/AAIZNChJ6y9dJDqYRoiF152/6uk7t+BolU/+W9QCKi2PW40nTOgfkzSnBvJV
WaHlYHKAOUaiurIUjZQolgohNNBUzNwWtF/4HSeT5n8c94gSpI3IKFkmNCjxQQ96
I0gjJZIss7N8ysKFBy3WALqx9FqxSWS3pi/G9fai4o/VPEFj+fhfBTh+H1fzLaoM
V+oOHMCt5Cwlw+n8vSgtUU0JF6ZnmoolfpHWPchtCJyQ42i/gt41MrePdu/tUC+n
tV7wvftuM/+AN36vDDgbDc5BTKjCnRQSHz80M3EwUznJJjaeTAPxnQ+pVlpN9IS+
sbywlg+Xake9F19qA/astAH9n3U2+m3HdmoIXfG1vrXKFt/I9d36gh5hzlCh5//5
zAu1/iwy1fAlaI4CWRR14+e8/ozu5SCxlswsI79sGZcFuv+WQsQ84q297rq8v0Wr
HdtmiRDGlBFfcuiEOjoSzSEwMWPc1F+8EcmiEp8SZBglKDM+kQI9XMKKXakqh7K0
yEWGAMm+1g==
=dLiS
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.9-2020-08-21' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Make sure the head link cancelation includes async work
- Get rid of kiocb_wait_page_queue_init(), makes no sense to have it as
a separate function since you moved it into io_uring itself
- io_import_iovec cleanups (Pavel, me)
- Use system_unbound_wq for ring exit work, to avoid spawning tons of
these if we have tons of rings exiting at the same time
- Fix req->flags overflow flag manipulation (Pavel)
* tag 'io_uring-5.9-2020-08-21' of git://git.kernel.dk/linux-block:
io_uring: kill extra iovec=NULL in import_iovec()
io_uring: comment on kfree(iovec) checks
io_uring: fix racy req->flags modification
io_uring: use system_unbound_wq for ring exit work
io_uring: cleanup io_import_iovec() of pre-mapped request
io_uring: get rid of kiocb_wait_page_queue_init()
io_uring: find and cancel head link async work on files exit
systems, especially when the file system or files which are highly
fragmented. There is a new mount option, prefetch_block_bitmaps which
will pull in the block bitmaps and set up the in-memory buddy bitmaps
when the file system is initially mounted.
Beyond that, a lot of bug fixes and cleanups. In particular, a number
of changes to make ext4 more robust in the face of write errors or
file system corruptions.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl8/Q9YACgkQ8vlZVpUN
gaPz+wgAkiWwpge0pfcukABW9FcHK9R82IPggA/NnFu0I+3trpqVQP8mYWqg+1l7
X0W6B6GHMcITGdwxVDNGHHv0WabXCqFPT0ENwW1cnl9UL6I91Ev2NjmG9HP6hVZa
g3+NyXJwiOP38xsxpPJGPoYFw2wZyv8/e41MMnsE6goYjMmB04sHvXCUQkbN41Fn
3CMdsiueYZDAKflvAlL50Jy7Imz5tq9oy81/z+amqvWo4T0U8zRwQuf25nBAhr25
1WdT4CbCNGO2Qwyu9X+t/KGNVIQhCctkx/yz71l3p2piEGkw/XE4VJNrkmWb0zN7
k9F5uGOZlAlQEzx+5PN//Qtz1Db0QQ==
=E6vv
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Improvements to ext4's block allocator performance for very large file
systems, especially when the file system or files which are highly
fragmented. There is a new mount option, prefetch_block_bitmaps which
will pull in the block bitmaps and set up the in-memory buddy bitmaps
when the file system is initially mounted.
Beyond that, a lot of bug fixes and cleanups. In particular, a number
of changes to make ext4 more robust in the face of write errors or
file system corruptions"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits)
ext4: limit the length of per-inode prealloc list
ext4: reorganize if statement of ext4_mb_release_context()
ext4: add mb_debug logging when there are lost chunks
ext4: Fix comment typo "the the".
jbd2: clean up checksum verification in do_one_pass()
ext4: change to use fallthrough macro
ext4: remove unused parameter of ext4_generic_delete_entry function
mballoc: replace seq_printf with seq_puts
ext4: optimize the implementation of ext4_mb_good_group()
ext4: delete invalid comments near ext4_mb_check_limits()
ext4: fix typos in ext4_mb_regular_allocator() comment
ext4: fix checking of directory entry validity for inline directories
fs: prevent BUG_ON in submit_bh_wbc()
ext4: correctly restore system zone info when remount fails
ext4: handle add_system_zone() failure in ext4_setup_system_zone()
ext4: fold ext4_data_block_valid_rcu() into the caller
ext4: check journal inode extents more carefully
ext4: don't allow overlapping system zones
ext4: handle error of ext4_setup_system_zone() on remount
ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp()
...
If an error occurs during the construction of an afs superblock, it's
possible that an error occurs after a superblock is created, but before
we've created the root dentry. If the superblock has a dynamic root
(ie. what's normally mounted on /afs), the afs_kill_super() will call
afs_dynroot_depopulate() to unpin any created dentries - but this will
oops if the root hasn't been created yet.
Fix this by skipping that bit of code if there is no root dentry.
This leads to an oops looking like:
general protection fault, ...
KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f]
...
RIP: 0010:afs_dynroot_depopulate+0x25f/0x529 fs/afs/dynroot.c:385
...
Call Trace:
afs_kill_super+0x13b/0x180 fs/afs/super.c:535
deactivate_locked_super+0x94/0x160 fs/super.c:335
afs_get_tree+0x1124/0x1460 fs/afs/super.c:598
vfs_get_tree+0x89/0x2f0 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x1387/0x2070 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount fs/namespace.c:3390 [inline]
__x64_sys_mount+0x27f/0x300 fs/namespace.c:3390
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
which is oopsing on this line:
inode_lock(root->d_inode);
presumably because sb->s_root was NULL.
Fixes: 0da0b7fd73 ("afs: Display manually added cells in dynamic root mount")
Reported-by: syzbot+c1eff8205244ae7e11a6@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a regression introduced by the patch "migrate from ll_rw_block
usage to BIO".
Bio_alloc() is limited to 256 pages (1 Mbyte). This can cause a failure
when reading 1 Mbyte block filesystems. The problem is a datablock can be
fully (or almost uncompressed), requiring 256 pages, but, because blocks
are not aligned to page boundaries, it may require 257 pages to read.
Bio_kmalloc() can handle 1024 pages, and so use this for the edge
condition.
Fixes: 93e72b3c61 ("squashfs: migrate from ll_rw_block usage to BIO")
Reported-by: Nicolas Prochazka <nicolas.prochazka@gmail.com>
Reported-by: Tomoatsu Shimada <shimada@walbrix.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Cc: Philippe Liard <pliard@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Adrien Schildknecht <adrien+dev@schischi.me>
Cc: Daniel Rosenberg <drosen@google.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200815035637.15319-1-phillip@squashfs.org.uk
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
romfs has a superblock field that limits the size of the filesystem; data
beyond that limit is never accessed.
romfs_dev_read() fetches a caller-supplied number of bytes from the
backing device. It returns 0 on success or an error code on failure;
therefore, its API can't represent short reads, it's all-or-nothing.
However, when romfs_dev_read() detects that the requested operation would
cross the filesystem size limit, it currently silently truncates the
requested number of bytes. This e.g. means that when the content of a
file with size 0x1000 starts one byte before the filesystem size limit,
->readpage() will only fill a single byte of the supplied page while
leaving the rest uninitialized, leaking that uninitialized memory to
userspace.
Fix it by returning an error code instead of truncating the read when the
requested read operation would go beyond the end of the filesystem.
Fixes: da4458bda2 ("NOMMU: Make it possible for RomFS to use MTD devices directly")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200818013202.2246365-1-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
detecting a must cow condition which is not exactly accurate, but saves
unnecessary tree traversal. The incorrect assumption is that if the
extent was created in a generation smaller than the last snapshot
generation, it must be referenced by that snapshot. That is true, except
the snapshot could have since been deleted, without affecting the last
snapshot generation.
The original patch claimed a performance win from this check, but it
also leads to a bug where you are unable to use a swapfile if you ever
snapshotted the subvolume it's in. Make the check slower and more strict
for the swapon case, without modifying the general cow checks as a
compromise. Turning swap on does not seem to be a particularly
performance sensitive operation, so incurring a possibly unnecessary
btrfs_search_slot seems worthwhile for the added usability.
Note: Until the snapshot is competely cleaned after deletion,
check_committed_refs will still cause the logic to think that cow is
necessary, so the user must until 'btrfs subvolu sync' finished before
activating the swapfile swapon.
CC: stable@vger.kernel.org # 5.4+
Suggested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
With my new locking code dbench is so much faster that I tripped over a
transaction abort from ENOSPC. This turned out to be because
btrfs_del_dir_entries_in_log was checking for ret == -ENOSPC, but this
function sets err on error, and returns err. So instead of properly
marking the inode as needing a full commit, we were returning -ENOSPC
and aborting in __btrfs_unlink_inode. Fix this by checking the proper
variable so that we return the correct thing in the case of ENOSPC.
The ENOENT needs to be checked, because btrfs_lookup_dir_item_index()
can return -ENOENT if the dir item isn't in the tree log (which would
happen if we hadn't fsync'ed this guy). We actually handle that case in
__btrfs_unlink_inode, so it's an expected error to get back.
Fixes: 4a500fd178 ("Btrfs: Metadata ENOSPC handling for tree log")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note and comment about ENOENT ]
Signed-off-by: David Sterba <dsterba@suse.com>
The afs_put_operation() function needs to put the reference to the key
that's authenticating the operation.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The error handling in the VL server rotation in the case of there being no
contactable servers is not correct. In such a case, the records of all the
servers in the list are scanned and the errors and abort codes are mapped
and prioritised and one error is chosen. This is then forgotten and the
default error is used (EDESTADDRREQ).
Fix this by using the calculated error.
Also we need to note whether a server responded on one of its endpoints so
that we can priorise an error from an abort message over local and network
errors.
Fixes: 4584ae96ae ("afs: Fix missing net error handling")
Signed-off-by: David Howells <dhowells@redhat.com>
Don't use the running state for VL server probes to make decisions about
which server to use as the state is cleared at the start of a probe and
intermediate values might also be misleading.
Instead, add a separate 'latest known' rtt in the afs_vlserver struct and a
flag to indicate if the server is known to be responding and update these
as and when we know what to change them to.
Fixes: 3bf0fb6f33 ("afs: Probe multiple fileservers simultaneously")
Signed-off-by: David Howells <dhowells@redhat.com>
Convert various bitfields in afs_vlserver::probe to a mask and then expose
this and some other bits of information through /proc/net/afs/<cell>/vlservers
to make it easier to debug VL server communication issues.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix rxrpc_kernel_get_srtt() to indicate the validity of the returned
smoothed RTT. If we haven't had any valid samples yet, the SRTT isn't
useful.
Fixes: c410bf0193 ("rxrpc: Fix the excessive initial retransmission timeout")
Signed-off-by: David Howells <dhowells@redhat.com>
If io_import_iovec() returns an error, return iovec is undefined and
must not be used, so don't set it to NULL when failing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kfree() handles NULL pointers well, but io_{read,write}() checks it
because of performance reasons. Leave a comment there for those who are
tempted to patch it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Setting and clearing REQ_F_OVERFLOW in io_uring_cancel_files() and
io_cqring_overflow_flush() are racy, because they might be called
asynchronously.
REQ_F_OVERFLOW flag in only needed for files cancellation, so if it can
be guaranteed that requests _currently_ marked inflight can't be
overflown, the problem will be solved with removing the flag
altogether.
That's how the patch works, it removes inflight status of a request
in io_cqring_fill_event() whenever it should be thrown into CQ-overflow
list. That's Ok to do, because no opcode specific handling can be done
after io_cqring_fill_event(), the same assumption as with "struct
io_completion" patches.
And it already have a good place for such cleanups, which is
io_clean_op(). A nice side effect of this is removing this inflight
check from the hot path.
note on synchronisation: now __io_cqring_fill_event() may be taking two
spinlocks simultaneously, completion_lock and inflight_lock. It's fine,
because we never do that in reverse order, and CQ-overflow of inflight
requests shouldn't happen often.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently use system_wq, which is unbounded in terms of number of
workers. This means that if we're exiting tons of rings at the same
time, then we'll briefly spawn tons of event kworkers just for a very
short blocking time as the rings exit.
Use system_unbound_wq instead, which has a sane cap on the concurrency
level.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a transaction aborts it can cause a memory leak of the pages array of
a block group's io_ctl structure. The following steps explain how that can
happen:
1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
and it's about to start writing out dirty extent buffers;
2) Transaction N + 1 already started and another task, task A, just called
btrfs_commit_transaction() on it;
3) Block group B was dirtied (extents allocated from it) by transaction
N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
very beginning of the transaction commit, it starts writeback for the
block group's space cache by calling btrfs_write_out_cache(), which
allocates the pages array for the block group's io_ctl with a call to
io_ctl_init(). Block group A is added to the io_list of transaction
N + 1 by btrfs_start_dirty_block_groups();
4) While transaction N's commit is writing out the extent buffers, it gets
an IO error and aborts transaction N, also setting the file system to
RO mode;
5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
btrfs_commit_transaction() and has set transaction N + 1 state to
TRANS_STATE_COMMIT_START. Immediately after that it checks that the
filesystem was turned to RO mode, due to transaction N's abort, and
jumps to the "cleanup_transaction" label. After that we end up at
btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
That helper finds block group B in the transaction's io_list but it
never releases the pages array of the block group's io_ctl, resulting in
a memory leak.
In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
array points to pages that were already released by us at
__btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
up freeing the pages array only after waiting for the ordered extent to
complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
that. But in the transaction abort case we don't wait for the space cache's
ordered extent to complete through a call to btrfs_wait_cache_io(), so
that's why we end up with a memory leak - we wait for the ordered extent
to complete indirectly by shutting down the work queues and waiting for
any jobs in them to complete before returning from close_ctree().
We can solve the leak simply by freeing the pages array right after
releasing the pages (with the call to io_ctl_drop_pages()) at
__btrfs_write_out_cache(), since we will never use it anymore after that
and the pages array points to already released pages at that point, which
is currently not a problem since no one will use it after that, but not a
good practice anyway since it can easily lead to use-after-free issues.
So fix this by freeing the pages array right after releasing the pages at
__btrfs_write_out_cache().
This issue can often be reproduced with test case generic/475 from fstests
and kmemleak can detect it and reports it with the following trace:
unreferenced object 0xffff9bbf009fa600 (size 512):
comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
hex dump (first 32 bytes):
00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff ..|M=...@.|M=...
80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff ..|M=.....|M=...
backtrace:
[<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
[<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
[<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
[<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
[<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
[<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
[<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
[<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
[<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
[<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
[<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
[<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
[<0000000043ace2c9>] do_syscall_64+0x33/0x80
[<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The build robot reports
compiler: h8300-linux-gcc (GCC) 9.3.0
In file included from fs/btrfs/tests/extent-map-tests.c:8:
>> fs/btrfs/tests/../ctree.h:2166:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
2166 | size_t __const btrfs_get_num_csums(void);
| ^~~~~~~
The function attribute for const does not follow the expected scheme and
in this case is confused with a const type qualifier.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently a user can set mount "-o compress" which will set the
compression algorithm to zlib, and use the default compress level for
zlib (3):
relatime,compress=zlib:3,space_cache
If the user remounts the fs using "-o compress=lzo", then the old
compress_level is used:
relatime,compress=lzo:3,space_cache
But lzo does not expose any tunable compression level. The same happens
if we set any compress argument with different level, also with zstd.
Fix this by resetting the compress_level when compress=lzo is
specified. With the fix applied, lzo is shown without compress level:
relatime,compress=lzo,space_cache
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs' async submit mechanism is able to handle errors in the submission
path and the meta-data async submit function correctly passes the error
code to the caller.
In btrfs_submit_bio_start() and btrfs_submit_bio_start_direct_io() we're
not handling the errors returned by btrfs_csum_one_bio() correctly though
and simply call BUG_ON(). This is unnecessary as the caller of these two
functions - run_one_async_start - correctly checks for the return values
and sets the status of the async_submit_bio. The actual bio submission
will be handled later on by run_one_async_done only if
async_submit_bio::status is 0, so the data won't be written if we
encountered an error in the checksum process.
Simply return the error from btrfs_csum_one_bio() to the async submitters,
like it's done in btree_submit_bio_start().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the scenario of writing sparse files, the per-inode prealloc list may
be very long, resulting in high overhead for ext4_mb_use_preallocated().
To circumvent this problem, we limit the maximum length of per-inode
prealloc list to 512 and allow users to modify it.
After patching, we observed that the sys ratio of cpu has dropped, and
the system throughput has increased significantly. We created a process
to write the sparse file, and the running time of the process on the
fixed kernel was significantly reduced, as follows:
Running time on unfixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m2.051s
user 0m0.008s
sys 0m2.026s
Running time on fixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m0.471s
user 0m0.004s
sys 0m0.395s
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reorganize the if statement of ext4_mb_release_context(), make it
easier to read.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/5439ac6f-db79-ad68-76c1-a4dda9aa0cc3@gmail.com
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Lost chunks are when some other process raced with the current thread
to grab a particular block allocation. Add mb_debug log for
developers who wants to see how often this is happening for a
particular workload.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/0a165ac0-1912-aebd-8a0d-b42e7cd1aea1@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove the unnecessary chksum_err and checksum_seen variables as well as
some redundant code to make the function easier to understand.
[ With changes suggested by jack@ and tytso@ ]
Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200819122955.33526-1-luoshijie1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
io_rw_prep_async() goes through a dance of clearing req->io, calling
the iovec import, then re-setting req->io. Provide an internal helper
that does the right thing without needing state tweaked to get there.
This enables further cleanups in io_read, io_write, and
io_resubmit_prep(), but that's left for another time.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The ext4_generic_delete_entry function does not use the parameter
handle, so it can be removed.
Signed-off-by: Kyoungho Koo <rnrudgh@gmail.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200810080701.GA14160@koo-Z370-HD3
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
seq_puts is a lot cheaper than seq_printf, so use that to print
literal strings.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200810022158.9167-1-vulab@iscas.ac.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
It might be better to adjust the code in two places:
1. Determine whether grp is currupt or not should be placed first.
2. (cr<=2 && free <ac->ac_g_ex.fe_len)should may belong to the crx
strategy, and it may be more appropriate to put it in the
subsequent switch statement block. For cr1, cr2, the conditions
in switch potentially realize the above judgment. For cr0, we
should add (free <ac->ac_g_ex.fe_len) judgment, and then delete
(free / fragments) >= ac->ac_g_ex.fe_len), because cr0 returns
true by default.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/e20b2d8f-1154-adb7-3831-a9e11ba842e9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
These comments do not seem to be related to ext4_mb_check_limits(),
it may be invalid.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c49faf0c-d5d5-9c51-6911-9e0ff57c6bfa@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The 5.9 merge moved this function io_uring, which means that we don't
need to retain the generic nature of it. Clean up this part by removing
redundant checks, and just inlining the small remainder in
io_rw_should_retry().
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit f254ac04c8 ("io_uring: enable lookup of links holding inflight files")
only handled 2 out of the three head link cases we have, we also need to
lookup and cancel work that is blocked in io-wq if that work has a link
that's holding a reference to the files structure.
Put the "cancel head links that hold this request pending" logic into
io_attempt_cancel(), which will to through the motions of finding and
canceling head links that hold the current inflight files stable request
pending.
Cc: stable@vger.kernel.org
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If an NFSv2/v3 client breaks an NFSv4 client's delegation, it will hit a
NULL dereference in nfsd_breaker_owns_lease().
Easily reproduceable with for example
mount -overs=4.2 server:/export /mnt/
sleep 1h </mnt/file &
mount -overs=3 server:/export /mnt2/
touch /mnt2/file
Reported-by: Robert Dinse <nanook@eskimo.com>
Fixes: 28df3d1539 ("nfsd: clients don't need to break their own delegations")
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=208807
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl84aLkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgps72D/96/HCiUArhxmRltiwqB9KjemEPaY7nWDkz
0hrUtTCr/MjqFIzgPwx6pIjdiSP4GbmcMCrBO67E+mLbOdO1hKte2ElysRAsTlLE
fGrcdrs3is5+QK8aqJJks74NzM7XG6jfrR8ewV9cz6aJRWbgRjCWvSxx/03Iha2B
t897xuwJg4K30Z83IxPfnu+Xp0dIfmFLUuXQApsZ3bNTuQW3sR4CC/v418+1Wmk4
kXGbQtxcEsrhCy9OxnNyU6GPEJ3b99ANPbRE8OUNQwaHiejvMOCWpcmaoT6TwKeU
aku+XcVyoMjxJQk2k0uzr8Ecj5G1FJv4fUHhTZBxcGqkrxhkTjQ520HtrqPlc7uV
BjyXutZ8yjmeCbvGrsXu8f8ktjHHkntkRDA8hgzW1OpmWYuWKF/2OIjNpmcmtvbj
XqwBDEKdQW9X4dHoQKsVExtzeT6nNP0dxaeZX8OeB2GGkitP7rCm8k/SOuDPTCLi
MX/qWpERo4hRfCLjY+4nezxkFMLIF7Jej3tzwuVRshFYsRVQzTPQpbnkmkuwibhi
ObEwVI+lLkbatnR2wmJwoVKcywH13U68VNJXACyw1GZnPlp2lYylT/MV80y/iELE
mj4zDklqwfIrnoHEuCkERwgXrYsffhUrFvajmAMnJncUFOI4khYJ+dWBVVlBkyn0
e7UK1Sd7Jw==
=T+fx
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A few differerent things in here.
Seems like syzbot got some more io_uring bits wired up, and we got a
handful of reports and the associated fixes are in here.
General fixes too, and a lot of them marked for stable.
Lastly, a bit of fallout from the async buffered reads, where we now
more easily trigger short reads. Some applications don't really like
that, so the io_read() code now handles short reads internally, and
got a cleanup along the way so that it's now easier to read (and
documented). We're now passing tests that failed before"
* tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block:
io_uring: short circuit -EAGAIN for blocking read attempt
io_uring: sanitize double poll handling
io_uring: internally retry short reads
io_uring: retain iov_iter state over io_read/io_write calls
task_work: only grab task signal lock when needed
io_uring: enable lookup of links holding inflight files
io_uring: fail poll arm on queue proc failure
io_uring: hold 'ctx' reference around task_work queue + execute
fs: RWF_NOWAIT should imply IOCB_NOIO
io_uring: defer file table grabbing request cleanup for locked requests
io_uring: add missing REQ_F_COMP_LOCKED for nested requests
io_uring: fix recursive completion locking on oveflow flush
io_uring: use TWA_SIGNAL for task_work uncondtionally
io_uring: account locked memory before potential error case
io_uring: set ctx sq/cq entry count earlier
io_uring: Fix NULL pointer dereference in loop_rw_iter()
io_uring: add comments on how the async buffered read retry works
io_uring: io_async_buf_func() need not test page bit
One case was missed in the short IO retry handling, and that's hitting
-EAGAIN on a blocking attempt read (eg from io-wq context). This is a
problem on sockets that are marked as non-blocking when created, they
don't carry any REQ_F_NOWAIT information to help us terminate them
instead of perpetually retrying.
Fixes: 227c0c9673 ("io_uring: internally retry short reads")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's a bit of confusion on the matching pairs of poll vs double poll,
depending on if the request is a pure poll (IORING_OP_POLL_ADD) or
poll driven retry.
Add io_poll_get_double() that returns the double poll waitqueue, if any,
and io_poll_get_single() that returns the original poll waitqueue. With
that, remove the argument to io_poll_remove_double().
Finally ensure that wait->private is cleared once the double poll handler
has run, so that remove knows it's already been seen.
Cc: stable@vger.kernel.org # v5.8
Reported-by: syzbot+7f617d4a9369028b8a2c@syzkaller.appspotmail.com
Fixes: 18bceab101 ("io_uring: allow POLL_ADD with double poll_wait() users")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- some code cleanup
- a couple of static analysis fixes
- setattr: try to pick a fid associated with the file rather than the
dentry, which might sometimes matter
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl83dvkACgkQq06b7GqY
5nDk4g/+KNpbIFHhDyhCzmZ1WbaUYlZseFm8VBR9dzwOkXCuVl+WQd9ducYlEifT
cZ29xmEuom01LhFOs6VzC+IODQq3h1WvEbVRfCSlLEWvfHyPODSDY8mrn+1TdOZL
UOOP1wUjaDqmAcBQ4PSDNmj2cONNWPqkgnuyswjsBpabOY/U8FxG9J/QiPfC12zs
QoF5800OiuLYS5SefI9kKJ47DkTLpm/aq7/Z1liAxJBzHrguORjg16cssXhWeysD
jEMD3pIrsqMDX/vNkAO0DaR8FWvxTFZR1Bza1ylJiUy3K3rLXKfo+42jiCnuo5e9
uhA012RfiCZB3kIsLe5hCPXUWz0MhizaJAPkceHsgUtkuiU2qXyY05HEvS2HHl0s
inqR8gRkZ2NhROx1R/0q+Wu2posCBderDwg7Mkzt/7j5Ue3cTIKvbeBxH9iLLaK+
rHp0NbfP6wYN5qKjH1rru1IOmaUCV2HHV6x/alnwaVBfZqg5HYKkzlS7M+5hAbtb
niLFXXCtPBGqS+u9w7sruj6utuiJjXBfB1ejaawY+frS41+v3NkHjjUWCH67u8I9
u3pmPpvm5o7gRoRmvmxEdSb+jAYZwLTcbkUhRMHLQOHCBZXtcul41p2BUpoBm0zw
42mlMMMBAjaxtzLlryvACg5lx6x7ykE4pExKAc0dzrzWfgzwEZQ=
=hwGq
-----END PGP SIGNATURE-----
Merge tag '9p-for-5.9-rc1' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
- some code cleanup
- a couple of static analysis fixes
- setattr: try to pick a fid associated with the file rather than the
dentry, which might sometimes matter
* tag '9p-for-5.9-rc1' of git://github.com/martinetd/linux:
9p: Remove unneeded cast from memory allocation
9p: remove unused code in 9p
net/9p: Fix sparse endian warning in trans_fd.c
9p: Fix memory leak in v9fs_mount
9p: retrieve fid from file when file instance exist.
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl83ch8ACgkQiiy9cAdy
T1ExdAwAqHLP2o4lADcHsACPWxswMwuSwMpNQhSOV4bnRaVGkh3EGbvFZ8dmop1u
XrEyppM8TlbMUzIW6OO/Rntbl91Ll4Mi7d+15vdBgwOSKS2dJ+tkbV1iWQoJffUv
cxfPSLlnAe8dFaitRky3HWcYi4+vwY/TksaaQpdbVFVGJpvdKG8zu/wYmJEZ57TD
0Hjy4BhbGH1iu6X7mutslRCJlx1kOw5jyvqsECXqhF80+tKze4QlC5dZ6hjMPb+6
C9sp03/KMe84/1GNDbsTxc0Vjd0K0+1E+lgDl2N7XVC6HNTg47Smzm7X+SgOc7J9
rF2keAadbvD+prNx52nojI27y4JW99mGwuSpjZBIFJrJ6hq2ukbH7DBOQXSJ7Ovt
+0OgreUDlZS2FuLaVAIQak5lQbQ34xd/zIrQ1bQHefkmGuAE+2Dm86GIP4/WUSlY
RTNzvUrCq0EnVQncMlG0pPiZg6Yiz3ie3nSI9MohYl+niqhpsd3M/uFh68CNnSgV
aZovWPp4
=FHq0
-----END PGP SIGNATURE-----
Merge tag '5.9-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small cifs/smb3 fixes, one for stable fixing mkdir path with
the 'idsfromsid' mount option"
* tag '5.9-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
SMB3: Fix mkdir when idsfromsid configured on mount
cifs: Convert to use the fallthrough macro
cifs: Fix an error pointer dereference in cifs_mount()
Highlights include:
Stable fixes:
- pNFS: Don't return layout segments that are being used for I/O
- pNFS: Don't move layout segments off the active list when being used for I/O
Features:
- NFS: Add support for user xattrs through the NFSv4.2 protocol
- NFS: Allow applications to speed up readdir+statx() using AT_STATX_DONT_SYNC
- NFSv4.0 allow nconnect for v4.0
Bugfixes and cleanups:
- nfs: ensure correct writeback errors are returned on close()
- nfs: nfs_file_write() should check for writeback errors
- nfs: Fix getxattr kernel panic and memory overflow
- NFS: Fix the pNFS/flexfiles mirrored read failover code
- SUNRPC: dont update timeout value on connection reset
- freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
- sunrpc: destroy rpc_inode_cachep after unregister_filesystem
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl83CYEACgkQZwvnipYK
APLx4g/9EQNTG5VkUToElRHs4b3vFxbmh9Odnk/JwPHxY5GQ8/AyGqwWHBgMZc7/
2AV/C83pk7pJsDNsKbVAaFCT1cwjmItHM63vKJYGBYbE8LhkZ/1sYkdtPBYwHoVl
7CWfpVY/4NjYw5GJfrVA5Y0m7lrQInRtIMzfaENw2tpw+/cKUpadxgEJltzFNvpa
Ploinr1ZRBl1tvfeHNRP5ZMPk2AfgGWtQKQ/b2UWUk5tXALoQm2Eu+/oku39uqhy
hW5tCbU2BzR91gg5JwF9n7VowkCHXfe7lMzDBTVfwZOELOmoyys/1wKv550FWcWi
yymljWiPGZOnXGT1vfKptPESQjdtElMfanvEZ0BzS+yNR0HZGnIupaxGlYlG9ZGU
2sXHQPp3mk2Q+L1IgbTSCnSju0YlZo32JQpYCZiROjIXnPWPQ50YNhr8GL18M1FW
hTeShg2avWH+59GB6moEBmsuvui7Dy1jkimblToLEoGJ4kbvEl72FYSqTCkAXXbB
rVUzhmJFgfk/EOS4d+QKJoBqNzw3aw79wyT7PLkoCYBqPZBHQexlmI+ktMbgUEdw
c/fM7l5/Vcb9weIHKzul2Jbk5q6bFME/xPnIkr3v/oKIFlLFwQ04BX6R/42AMJHw
V5Q9Wp81Vy6RXQMn8P0ZMeY0WQC/rhpijOEVMUC+Ni+spz44AdM=
=k4pE
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Stable fixes:
- pNFS: Don't return layout segments that are being used for I/O
- pNFS: Don't move layout segments off the active list when being used for I/O
Features:
- NFS: Add support for user xattrs through the NFSv4.2 protocol
- NFS: Allow applications to speed up readdir+statx() using AT_STATX_DONT_SYNC
- NFSv4.0 allow nconnect for v4.0
Bugfixes and cleanups:
- nfs: ensure correct writeback errors are returned on close()
- nfs: nfs_file_write() should check for writeback errors
- nfs: Fix getxattr kernel panic and memory overflow
- NFS: Fix the pNFS/flexfiles mirrored read failover code
- SUNRPC: dont update timeout value on connection reset
- freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
- sunrpc: destroy rpc_inode_cachep after unregister_filesystem"
* tag 'nfs-for-5.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (32 commits)
NFS: Fix flexfiles read failover
fs: nfs: delete repeated words in comments
rpc_pipefs: convert comma to semicolon
nfs: Fix getxattr kernel panic and memory overflow
NFS: Don't return layout segments that are in use
NFS: Don't move layouts to plh_return_segs list while in use
NFS: Add layout segment info to pnfs read/write/commit tracepoints
NFS: Add tracepoints for layouterror and layoutstats.
NFS: Report the stateid + status in trace_nfs4_layoutreturn_on_close()
SUNRPC dont update timeout value on connection reset
nfs: nfs_file_write() should check for writeback errors
nfs: ensure correct writeback errors are returned on close()
NFSv4.2: xattr cache: get rid of cache discard work queue
NFS: remove redundant initialization of variable result
NFSv4.0 allow nconnect for v4.0
freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
sunrpc: destroy rpc_inode_cachep after unregister_filesystem
NFSv4.2: add client side xattr caching.
NFSv4.2: hook in the user extended attribute handlers
NFSv4.2: add the extended attribute proc functions.
...
Drop duplicated words {the, at} in comments.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Ian Kent <raven@themaw.net>
Link: http://lkml.kernel.org/r/20200811021817.24982-1-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Fix S_ISDIR execve() errno".
Fix an errno change for execve() of directories, noticed by Marc Zyngier.
Along with the fix, include a regression test to avoid seeing this return
in the future.
This patch (of 2):
The return code for attempting to execute a directory has always been
EACCES. Adjust the S_ISDIR exec test to reflect the old errno instead of
the general EISDIR for other kinds of "open" attempts on directories.
Fixes: 633fb6ac39 ("exec: move S_ISREG() check earlier")
Reported-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Greg Kroah-Hartman <gregkh@android.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@google.com>
Link: http://lkml.kernel.org/r/20200813231723.2725102-2-keescook@chromium.org
Link: https://lore.kernel.org/lkml/20200813151305.6191993b@why
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mkdir uses a compounded create operation which was not setting
the security descriptor on create of a directory. Fix so
mkdir now sets the mode and owner info properly when idsfromsid
and modefromsid are configured on the mount.
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org> # v5.8
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
We've had a few application cases of not handling short reads properly,
and it is understandable as short reads aren't really expected if the
application isn't doing non-blocking IO.
Now that we retain the iov_iter over retries, we can implement internal
retry pretty trivially. This ensures that we don't return a short read,
even for buffered reads on page cache conflicts.
Cleanup the deep nesting and hard to read nature of io_read() as well,
it's much more straight forward now to read and understand. Added a
few comments explaining the logic as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of maintaining (and setting/remembering) iov_iter size and
segment counts, just put the iov_iter in the async part of the IO
structure.
This is mostly a preparation patch for doing appropriate internal retries
for short reads, but it also cleans up the state handling nicely and
simplifies it quite a bit.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl81Q0wACgkQxWXV+ddt
WDtbqw/+NeFlvQzsCeQV9PX7RjYf9MFEQIThxo33xDl+ersgcOD8MuPa/hY1hoO0
gOn2eRPcVe/RIPBezRbxX9bnqlfW6N0VnBNLJHypMapB2hR6WFcFt7CAMoXKRmHV
RDM37pA2TNULr8XYrJ0+J5Vy1NWp5HdKzEV6bXfsOSzMSdAVMheXNec93suLEB/g
9QGXX6kaaq0Hcpy7tQQBtm2lbVj8/M3LOUAmYOB/JNCPtsJEB/2EO2b63TB4s2cW
0lpiPehW2m/Pv5GjqQM+iN5fbt9yhKB6lqEEgoHZPgI2tLFyh5WlTWKET7uxqj7G
YBzZjiq1WREEl9KWLYZuthcXPLX2XgJ4gLSlckygi1e4MpPlJ4pa30Bj9OyIEIjP
FOeR0lelRYcjmZrQW4Kana0qq8K0JJzvo2dSqaJBGF9CaveN3BAGQ9ttNhgIIpS5
4kBKlv2SCJ9Anhn8la6bFwlfuR2ggMhDShxIGBQpA1OKf0oJyi2dtavSIbuXwFbd
6KA37cyp4cDK9ycmTN5YxZSndzZSqUEh5Wt4gLk32NeIxhyCX4aTvjQj5KqM1MNw
N/WrTJQ27D6jfi+PBRBmT7U6qEujySXUimJRFTJzk+Px8Q/QMzGAFPCqz6iXv3u3
lX1Ywha9iQ0g2IZVoaq1ZjDDp4xOqIakAjaXez3dFhu3Mq3Kc70=
=7b8U
-----END PGP SIGNATURE-----
Merge tag 'for-5.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs updates from David Sterba:
"One minor update, the rest are fixes that have arrived a bit late for
the first batch. There are also some recent fixes for bugs that were
discovered during the merge window and pop up during testing.
User visible change:
- show correct subvolume path in /proc/mounts for bind mounts
Fixes:
- fix compression messages when remounting with different level or
compression algorithm
- tree-log: fix some memory leaks on error handling paths
- restore I_VERSION on remount
- fix return values and error code mixups
- fix umount crash with quotas enabled when removing sysfs files
- fix trim range on a shrunk device"
* tag 'for-5.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: trim: fix underflow in trim length to prevent access beyond device boundary
btrfs: fix return value mixup in btrfs_get_extent
btrfs: sysfs: fix NULL pointer dereference at btrfs_sysfs_del_qgroups()
btrfs: check correct variable after allocation in btrfs_backref_iter_alloc
btrfs: make sure SB_I_VERSION doesn't get unset by remount
btrfs: fix memory leaks after failure to lookup checksums during inode logging
btrfs: don't show full path of bind mounts in subvol=
btrfs: fix messages after changing compression level by remount
btrfs: only search for left_info if there is no right_info in try_merge_free_space
btrfs: inode: fix NULL pointer dereference if inode doesn't need compression
- Fix duplicated words in comments.
- Fix an ubsan complaint about null pointer arithmetic.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl8toskACgkQ+H93GTRK
tOuu5A/9FBW8sYn9I4op5WXJy2q+R1JBa6P80Y2Aoz6jfE0b2yW1TZeQ4n1GElhY
WsAFAEq7pUFH0p1YbYDNXCcENdzI+taWA7G40JFJdO8MeM+WNcctu860kjYU+RM+
Ue3nPM/Oh1qaBEkdThxwyZ9mBBIdVKMIwGJ5LutPxx8mSwIPS3ibwAN83uKeu4NE
IOjJ6hm50ox1XmowkD64TYrSf1UlxBx8Yw21l/gdzvxkmSRwKVQLY/gEsXgvdjV7
wsQBSNN0ihYDXJ4aDlMGwvGIOgit3ig2oqERHYK6J2QKDS3YrlmFmBLKHwb+7ZOd
i+P/P1Op8OHFCBVk1S6QhhBEBEqFggsgzybAo0v6FIZXCEMgTgg1m0DfFlp6LK6T
6D82LENEQpEq+rKFNaP2r5c6PbwheWSMhHjWCBY6ciJsuZ3O8/tYV4FSR7+1YM2c
wUupGaYfYdMUe0AL1xvItCX+wmL8HBAoq7QHw0VSRqrL2uZovBpSoi7vP4c5iUZK
ZyzQc4ZL1Yt+1g8NT06dd+xJfLe4M/sbYOzQmoYj+qT68qd9nZSH2t055G/gI+kq
DhZXhrfY8NFZWtuCLZ+wf+1tysgeBxyD6S9fh69GtlBW1ZKL+4dnP2XrXUhe/6ES
L2SyXvmjeYhdE4EBBm0zi4V4J6eGSJYblA0/cVHQy9j7SZwI79w=
=GBnh
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Two small fixes that have come in during the past week:
- Fix duplicated words in comments
- Fix an ubsan complaint about null pointer arithmetic"
* tag 'xfs-5.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_init
xfs: delete duplicated words + other fixes
- Don't clear MediaFailure and VolumeDirty bit in volume flags
if these were already set before mounting.
- Write multiple dirty buffers at once in sync mode.
- Remove unneeded EXFAT_SB_DIRTY bit set.
-----BEGIN PGP SIGNATURE-----
iQJMBAABCgA2FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAl8zgbYYHG5hbWphZS5q
ZW9uQHNhbXN1bmcuY29tAAoJEGcL+wNRRCEI8TkP/ROoQRwZSFM9NX0jtI5h6Mgs
MX7oLtMATKRewQkHtmJ4ZeFrDZhwrzZsvXifrxlt4XEQ/c2Jnl9x08lVCdc3NXaB
DM1hRwdLomztzBsLhmTSFvBczOMdY8+FVP6tOVgzCFKIs86pGBj2dekM9oYDw4jW
hDu5yfFtTB81CELuKZzSOwes3IRAtoK7RXGHHw/IXBzTdNT6LuShWgPU5cey99fB
ce9weKK8JrY7529h5wfTwfJN7Nh/G5+xtmbzjKSrkmb1XPy82KKYR8/SDIG03VNG
v1JL8VmI4rHUh8AmvwClt50GmeLMbrkko753xOuKOtcUorEYopXXbR2S7ApVzXPv
e6bpuWeV3kDPFhYsgtJQFgML9oZ+AiahHv2WTx0h5TX2cv0DuG4zBI3HhM/yRr0D
HDE3FO7qu9TT5OLQoLObIOwYnOu3uPtyHcas9UVM7ypyYlOTqWN3eXonBCYgoc7B
EicfRy/ayAmbxdq/DmMpzZ4ql/cxljjs/JGQdoURIDGNJvae68Y/SSnt4LIaV1Xu
sFS2b5shkASBMZTQe1FEr9FA07TXWlCMxcpcpLa6+iwIsQ2xWbKqrg/MwU7HiRsd
pNE87lS1JknF+LgXRt5+99fPFwnX0Oxh9Nedy8yvHl/Ozntzcsismj4ghXkDn44w
98ZulbX/WwTUqh2VdnpV
=85Zl
-----END PGP SIGNATURE-----
Merge tag 'exfat-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat updates from Namjae Jeon:
- don't clear MediaFailure and VolumeDirty bit in volume flags if these
were already set before mounting
- write multiple dirty buffers at once in sync mode
- remove unneeded EXFAT_SB_DIRTY bit set
* tag 'exfat-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: retain 'VolumeFlags' properly
exfat: optimize exfat_zeroed_cluster()
exfat: add error check when updating dir-entries
exfat: write multiple sectors at once
exfat: remove EXFAT_SB_DIRTY flag
When a process exits, we cancel whatever requests it has pending that
are referencing the file table. However, if a link is holding a
reference, then we cannot find it by simply looking at the inflight
list.
Enable checking of the poll and timeout list to find the link, and
cancel it appropriately.
Cc: stable@vger.kernel.org
Reported-by: Josef <josef.grieb@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
sent to all available MDSes once per second now. Other than that, we
have a lot of fixes and cleanups all around the filesystem, including
a tweak to cut down on MDS request resends in multi-MDS setups from
Yanhu and fixups for SELinux symlink labeling and MClientSession
message decoding from Jeff.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl80I3MTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi7rlB/4/tegnfsB+O6MQPt+DkSxjQ6+90J/O
aDj3jnwHoUBUitTP92wHdy/TanlZMX0UQqq4caWt1Ot0Rveb6XTfTIzEDtG2ejYQ
dcH6LB3dS9Qvjrwh7oRawx8Xux/RdIIA99OvAX7K+4IeLKP5PjmwfdM1Mm0agfoS
V8qXm8cGyqF2lWown092Znr8s6OSsDEaUbjB2tlz5msWeK9rSFD5bDYOBdOTkBCU
5v5BIUwWWoU9YFMWDJgVNS5l2Scq5yTDEleZcCCJMovKNrhBi9DcwUDKrDQJ3+5v
UmfI5M/L99Ng7yAcPqbeOZ1HibuOrdmqAoR6lCiD+PPEPrf99rVLRinQ
=2NtI
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.9-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"Xiubo has completed his work on filesystem client metrics, they are
sent to all available MDSes once per second now.
Other than that, we have a lot of fixes and cleanups all around the
filesystem, including a tweak to cut down on MDS request resends in
multi-MDS setups from Yanhu and fixups for SELinux symlink labeling
and MClientSession message decoding from Jeff"
* tag 'ceph-for-5.9-rc1' of git://github.com/ceph/ceph-client: (22 commits)
ceph: handle zero-length feature mask in session messages
ceph: use frag's MDS in either mode
ceph: move sb->wb_pagevec_pool to be a global mempool
ceph: set sec_context xattr on symlink creation
ceph: remove redundant initialization of variable mds
ceph: fix use-after-free for fsc->mdsc
ceph: remove unused variables in ceph_mdsmap_decode()
ceph: delete repeated words in fs/ceph/
ceph: send client provided metric flags in client metadata
ceph: periodically send perf metrics to MDSes
ceph: check the sesion state and return false in case it is closed
libceph: replace HTTP links with HTTPS ones
ceph: remove unnecessary cast in kfree()
libceph: just have osd_req_op_init() return a pointer
ceph: do not access the kiocb after aio requests
ceph: clean up and optimize ceph_check_delayed_caps()
ceph: fix potential mdsc use-after-free crash
ceph: switch to WARN_ON_ONCE in encode_supported_features()
ceph: add global total_caps to count the mdsc's total caps number
ceph: add check_session_state() helper and make it global
...
Merge more updates from Andrew Morton:
- most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction,
mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util,
memory-hotplug, cleanups, uaccess, migration, gup, pagemap),
- various other subsystems (alpha, misc, sparse, bitmap, lib, bitops,
checkpatch, autofs, minix, nilfs, ufs, fat, signals, kmod, coredump,
exec, kdump, rapidio, panic, kcov, kgdb, ipc).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits)
mm/gup: remove task_struct pointer for all gup code
mm: clean up the last pieces of page fault accountings
mm/xtensa: use general page fault accounting
mm/x86: use general page fault accounting
mm/sparc64: use general page fault accounting
mm/sparc32: use general page fault accounting
mm/sh: use general page fault accounting
mm/s390: use general page fault accounting
mm/riscv: use general page fault accounting
mm/powerpc: use general page fault accounting
mm/parisc: use general page fault accounting
mm/openrisc: use general page fault accounting
mm/nios2: use general page fault accounting
mm/nds32: use general page fault accounting
mm/mips: use general page fault accounting
mm/microblaze: use general page fault accounting
mm/m68k: use general page fault accounting
mm/ia64: use general page fault accounting
mm/hexagon: use general page fault accounting
mm/csky: use general page fault accounting
...
After the cleanup of page fault accounting, gup does not need to pass
task_struct around any more. Remove that parameter in the whole gup
stack.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The path_noexec() check, like the regular file check, was happening too
late, letting LSMs see impossible execve()s. Check it earlier as well in
may_open() and collect the redundant fs/exec.c path_noexec() test under
the same robustness comment as the S_ISREG() check.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs path_noexec() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
security_file_open(f)
open()
/* old location of path_noexec() test */
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late. This was
fixed in commit 73601ea5b7 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.
Move the test into the other inode type checks (which already look for
other pathological conditions[1]). Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.
Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs S_ISREG() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
/* old location of FMODE_EXEC vs S_ISREG() test */
security_file_open(f)
open()
[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Relocate execve() sanity checks", v2.
While looking at the code paths for the proposed O_MAYEXEC flag, I saw
some things that looked like they should be fixed up.
exec: Change uselib(2) IS_SREG() failure to EACCES
This just regularizes the return code on uselib(2).
exec: Move S_ISREG() check earlier
This moves the S_ISREG() check even earlier than it was already.
exec: Move path_noexec() check earlier
This adds the path_noexec() check to the same place as the
S_ISREG() check.
This patch (of 3):
Change uselib(2)' S_ISREG() error return to EACCES instead of EINVAL so
the behavior matches execve(2), and the seemingly documented value. The
"not a regular file" failure mode of execve(2) is explicitly
documented[1], but it is not mentioned in uselib(2)[2] which does,
however, say that open(2) and mmap(2) errors may apply. The documentation
for open(2) does not include a "not a regular file" error[3], but mmap(2)
does[4], and it is EACCES.
[1] http://man7.org/linux/man-pages/man2/execve.2.html#ERRORS
[2] http://man7.org/linux/man-pages/man2/uselib.2.html#ERRORS
[3] http://man7.org/linux/man-pages/man2/open.2.html#ERRORS
[4] http://man7.org/linux/man-pages/man2/mmap.2.html#ERRORS
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-1-keescook@chromium.org
Link: http://lkml.kernel.org/r/20200605160013.3954297-2-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The document reads "%e" should be "executable filename" while actually it
could be changed by things like pr_ctl PR_SET_NAME. People who uses "%e"
in core_pattern get surprised when they find out they get thread name
instead of executable filename.
This is either a bug of document or a bug of code. Since the behavior of
"%e" is there for long time, it could bring another surprise for users if
we "fix" the code.
So we just "fix" the document. And more, for users who really need the
"executable filename" in core_pattern, we introduce a new "%f" for the
real executable filename. We already have "%E" for executable path in
kernel, so just reuse most of its code for the new added "%f" format.
Signed-off-by: Lepton Wu <ytht.net@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200701031432.2978761-1-ytht.net@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel signalfd4() syscall returns different error codes when called
either in compat or native mode. This behaviour makes correct emulation
in qemu and testing programs like LTP more complicated.
Fix the code to always return -in both modes- EFAULT for unaccessible user
memory, and EINVAL when called with an invalid signal mask.
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Laurent Vivier <laurent@vivier.eu>
Link: http://lkml.kernel.org/r/20200530100707.GA10159@ls3530.fritz.box
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If data clusters == 0, fat_ra_init() calls the ->ent_blocknr() for the
cluster beyond ->max_clusters.
This checks the limit before initialization to suppress the warning.
Reported-by: syzbot+756199124937b31a9b7e@syzkaller.appspotmail.com
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/87mu462sv4.fsf@mail.parknet.co.jp
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `xmlns`:
For each link, `http://[^# ]*(?:\w|/)`:
If neither `gnu\.org/license`, nor `mozilla\.org/MPL`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Link: http://lkml.kernel.org/r/20200708200409.22293-1-grandmaster@al2klimov.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no need to hold write_lock in fat_ioctl_get_attributes.
write_lock may make an impact on concurrency of fat_ioctl_get_attributes.
Signed-off-by: Yubo Feng <fengyubo3@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Link: http://lkml.kernel.org/r/1593308053-12702-1-git-send-email-fengyubo3@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The 64 bit ino is being compared to the product of two u32 values,
however, the multiplication is being performed using a 32 bit multiply so
there is a potential of an overflow. To be fully safe, cast uspi->s_ncg
to a u64 to ensure a 64 bit multiplication occurs to avoid any chance of
overflow.
Fixes: f3e2a520f5 ("ufs: NFS support")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Link: http://lkml.kernel.org/r/20200715170355.1081713-1-colin.king@canonical.com
Addresses-Coverity: ("Unintentional integer overflow")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add macros for nilfs_<level>(sb, fmt, ...) and convert the uses of
'nilfs_msg(sb, KERN_<LEVEL>, ...)' to 'nilfs_<level>(sb, ...)' so nilfs2
uses a logging style more like the typical kernel logging style.
Miscellanea:
o Realign arguments for these uses
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1595860111-3920-4-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reduce object size a bit by removing the KERN_<LEVEL> as a separate
argument and adding it to the format string.
Reduce overall object size by about ~.5% (x86-64 defconfig w/ nilfs2)
old:
$ size -t fs/nilfs2/built-in.a | tail -1
191738 8676 44 200458 30f0a (TOTALS)
new:
$ size -t fs/nilfs2/built-in.a | tail -1
190971 8676 44 199691 30c0b (TOTALS)
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1595860111-3920-3-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "nilfs2 updates".
This patch (of 3):
unlock_new_inode() is only meant to be called after a new inode has
already been inserted into the hash table. But nilfs_new_inode() can call
it even before it has inserted the inode, triggering the WARNING in
unlock_new_inode(). Fix this by only calling unlock_new_inode() if the
inode has the I_NEW flag set, indicating that it's in the table.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1595860111-3920-1-git-send-email-konishi.ryusuke@gmail.com
Link: http://lkml.kernel.org/r/1595860111-3920-2-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When truncating a file to a size within the last allowed logical block,
block_to_path() is called with the *next* block. This exceeds the limit,
causing the "block %ld too big" error message to be printed.
This case isn't actually an error; there are just no more blocks past that
point. So, remove this error message.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Qiujun Huang <anenbupt@gmail.com>
Link: http://lkml.kernel.org/r/20200628060846.682158-7-ebiggers@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The minix filesystem reads its maximum file size from its on-disk
superblock. This value isn't necessarily a multiple of the block size.
When it's not, the V1 block mapping code doesn't allow mapping the last
possible block. Commit 6ed6a722f9 ("minixfs: fix block limit check")
fixed this in the V2 mapping code. Fix it in the V1 mapping code too.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Qiujun Huang <anenbupt@gmail.com>
Link: http://lkml.kernel.org/r/20200628060846.682158-6-ebiggers@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The minix filesystem leaves super_block::s_maxbytes at MAX_NON_LFS rather
than setting it to the actual filesystem-specific limit. This is broken
because it means userspace doesn't see the standard behavior like getting
EFBIG and SIGXFSZ when exceeding the maximum file size.
Fix this by setting s_maxbytes correctly.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Qiujun Huang <anenbupt@gmail.com>
Link: http://lkml.kernel.org/r/20200628060846.682158-5-ebiggers@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the minix filesystem tries to map a very large logical block number to
its on-disk location, block_to_path() can return offsets that are too
large, causing out-of-bounds memory accesses when accessing indirect index
blocks. This should be prevented by the check against the maximum file
size, but this doesn't work because the maximum file size is read directly
from the on-disk superblock and isn't validated itself.
Fix this by validating the maximum file size at mount time.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot+c7d9ec7a1a7272dd71b3@syzkaller.appspotmail.com
Reported-by: syzbot+3b7b03a0c28948054fb5@syzkaller.appspotmail.com
Reported-by: syzbot+6e056ee473568865f3e6@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Qiujun Huang <anenbupt@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200628060846.682158-4-ebiggers@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "fs/minix: fix syzbot bugs and set s_maxbytes".
This series fixes all syzbot bugs in the minix filesystem:
KASAN: null-ptr-deref Write in get_block
KASAN: use-after-free Write in get_block
KASAN: use-after-free Read in get_block
WARNING in inc_nlink
KMSAN: uninit-value in get_block
WARNING in drop_nlink
It also fixes the minix filesystem to set s_maxbytes correctly, so that
userspace sees the correct behavior when exceeding the max file size.
This patch (of 6):
sb_getblk() can fail, so check its return value.
This fixes a NULL pointer dereference.
Originally from Qiujun Huang.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot+4a88b2b9dc280f47baf4@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qiujun Huang <anenbupt@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200628060846.682158-1-ebiggers@kernel.org
Link: http://lkml.kernel.org/r/20200628060846.682158-2-ebiggers@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both exec and exit want to ensure that the uaccess routines actually do
access user pointers. Use the newly added force_uaccess_begin helper
instead of an open coded set_fs for that to prepare for kernel builds
where set_fs() does not exist.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/20200710135706.537715-7-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syzbot found issues with having hugetlbfs on a union/overlay as reported
in [1]. Due to the limitations (no write) and special functionality of
hugetlbfs, it does not work well in filesystem stacking. There are no
know use cases for hugetlbfs stacking. Rather than making modifications
to get hugetlbfs working in such environments, simply prevent stacking.
[1] https://lore.kernel.org/linux-mm/000000000000b4684e05a2968ca6@google.com/
Reported-by: syzbot+d6ec23007e951dadf3de@syzkaller.appspotmail.com
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Colin Walters <walters@verbum.org>
Link: http://lkml.kernel.org/r/80f869aa-810d-ef6c-8888-b46cee135907@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.
Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
[7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
[7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
[7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
[7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
[7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
[7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
[7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
[7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
[7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
[7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
[7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
[7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
[7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
[7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
[7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
[7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
[7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
[7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
[7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
[7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page. That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1. Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value. As a result, the first scanned
process will be killed.
The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.
To fix this issue, we should make the calculation of oom point more
accurate. We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.
[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The keys in smaps output are padded to fixed width with spaces. All
except for THPeligible that uses tabs (only since commit c06306696f
("mm: thp: fix false negative of shmem vma's THP eligibility")).
Unify the output formatting to save time debugging some naïve parsers.
(Part of the unification is also aligning FilePmdMapped with others.)
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200728083207.17531-1-mkoutny@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syzbot reported and bisected a use-after-free due to the recent init
cleanups.
The putname() should happen only after we'd *not* branched to retry,
same as it's done in do_unlinkat().
Reported-by: syzbot+bbeb1c88016c7db4aa24@syzkaller.appspotmail.com
Fixes: e24ab0ef68 "fs: push the getname from do_rmdir into the callers"
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current mirrored read failover code is correctly resetting the mirror
index between failed reads, however it is not able to actually flip the
RPC call over to the next RPC client.
The end result is that we keep resending the RPC call to the same client
over and over.
The fix is to use the pnfs_read_resend_pnfs() mechanism to schedule a
new RPC call, but we need to add the ability to pass in a mirror
index so that we always retry the next mirror in the list.
Fixes: 166bd5b889 ("pNFS/flexfiles: Fix layoutstats handling during read failovers")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Check the ipt.error value, it must have been either cleared to zero or
set to another error than the default -EINVAL if we don't go through the
waitqueue proc addition. Just give up on poll at that point and return
failure, this will fallback to async work.
io_poll_add() doesn't suffer from this failure case, as it returns the
error value directly.
Cc: stable@vger.kernel.org # v5.7+
Reported-by: syzbot+a730016dc0bdce4f6ff5@syzkaller.appspotmail.com
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drop duplicated words {the, and} in comments.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Replace a comma between expression statements by a semicolon.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Move the buffer size check to decode_attr_security_label() before memcpy()
Only call memcpy() if the buffer is large enough
Fixes: aa9c266962 ("NFS: Client implementation of Labeled-NFS")
Signed-off-by: Jeffrey Mitchell <jeffrey.mitchell@starlab.io>
[Trond: clean up duplicate test of label->len != 0]
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the NFS_LAYOUT_RETURN_REQUESTED flag is set, we want to return the
layout as soon as possible, meaning that the affected layout segments
should be marked as invalid, and should no longer be in use for I/O.
Fixes: f0b429819b ("pNFS: Ignore non-recalled layouts in pnfs_layout_need_return()")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the layout segment is still in use for a read or a write, we should
not move it to the layout plh_return_segs list. If we do, we can end
up returning the layout while I/O is still in progress.
Fixes: e0b7d420f7 ("pNFS: Don't discard layout segments that are marked for return")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
[BUG]
The following script can lead to tons of beyond device boundary access:
mkfs.btrfs -f $dev -b 10G
mount $dev $mnt
trimfs $mnt
btrfs filesystem resize 1:-1G $mnt
trimfs $mnt
[CAUSE]
Since commit 929be17a9b ("btrfs: Switch btrfs_trim_free_extents to
find_first_clear_extent_bit"), we try to avoid trimming ranges that's
already trimmed.
So we check device->alloc_state by finding the first range which doesn't
have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.
But if we shrunk the device, that bits are not cleared, thus we could
easily got a range starts beyond the shrunk device size.
This results the returned @start and @end are all beyond device size,
then we call "end = min(end, device->total_bytes -1);" making @end
smaller than device size.
Then finally we goes "len = end - start + 1", totally underflow the
result, and lead to the beyond-device-boundary access.
[FIX]
This patch will fix the problem in two ways:
- Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
This is the root fix
- Add extra safety check when trimming free device extents
We check and warn if the returned range is already beyond current
device.
Link: https://github.com/kdave/btrfs-progs/issues/282
Fixes: 929be17a9b ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix: Al Viro pointed out that I had broken some acl functionality
with one of my previous patches.
cleanup: Jing Xiangfeng found and removed a needless variable assignment.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEIGSFVdO6eop9nER2z0QOqevODb4FAl8y91kACgkQz0QOqevO
Db7YQQ/8CSNG8qJjcTc+bc/Z72LatJnlpiGFEhdQ8jTKOgt5rlaPo2pigKfEgWaF
lLbKDQozrwR2deQEtdP6a3TRskzk51WVVECg8VshTcmM3shZ/yN7lP02AG1AZS8k
JJmo70qSWw/QTHItfGkLjHm5qOAOPpQYc3xrTUXkRuj9jGdZgQmMN4Ode8VFsAcm
DWbCjO9YfwHzGOng2NA1T66Nw5cJg9Y0yz5Vd6kMTORjlSF5oxOms/lxoBS5UWzN
8FGScxZiGJ5VdlWETNpgZCFfTOWj39ad9oyckj62uJR6YDIClPOxXwGS5hXvf0dG
N29P/PJM620yMEaXbR4yNDP8yzsLd3Z4GwTwCBiqgwlj/LredCidGtHHPPOR8fRa
rzcSt4d74FTBWUQSUs77dIimg3S32N863KrisdK4ZWadYppLivbV0gmrt8RUODTm
WxtdOQdm8k7DIBV/ulCar+LIKImtT4dHIHUR+FSNd+mTev7HJSuyo7+77EigVZtF
LWXXhWULF1Y0l6AN5CqFKemsyXUEt/spqe+klWY8908sgpK66w+VCjMoYnRKIVoE
tBcXzIs9AsUdMyUtlEfryk86XuttirMujtel4duS+eaARbsCXQI7p4vn3/e8jxWW
UTwtm35JqXCBuwdWwrTFhOMyCCAbOuwdpPzhJhTE1yCzgdvsVqs=
=dmdA
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"A fix and a cleanup...
The fix: Al Viro pointed out that I had broken some acl functionality
with one of my previous patches.
And the cleanup: Jing Xiangfeng found and removed a needless variable
assignment"
* tag 'for-linus-5.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: remove unnecessary assignment to variable ret
orangefs: posix acl fix...
A single change for this cycle adding support for zone capacities
smaller than the zone size, from Johannes.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCXzJaHAAKCRDdoc3SxdoY
dmNYAQC6xbszMxQNQPeCKatMctxGPYpZUkXbvHpsTPYaKrgnIwD+MB8Qx7ybC+99
WWB2ab4oXCWs/b/dRkgUWx1U0HdibwQ=
=ynQl
-----END PGP SIGNATURE-----
Merge tag 'zonefs-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs update from Damien Le Moal:
"A single change for this cycle adding support for zone capacities
smaller than the zone size, from Johannes"
* tag 'zonefs-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: update documentation to reflect zone size vs capacity
zonefs: add zone-capacity support
MediaFailure and VolumeDirty should be retained if these are set before
mounting.
In '3.1.13.3 Media Failure Field' of exfat specification describe:
If, upon mounting a volume, the value of this field is 1,
implementations which scan the entire volume for media failures and
record all failures as "bad" clusters in the FAT (or otherwise resolve
media failures) may clear the value of this field to 0.
Therefore, We should not clear MediaFailure without scanning volume.
In '8.1 Recommended Write Ordering' of exfat specification describe:
Clear the value of the VolumeDirty field to 0, if its value prior to
the first step was 0.
Therefore, We should not clear VolumeDirty after mounting.
Also rename ERR_MEDIUM to MEDIA_FAILURE.
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Replace part of exfat_zeroed_cluster() with exfat_update_bhs().
And remove exfat_sync_bhs().
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Write multiple sectors at once when updating dir-entries.
Add exfat_update_bhs() for that. It wait for write completion once
instead of sector by sector.
It's only effective if sync enabled.
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
This flag is set/reset in exfat_put_super()/exfat_sync_fs()
to avoid sync_blockdev().
- exfat_put_super():
Before calling this, the VFS has already called sync_filesystem(),
so sync is never performed here.
- exfat_sync_fs():
After calling this, the VFS calls sync_blockdev(), so, it is meaningless
to check EXFAT_SB_DIRTY or to bypass sync_blockdev() here.
Remove the EXFAT_SB_DIRTY check to ensure synchronization.
And remove the code related to the flag.
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
IRQ bypass support for vdpa and IFC
MLX5 vdpa driver
Endian-ness fixes for virtio drivers
Misc other fixes
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAl8yVEwPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpNPEH/0Dtq1s1V4r/kxtLUoMophv9wuORpWCr98BQ
2aOveTmwTOVdZVOiw2tzTgO9nbWx+cL2HvkU7Aajfpz5hh93Z2VOo2n4a7hBC79f
rlc3GXiG+pMk5RfmqGofIHTU+D6ony4D5SXlUDurLdtEwunyuqZwABiWkZjdclZJ
bv90IL8Upzbz0rxYr7k3z8UepdOCt7r4QS/o7STHZBjJRyylxmO/R2yTnh6PtpRK
Q/z35wJBJ3SKc8X3Fi0VOOSeGNZOiypkkl9ZnLVY5lExNAU1+2MMn2UK119SlCDV
MSxb7quYFF4cksXH1g77GMBNi1uADRh1dtFMZdkKhZGljGxKLxo=
=6VTZ
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
- IRQ bypass support for vdpa and IFC
- MLX5 vdpa driver
- Endianness fixes for virtio drivers
- Misc other fixes
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (71 commits)
vdpa/mlx5: fix up endian-ness for mtu
vdpa: Fix pointer math bug in vdpasim_get_config()
vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config()
vdpa/mlx5: fix memory allocation failure checks
vdpa/mlx5: Fix uninitialised variable in core/mr.c
vdpa_sim: init iommu lock
virtio_config: fix up warnings on parisc
vdpa/mlx5: Add VDPA driver for supported mlx5 devices
vdpa/mlx5: Add shared memory registration code
vdpa/mlx5: Add support library for mlx5 VDPA implementation
vdpa/mlx5: Add hardware descriptive header file
vdpa: Modify get_vq_state() to return error code
net/vdpa: Use struct for set/get vq state
vdpa: remove hard coded virtq num
vdpasim: support batch updating
vhost-vdpa: support IOTLB batching hints
vhost-vdpa: support get/set backend features
vhost: generialize backend features setting/getting
vhost-vdpa: refine ioctl pre-processing
vDPA: dont change vq irq after DRIVER_OK
...
- Add 'Runtime Firmware Activation' support for NVDIMMs that advertise
the relevant capability
- Misc libnvdimm and DAX cleanups
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQT9vPEBxh63bwxRYEEPzq5USduLdgUCXzHodgAKCRAPzq5USduL
djTjAQD1THDmizHn16zd94ueygh/BXfN0zyeVvQH352ol7kdfQEAj2A7YJ9XBbBY
JC6/CNd+OiB9W88lLOUf3Waj1a7cUQ8=
=Q6qn
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updayes from Vishal Verma:
"You'd normally receive this pull request from Dan Williams, but he's
busy watching a newborn (Congrats Dan!), so I'm watching libnvdimm
this cycle.
This adds a new feature in libnvdimm - 'Runtime Firmware Activation',
and a few small cleanups and fixes in libnvdimm and DAX. I'd
originally intended to make separate topic-based pull requests - one
for libnvdimm, and one for DAX, but some of the DAX material fell out
since it wasn't quite ready.
Summary:
- add 'Runtime Firmware Activation' support for NVDIMMs that
advertise the relevant capability
- misc libnvdimm and DAX cleanups"
* tag 'libnvdimm-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm/security: ensure sysfs poll thread woke up and fetch updated attr
libnvdimm/security: the 'security' attr never show 'overwrite' state
libnvdimm/security: fix a typo
ACPI: NFIT: Fix ARS zero-sized allocation
dax: Fix incorrect argument passed to xas_set_err()
ACPI: NFIT: Add runtime firmware activate support
PM, libnvdimm: Add runtime firmware activation support
libnvdimm: Convert to DEVICE_ATTR_ADMIN_RO()
drivers/dax: Expand lock scope to cover the use of addresses
fs/dax: Remove unused size parameter
dax: print error message by pr_info() in __generic_fsdax_supported()
driver-core: Introduce DEVICE_ATTR_ADMIN_{RO,RW}
tools/testing/nvdimm: Emulate firmware activation commands
tools/testing/nvdimm: Prepare nfit_ctl_test() for ND_CMD_CALL emulation
tools/testing/nvdimm: Add command debug messages
tools/testing/nvdimm: Cleanup dimm index passing
ACPI: NFIT: Define runtime firmware activation commands
ACPI: NFIT: Move bus_dsm_mask out of generic nvdimm_bus_descriptor
libnvdimm: Validate command family indices
We're holding the request reference, but we need to go one higher
to ensure that the ctx remains valid after the request has finished.
If the ring is closed with pending task_work inflight, and the
given io_kiocb finishes sync during issue, then we need a reference
to the ring itself around the task_work execution cycle.
Cc: stable@vger.kernel.org # v5.7+
Reported-by: syzbot+9b260fc33297966f5a8e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
btrfs_get_extent() sets variable ret, but out: error path expect error
to be in variable err so the error code is lost.
Fixes: 6bf9e4bd6a ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Pavel Machek (CIP) <pavel@denx.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the zoned storage model, the sectors within a zone are typically all
writeable. With the introduction of the Zoned Namespace (ZNS) Command
Set in the NVM Express organization, the model was extended to have a
specific writeable capacity.
This zone capacity can be less than the overall zone size for a NVMe ZNS
device or null_blk in zoned-mode. For other ZBC/ZAC devices the zone
capacity is always equal to the zone size.
Use the zone capacity field instead from blk_zone for determining the
maximum inode size and inode blocks in zonefs.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
- Untangle the header spaghetti which causes build failures in various
situations caused by the lockdep additions to seqcount to validate that
the write side critical sections are non-preemptible.
- The seqcount associated lock debug addons which were blocked by the
above fallout.
seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict per
CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot
validate that the lock is held.
This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored and
write_seqcount_begin() has a lockdep assertion to validate that the
lock is held.
Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API is
unchanged and determines the type at compile time with the help of
_Generic which is possible now that the minimal GCC version has been
moved up.
Adding this lockdep coverage unearthed a handful of seqcount bugs which
have been addressed already independent of this.
While generaly useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemtible if the
writers are serialized by an associated lock, which leads to the well
known reader preempts writer livelock. RT prevents this by storing the
associated lock pointer independent of lockdep in the seqcount and
changing the reader side to block on the lock when a reader detects
that a writer is in the write side critical section.
- Conversion of seqcount usage sites to associated types and initializers.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8xmPYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoTuQEACyzQCjU8PgehPp9oMqWzaX2fcVyuZO
QU2yw6gmz2oTz3ZHUNwdW8UnzGh2OWosK3kDruoD9FtSS51lER1/ISfSPCGfyqxC
KTjOcB1Kvxwq/3LcCx7Zi3ZxWApat74qs3EhYhKtEiQ2Y9xv9rLq8VV1UWAwyxq0
eHpjlIJ6b6rbt+ARslaB7drnccOsdK+W/roNj4kfyt+gezjBfojGRdMGQNMFcpnv
shuTC+vYurAVIiVA/0IuizgHfwZiXOtVpjVoEWaxg6bBH6HNuYMYzdSa/YrlDkZs
n/aBI/Xkvx+Eacu8b1Zwmbzs5EnikUK/2dMqbzXKUZK61eV4hX5c2xrnr1yGWKTs
F/juh69Squ7X6VZyKVgJ9RIccVueqwR2EprXWgH3+RMice5kjnXH4zURp0GHALxa
DFPfB6fawcH3Ps87kcRFvjgm6FBo0hJ1AxmsW1dY4ACFB9azFa2euW+AARDzHOy2
VRsUdhL9CGwtPjXcZ/9Rhej6fZLGBXKr8uq5QiMuvttp4b6+j9FEfBgD4S6h8csl
AT2c2I9LcbWqyUM9P4S7zY/YgOZw88vHRuDH7tEBdIeoiHfrbSBU7EQ9jlAKq/59
f+Htu2Io281c005g7DEeuCYvpzSYnJnAitj5Lmp/kzk2Wn3utY1uIAVszqwf95Ul
81ppn2KlvzUK8g==
=7Gj+
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Thomas Gleixner:
"A set of locking fixes and updates:
- Untangle the header spaghetti which causes build failures in
various situations caused by the lockdep additions to seqcount to
validate that the write side critical sections are non-preemptible.
- The seqcount associated lock debug addons which were blocked by the
above fallout.
seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict
per CPU seqcounts. As the lock is not part of the seqcount, lockdep
cannot validate that the lock is held.
This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored
and write_seqcount_begin() has a lockdep assertion to validate that
the lock is held.
Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API
is unchanged and determines the type at compile time with the help
of _Generic which is possible now that the minimal GCC version has
been moved up.
Adding this lockdep coverage unearthed a handful of seqcount bugs
which have been addressed already independent of this.
While generally useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemtible if
the writers are serialized by an associated lock, which leads to
the well known reader preempts writer livelock. RT prevents this by
storing the associated lock pointer independent of lockdep in the
seqcount and changing the reader side to block on the lock when a
reader detects that a writer is in the write side critical section.
- Conversion of seqcount usage sites to associated types and
initializers"
* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
locking/seqlock, headers: Untangle the spaghetti monster
locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
x86/headers: Remove APIC headers from <asm/smp.h>
seqcount: More consistent seqprop names
seqcount: Compress SEQCNT_LOCKNAME_ZERO()
seqlock: Fold seqcount_LOCKNAME_init() definition
seqlock: Fold seqcount_LOCKNAME_t definition
seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
hrtimer: Use sequence counter with associated raw spinlock
kvm/eventfd: Use sequence counter with associated spinlock
userfaultfd: Use sequence counter with associated spinlock
NFSv4: Use sequence counter with associated spinlock
iocost: Use sequence counter with associated spinlock
raid5: Use sequence counter with associated spinlock
vfs: Use sequence counter with associated spinlock
timekeeping: Use sequence counter with associated raw spinlock
xfrm: policy: Use sequence counters with associated lock
netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
netfilter: conntrack: Use sequence counter with associated spinlock
sched: tasks: Use sequence counter with associated spinlock
...
In this round, we've added two small interfaces, 1) GC_URGENT_LOW mode for
performance, and 2) F2FS_IOC_SEC_TRIM_FILE ioctl for security. The new GC
mode allows Android to run some lower priority GCs in background, while new
ioctl discards user information without race condition when the account is
removed. In addition, some patches were merged to address latency-related
issues. We've fixed some compression-related bug fixes as well as edge race
conditions.
Enhancement:
- add GC_URGENT_LOW mode in gc_urgent
- introduce F2FS_IOC_SEC_TRIM_FILE ioctl
- bypass racy readahead to improve read latencies
- shrink node_write lock coverage to avoid long latency
Bug fix:
- fix missing compression flag control, i_size, and mount option
- fix deadlock between quota writes and checkpoint
- remove inode eviction path in synchronous path to avoid deadlock
- fix to wait GCed compressed page writeback
- fix a kernel panic in f2fs_is_compressed_page
- check page dirty status before writeback
- wait page writeback before update in node page write flow
- fix a race condition between f2fs_write_end_io and f2fs_del_fsync_node_entry
We've added some minor sanity checks and refactored trivial code blocks for
better readability and debugging information.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl8xhqMACgkQQBSofoJI
UNLYgA//WMoOqBACDuOWwYmgQ8oq4vrH2LOwssZF9/77vEfaHKc+TSq1il54lcUl
MPEx7FK54CnfT8VLLR5ByobZFyH9FFeAw4FN4LBhcfE8jh8ysAGjeoZjwfcmJF6R
cVKtn8ltUpgH3IEUuPjTiKkVNHfVJxuuL3zHbg1CEl+AkR6NJ/U9kNLwf7ZgPWq2
I0qwyXRlUIEChhyPZB+Y6RsdGjkeievKld56DMCgG73f4yHRO/yBcrfsN875sGdM
ALL+mYiunMT6aXcfoiQiAjeImoNajuflI6Zso2Sk8Vl6sBj0QwAuEBF5x1Z5e1mt
YVYNuC4ucqsDBKpOqtsPP0MFTC2T5Rr9wWXjqv+9TjN7zvJ8xx+zDWtQxvI2bpqB
4ZRxaJP45aThLYh/SEYDmj+ppyPtfLDeG0HzUkwMmuopf9eg+kxGPjBsZewgkCKg
kmMKU0P7deGlkrWLUcz2vm0Lso+ieGm0IeLOQl9/rOLu3IQQFia0Vla7dLDgqF0P
sz+udIiBztC3JPEmEZhfayA6P6e1TyWQUdquL08jp+DZD17gPqcaZDhHr62U5rmK
7zoiZqkR03SbNaFhBhhoVOaAVcAnF0pSIgqzkCa3dVXxp1QV+JfD9CGR9NFyiIqB
HK5RPFskIUCg0K2LSaAKbyoFWa/fJ8ZD8/CbFWcnXfWzoaaSkmc=
=PjaF
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've added two small interfaces: (a) GC_URGENT_LOW
mode for performance and (b) F2FS_IOC_SEC_TRIM_FILE ioctl for
security.
The new GC mode allows Android to run some lower priority GCs in
background, while new ioctl discards user information without race
condition when the account is removed.
In addition, some patches were merged to address latency-related
issues. We've fixed some compression-related bug fixes as well as edge
race conditions.
Enhancements:
- add GC_URGENT_LOW mode in gc_urgent
- introduce F2FS_IOC_SEC_TRIM_FILE ioctl
- bypass racy readahead to improve read latencies
- shrink node_write lock coverage to avoid long latency
Bug fixes:
- fix missing compression flag control, i_size, and mount option
- fix deadlock between quota writes and checkpoint
- remove inode eviction path in synchronous path to avoid deadlock
- fix to wait GCed compressed page writeback
- fix a kernel panic in f2fs_is_compressed_page
- check page dirty status before writeback
- wait page writeback before update in node page write flow
- fix a race condition between f2fs_write_end_io and f2fs_del_fsync_node_entry
We've added some minor sanity checks and refactored trivial code
blocks for better readability and debugging information"
* tag 'f2fs-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (52 commits)
f2fs: prepare a waiter before entering io_schedule
f2fs: update_sit_entry: Make the judgment condition of f2fs_bug_on more intuitive
f2fs: replace test_and_set/clear_bit() with set/clear_bit()
f2fs: make file immutable even if releasing zero compression block
f2fs: compress: disable compression mount option if compression is off
f2fs: compress: add sanity check during compressed cluster read
f2fs: use macro instead of f2fs verity version
f2fs: fix deadlock between quota writes and checkpoint
f2fs: correct comment of f2fs_exist_written_data
f2fs: compress: delay temp page allocation
f2fs: compress: fix to update isize when overwriting compressed file
f2fs: space related cleanup
f2fs: fix use-after-free issue
f2fs: Change the type of f2fs_flush_inline_data() to void
f2fs: add F2FS_IOC_SEC_TRIM_FILE ioctl
f2fs: should avoid inode eviction in synchronous path
f2fs: segment.h: delete a duplicated word
f2fs: compress: fix to avoid memory leak on cc->cpages
f2fs: use generic names for generic ioctls
f2fs: don't keep meta inode pages used for compressed block migration
...
- Make sure transactions won't be started recursively in gfs2_block_zero_range.
(Bug introduced in 5.4 when switching to iomap_zero_range.)
- Fix a glock holder refcount leak introduced in the iopen glock locking
scheme rework merged in 5.8.
- A few other small improvements (debugging, stack usage, comment fixes).
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl8xkmAUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZToB0w/9FANsxraTsNxxZUQuP2uyPiB/TRxY
Vutp63Pe+eE0GNxgpfLWT+QgO99BQlGZ8WmDJ49ex0dZXZY4lv4L7JoGTV5VwAyh
CFqRcWQkAb7zVvOxAeKfyXT0CwhS7O0avvlubSpEHfQgPkGQa3q3vu18vg1jHfB3
OsFjZGCoyiJKmNFX005euhebnStabaGeP0+8uSz/5xHS+NQUbB3jPlgycUMvTVBe
Lhl052rVQzlULNUCkehKnaBxzN0/8K55iFOaOSd2kzZ5BMfRabvskZEyJFf4T75p
VJyElekk5mGV0k/FpGL03Me1oqMddDLpAFCGh9Hw51o02i8wZ3RderfQHfUmojku
5/lLEcEJV+oC7gJR7IsGRme71De9y+uLnKvod+ayBw+9us1ZbEm4zJY7hLLGyq03
piuFEAo0UGUmxGn8/s1RBT0lMYKjEDGIGjockaz/XzEQMSip27JZq4ATnA5CHaSy
8q4PFflxEaXWJtPEjiY7DW1xQhYW/3cEDfd7kjSBw/GUxk1ILXRMnzCOsjrYuWfH
ykH0gzVq7/uJln/otYa39RBdt/GQXJCzhvODeizF6B36seVX9oBBEc6TtohxREwt
aIpupz6iGQ6GXddg1Sq6p2w3xyW8e8bG4YislEyiCyxpdOjvcYdM6cVxF7UBgtgu
fwEbjgGO34pAO1E=
=+Nig
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Make sure transactions won't be started recursively in
gfs2_block_zero_range (bug introduced in 5.4 when switching to
iomap_zero_range)
- Fix a glock holder refcount leak introduced in the iopen glock
locking scheme rework merged in 5.8.
- A few other small improvements (debugging, stack usage, comment
fixes).
* tag 'gfs2-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: When gfs2_dirty_inode gets a glock error, dump the glock
gfs2: Never call gfs2_block_zero_range with an open transaction
gfs2: print details on transactions that aren't properly ended
gfs2: Fix inaccurate comment
fs: Fix typo in comment
gfs2: Fix refcount leak in gfs2_glock_poke
gfs2: Pass glock holder to gfs2_file_direct_{read,write}
gfs2: Add some flags missing from glock output
JFFS2:
- Fix for a corner case while mounting
- Fix for an use-after-free issue
UBI:
- Fix for a memory load while attaching
- Don't produce an anchor PEB with fastmap being disabled
UBIFS:
- Fix for orphan inode logic
- Spelling fixes
- New mount option to specify filesystem version
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl8xkPIWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wZJbD/9B5wLKRa0iG+LSiiY2WztnYKZ3
l4WjK+/2RbOxYPKuu+DaC2x8SpF0MHvk0mVcc03Az9R3xmzKHcC7OXszqHTJ0uet
JRvFMfLc2C7HoPHvcUQ+zT6bAZzZk/rUhd2uO6+7wHPb8ayqjOEVbDERKCOLOQsb
Z4HQyDuub7BcK554/9P4cUjX+mAcv6N4ILqIbofLq0OclZEceoH+6JlTFxElervO
X4Vfkiw6iZemqRZ118LnU6Ds0BG2sg0z7nlO6H6mBqRpR7iwHaBG88rqJrNyJlZN
9KPRXEUKnSXTzNVrOqHFkgbwpE1HZfVdIoiLW+ghrLBzVD7HmD1GM3wxYgpqh+jo
xXa6jU2RVH1va6YgjDB3qDekWqEDBoEMRnHOHaglViZvsH2BcIP8KUGVWAqHfPvv
AeraksbbVhQ8Vfa8lHAlGPq/kCAjDhdLOfc/clid2YSGZNuMoM5seSAbWD9dHWFR
KYzF+X11hh1BcGo8KttQHDjed3cMs9IuU9tXmBQy65W+ZYqDHU6NS53tN2YD4bbK
bS2Qd6EJUGQDPauxRxKHkwKODUC67sUA3GbGXBAmWhMdu5T2BAoU3jbU2NLtQiZa
yHOaiDDKwbQvpwcd1Ev1gIeihGfKEpt0T6zqABB3YxFtm9le4AEYyFKlhUXNTrX1
GwEhi9jtKznUzoW/CQ==
=iztS
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull JFFS2, UBI and UBIFS updates from Richard Weinberger:
"JFFS2:
- Fix for a corner case while mounting
- Fix for an use-after-free issue
UBI:
- Fix for a memory load while attaching
- Don't produce an anchor PEB with fastmap being disabled
UBIFS:
- Fix for orphan inode logic
- Spelling fixes
- New mount option to specify filesystem version"
* tag 'for-linus-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
jffs2: fix UAF problem
jffs2: fix jffs2 mounting failure
ubifs: Fix wrong orphan node deletion in ubifs_jnl_update|rename
ubi: fastmap: Free fastmap next anchor peb during detach
ubi: fastmap: Don't produce the initial next anchor PEB when fastmap is disabled
ubifs: misc.h: delete a duplicated word
ubifs: add option to specify version for new file systems
If we're in the error path failing links and we have a link that has
grabbed a reference to the fs_struct, then we cannot safely drop our
reference to the table if we already hold the completion lock. This
adds a hardirq dependency to the fs_struct->lock, which it currently
doesn't have.
Defer the final cleanup and free of such requests to avoid adding this
dependency.
Reported-by: syzbot+ef4b654b49ed7ff049bf@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we traverse into failing links or timeouts, we need to ensure we
propagate the REQ_F_COMP_LOCKED flag to ensure that we correctly signal
to the completion side that we already hold the completion lock.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syszbot reports a scenario where we recurse on the completion lock
when flushing an overflow:
1 lock held by syz-executor287/6816:
#0: ffff888093cdb4d8 (&ctx->completion_lock){....}-{2:2}, at: io_cqring_overflow_flush+0xc6/0xab0 fs/io_uring.c:1333
stack backtrace:
CPU: 1 PID: 6816 Comm: syz-executor287 Not tainted 5.8.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1f0/0x31e lib/dump_stack.c:118
print_deadlock_bug kernel/locking/lockdep.c:2391 [inline]
check_deadlock kernel/locking/lockdep.c:2432 [inline]
validate_chain+0x69a4/0x88a0 kernel/locking/lockdep.c:3202
__lock_acquire+0x1161/0x2ab0 kernel/locking/lockdep.c:4426
lock_acquire+0x160/0x730 kernel/locking/lockdep.c:5005
__raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
_raw_spin_lock_irq+0x67/0x80 kernel/locking/spinlock.c:167
spin_lock_irq include/linux/spinlock.h:379 [inline]
io_queue_linked_timeout fs/io_uring.c:5928 [inline]
__io_queue_async_work fs/io_uring.c:1192 [inline]
__io_queue_deferred+0x36a/0x790 fs/io_uring.c:1237
io_cqring_overflow_flush+0x774/0xab0 fs/io_uring.c:1359
io_ring_ctx_wait_and_kill+0x2a1/0x570 fs/io_uring.c:7808
io_uring_release+0x59/0x70 fs/io_uring.c:7829
__fput+0x34f/0x7b0 fs/file_table.c:281
task_work_run+0x137/0x1c0 kernel/task_work.c:135
exit_task_work include/linux/task_work.h:25 [inline]
do_exit+0x5f3/0x1f20 kernel/exit.c:806
do_group_exit+0x161/0x2d0 kernel/exit.c:903
__do_sys_exit_group+0x13/0x20 kernel/exit.c:914
__se_sys_exit_group+0x10/0x10 kernel/exit.c:912
__x64_sys_exit_group+0x37/0x40 kernel/exit.c:912
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix this by passing back the link from __io_queue_async_work(), and
then let the caller handle the queueing of the link. Take care to also
punt the submission reference put to the caller, as we're holding the
completion lock for the __io_queue_defer() case. Hence we need to mark
the io_kiocb appropriately for that case.
Reported-by: syzbot+996f91b6ec3812c48042@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
An earlier commit:
b7db41c9e0 ("io_uring: fix regression with always ignoring signals in io_cqring_wait()")
ensured that we didn't get stuck waiting for eventfd reads when it's
registered with the io_uring ring for event notification, but we still
have cases where the task can be waiting on other events in the kernel and
need a bigger nudge to make forward progress. Or the task could be in the
kernel and running, but on its way to blocking.
This means that TWA_RESUME cannot reliably be used to ensure we make
progress. Use TWA_SIGNAL unconditionally.
Cc: stable@vger.kernel.org # v5.7+
Reported-by: Josef <josef.grieb@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[BUG]
Unmounting a btrfs filesystem with quota disabled will cause the
following NULL pointer dereference:
BTRFS info (device dm-5): has skinny extents
BUG: kernel NULL pointer dereference, address: 0000000000000018
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
CPU: 7 PID: 637 Comm: umount Not tainted 5.8.0-rc7-next-20200731-custom #76
RIP: 0010:kobject_del+0x6/0x20
Call Trace:
btrfs_sysfs_del_qgroups+0xac/0xf0 [btrfs]
btrfs_free_qgroup_config+0x63/0x70 [btrfs]
close_ctree+0x1f5/0x323 [btrfs]
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
exit_to_user_mode_prepare+0x18a/0x190
syscall_exit_to_user_mode+0x4f/0x270
do_syscall_64+0x45/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 37b7adca5c1d5c5d ]---
[CAUSE]
Commit 079ad2fb4b ("kobject: Avoid premature parent object freeing in
kobject_cleanup()") changed kobject_del() that it no longer accepts NULL
pointer.
Before that commit, kobject_del() and kobject_put() all accept NULL
pointers and just ignore such NULL pointers.
But that mentioned commit needs to access the parent node, killing the
old NULL pointer behavior.
Unfortunately btrfs is relying on that hidden feature thus we will
trigger such NULL pointer dereference.
[FIX]
Instead of just saving several lines, do proper fs_info->qgroups_kobj
check before calling kobject_del() and kobject_put().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The `if (!ret)` check will always be false and it may result in
ret->path being dereferenced while it is a NULL pointer.
Fixes: a37f232b7b ("btrfs: backref: introduce the skeleton of btrfs_backref_iter")
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boleyn Su <boleynsu@google.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert the uses of fallthrough comments to fallthrough macro.
Signed-off-by: Hongxiang Lou <louhongxiang@huawei.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
There's some inconsistency around SB_I_VERSION handling with mount and
remount. Since we don't really want it to be off ever just work around
this by making sure we don't get the flag cleared on remount.
There's a tiny cpu cost of setting the bit, otherwise all changes to
i_version also change some of the times (ctime/mtime) so the inode needs
to be synced. We wouldn't save anything by disabling it.
Reported-by: Eric Sandeen <sandeen@redhat.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add perf impact analysis ]
Signed-off-by: David Sterba <dsterba@suse.com>
While logging an inode, at copy_items(), if we fail to lookup the checksums
for an extent we release the destination path, free the ins_data array and
then return immediately. However a previous iteration of the for loop may
have added checksums to the ordered_sums list, in which case we leak the
memory used by them.
So fix this by making sure we iterate the ordered_sums list and free all
its checksums before returning.
Fixes: 3650860b90 ("Btrfs: remove almost all of the BUG()'s from tree-log.c")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chris Murphy reported a problem where rpm ostree will bind mount a bunch
of things for whatever voodoo it's doing. But when it does this
/proc/mounts shows something like
/dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
/dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo/bar 0 0
Despite subvolid=256 being subvol=/foo. This is because we're just
spitting out the dentry of the mount point, which in the case of bind
mounts is the source path for the mountpoint. Instead we should spit
out the path to the actual subvol. Fix this by looking up the name for
the subvolid we have mounted. With this fix the same test looks like
this
/dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
/dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
Reported-by: Chris Murphy <chris@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reported by Forza on IRC that remounting with compression options does
not reflect the change in level, or at least it does not appear to do so
according to the messages:
mount -o compress=zstd:1 /dev/sda /mnt
mount -o remount,compress=zstd:15 /mnt
does not print the change to the level to syslog:
[ 41.366060] BTRFS info (device vda): use zstd compression, level 1
[ 41.368254] BTRFS info (device vda): disk space caching is enabled
[ 41.390429] BTRFS info (device vda): disk space caching is enabled
What really happens is that the message is lost but the level is actualy
changed.
There's another weird output, if compression is reset to 'no':
[ 45.413776] BTRFS info (device vda): use no compression, level 4
To fix that, save the previous compression level and print the message
in that case too and use separate message for 'no' compression.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: David Sterba <dsterba@suse.com>
In try_to_merge_free_space we attempt to find entries to the left and
right of the entry we are adding to see if they can be merged. We
search for an entry past our current info (saved into right_info), and
then if right_info exists and it has a rb_prev() we save the rb_prev()
into left_info.
However there's a slight problem in the case that we have a right_info,
but no entry previous to that entry. At that point we will search for
an entry just before the info we're attempting to insert. This will
simply find right_info again, and assign it to left_info, making them
both the same pointer.
Now if right_info _can_ be merged with the range we're inserting, we'll
add it to the info and free right_info. However further down we'll
access left_info, which was right_info, and thus get a use-after-free.
Fix this by only searching for the left entry if we don't find a right
entry at all.
The CVE referenced had a specially crafted file system that could
trigger this use-after-free. However with the tree checker improvements
we no longer trigger the conditions for the UAF. But the original
conditions still apply, hence this fix.
Reference: CVE-2019-19448
Fixes: 9630308170 ("Btrfs: use hybrid extents+bitmap rb tree for free space")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report of NULL pointer dereference caused in
compress_file_extent():
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
NIP [c008000006dd4d34] compress_file_range.constprop.41+0x75c/0x8a0 [btrfs]
LR [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs]
Call Trace:
[c000000c69093b00] [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] (unreliable)
[c000000c69093bd0] [c008000006dd4ebc] async_cow_start+0x44/0xa0 [btrfs]
[c000000c69093c10] [c008000006e14824] normal_work_helper+0xdc/0x598 [btrfs]
[c000000c69093c80] [c0000000001608c0] process_one_work+0x2c0/0x5b0
[c000000c69093d10] [c000000000160c38] worker_thread+0x88/0x660
[c000000c69093db0] [c00000000016b55c] kthread+0x1ac/0x1c0
[c000000c69093e20] [c00000000000b660] ret_from_kernel_thread+0x5c/0x7c
---[ end trace f16954aa20d822f6 ]---
[CAUSE]
For the following execution route of compress_file_range(), it's
possible to hit NULL pointer dereference:
compress_file_extent()
|- pages = NULL;
|- start = async_chunk->start = 0;
|- end = async_chunk = 4095;
|- nr_pages = 1;
|- inode_need_compress() == false; <<< Possible, see later explanation
| Now, we have nr_pages = 1, pages = NULL
|- cont:
|- ret = cow_file_range_inline();
|- if (ret <= 0) {
|- for (i = 0; i < nr_pages; i++) {
|- WARN_ON(pages[i]->mapping); <<< Crash
To enter above call execution branch, we need the following race:
Thread 1 (chattr) | Thread 2 (writeback)
--------------------------+------------------------------
| btrfs_run_delalloc_range
| |- inode_need_compress = true
| |- cow_file_range_async()
btrfs_ioctl_set_flag() |
|- binode_flags |= |
BTRFS_INODE_NOCOMPRESS |
| compress_file_range()
| |- inode_need_compress = false
| |- nr_page = 1 while pages = NULL
| | Then hit the crash
[FIX]
This patch will fix it by checking @pages before doing accessing it.
This patch is only designed as a hot fix and easy to backport.
More elegant fix may make btrfs only check inode_need_compress() once to
avoid such race, but that would be another story.
Reported-by: Luciano Chavez <chavez@us.ibm.com>
Fixes: 4d3a800ebb ("btrfs: merge nr_pages input and output parameter in compress_pages")
CC: stable@vger.kernel.org # 4.14.x: cecc8d9038: btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- Support for user extended attributes on NFS (RFC 8276)
- Further reduce unnecessary NFSv4 delegation recalls
Notable fixes:
- Fix recent krb5p regression
- Address a few resource leaks and a rare NULL dereference
Other:
- De-duplicate RPC/RDMA error handling and other utility functions
- Replace storage and display of kernel memory addresses by tracepoints
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAl8oBt0ACgkQM2qzM29m
f5dTFQ/9H72E6gr1onsia0/Py0CO8F9qzLgmUBl1vVYAh2/vPqUL1ypxrC5OYrAy
TOqESTsJvmGluCFc/77XUTD7NvJY3znIWim49okwDiyee4Y14ZfRhhCxyyA6Z94E
FjJQb5TbF1Mti4X3dN8Gn7O1Y/BfTjDAAXnXGlTA1xoLcxM5idWIj+G8x0bPmeDb
2fTbgsoETu6MpS2/L6mraXVh3d5ESOJH+73YvpBl0AhYPzlNASJZMLtHtd+A/JbO
IPkMP/7UA5DuJtWGeuQ4I4D5bQNpNWMfN6zhwtih4IV5bkRC7vyAOLG1R7w9+Ufq
58cxPiorMcsg1cHnXG0Z6WVtbMEdWTP/FzmJdE5RC7DEJhmmSUG/R0OmgDcsDZET
GovPARho01yp80GwTjCIctDHRRFRL4pdPfr8PjVHetSnx9+zoRUT+D70Zeg/KSy2
99gmCxqSY9BZeHoiVPEX/HbhXrkuDjUSshwl98OAzOFmv6kbwtLntgFbWlBdE6dB
mqOxBb73zEoZ5P9GA2l2ShU3GbzMzDebHBb9EyomXHZrLejoXeUNA28VJ+8vPP5S
IVHnEwOkdJrNe/7cH4jd/B0NR6f8Da/F9kmkLiG2GNPMqQ8bnVhxTUtZkcAE+fd4
f34qLxsoht70wSSfISjBs7hP5KxEM1lOAf0w0RpycPUKJNV1FB0=
=OEpF
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6
Pull NFS server updates from Chuck Lever:
"Highlights:
- Support for user extended attributes on NFS (RFC 8276)
- Further reduce unnecessary NFSv4 delegation recalls
Notable fixes:
- Fix recent krb5p regression
- Address a few resource leaks and a rare NULL dereference
Other:
- De-duplicate RPC/RDMA error handling and other utility functions
- Replace storage and display of kernel memory addresses by tracepoints"
* tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6: (38 commits)
svcrdma: CM event handler clean up
svcrdma: Remove transport reference counting
svcrdma: Fix another Receive buffer leak
SUNRPC: Refresh the show_rqstp_flags() macro
nfsd: netns.h: delete a duplicated word
SUNRPC: Fix ("SUNRPC: Add "@len" parameter to gss_unwrap()")
nfsd: avoid a NULL dereference in __cld_pipe_upcall()
nfsd4: a client's own opens needn't prevent delegations
nfsd: Use seq_putc() in two functions
svcrdma: Display chunk completion ID when posting a rw_ctxt
svcrdma: Record send_ctxt completion ID in trace_svcrdma_post_send()
svcrdma: Introduce Send completion IDs
svcrdma: Record Receive completion ID in svc_rdma_decode_rqst
svcrdma: Introduce Receive completion IDs
svcrdma: Introduce infrastructure to support completion IDs
svcrdma: Add common XDR encoders for RDMA and Read segments
svcrdma: Add common XDR decoders for RDMA and Read segments
SUNRPC: Add helpers for decoding list discriminators symbolically
svcrdma: Remove declarations for functions long removed
svcrdma: Clean up trace_svcrdma_send_failed() tracepoint
...
Pull misc vfs updates from Al Viro:
"No common topic whatsoever in those, sorry"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: define inode flags using bit numbers
iov_iter: Move unnecessary inclusion of crypto/hash.h
dlmfs: clean up dlmfs_file_{read,write}() a bit
Pull mount leak fix from Al Viro:
"Regression fix for the syscalls-for-init series - fix a leak of a 'struct path'"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: fix a struct path leak in path_umount
Make sure we also put the dentry and vfsmnt in the illegal flags
and !may_umount cases.
Fixes: 41525f56e2 ("fs: refactor ksys_umount")
Reported-by: Vikas Kumar <vikas.kumar2@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull fdpick coredump update from Al Viro:
"Switches fdpic coredumps away from original aout dumping primitives to
the same kind of regset use as regular elf coredumps do"
* 'work.fdpic' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
[elf-fdpic] switch coredump to regsets
[elf-fdpic] use elf_dump_thread_status() for the dumper thread as well
[elf-fdpic] move allocation of elf_thread_status into elf_dump_thread_status()
[elf-fdpic] coredump: don't bother with cyclic list for per-thread objects
kill elf_fpxregs_t
take fdpic-related parts of elf_prstatus out
unexport linux/elfcore.h
ext4_search_dir() and ext4_generic_delete_entry() can be called both for
standard director blocks and for inline directories stored inside inode
or inline xattr space. For the second case we didn't call
ext4_check_dir_entry() with proper constraints that could result in
accepting corrupted directory entry as well as false positive filesystem
errors like:
EXT4-fs error (device dm-0): ext4_search_dir:1395: inode #28320400:
block 113246792: comm dockerd: bad entry in directory: directory entry too
close to block end - offset=0, inode=28320403, rec_len=32, name_len=8,
size=4096
Fix the arguments passed to ext4_check_dir_entry().
Fixes: 109ba779d6 ("ext4: check for directory entries too close to block end")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200731162135.8080-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If a device is hot-removed --- for example, when a physical device is
unplugged from pcie slot or a nbd device's network is shutdown ---
this can result in a BUG_ON() crash in submit_bh_wbc(). This is
because the when the block device dies, the buffer heads will have
their Buffer_Mapped flag get cleared, leading to the crash in
submit_bh_wbc.
We had attempted to work around this problem in commit a17712c8
("ext4: check superblock mapped prior to committing"). Unfortunately,
it's still possible to hit the BUG_ON(!buffer_mapped(bh)) if the
device dies between when the work-around check in ext4_commit_super()
and when submit_bh_wbh() is finally called:
Code path:
ext4_commit_super
judge if 'buffer_mapped(sbh)' is false, return <== commit a17712c8
lock_buffer(sbh)
...
unlock_buffer(sbh)
__sync_dirty_buffer(sbh,...
lock_buffer(sbh)
judge if 'buffer_mapped(sbh))' is false, return <== added by this patch
submit_bh(...,sbh)
submit_bh_wbc(...,sbh,...)
[100722.966497] kernel BUG at fs/buffer.c:3095! <== BUG_ON(!buffer_mapped(bh))' in submit_bh_wbc()
[100722.966503] invalid opcode: 0000 [#1] SMP
[100722.966566] task: ffff8817e15a9e40 task.stack: ffffc90024744000
[100722.966574] RIP: 0010:submit_bh_wbc+0x180/0x190
[100722.966575] RSP: 0018:ffffc90024747a90 EFLAGS: 00010246
[100722.966576] RAX: 0000000000620005 RBX: ffff8818a80603a8 RCX: 0000000000000000
[100722.966576] RDX: ffff8818a80603a8 RSI: 0000000000020800 RDI: 0000000000000001
[100722.966577] RBP: ffffc90024747ac0 R08: 0000000000000000 R09: ffff88207f94170d
[100722.966578] R10: 00000000000437c8 R11: 0000000000000001 R12: 0000000000020800
[100722.966578] R13: 0000000000000001 R14: 000000000bf9a438 R15: ffff88195f333000
[100722.966580] FS: 00007fa2eee27700(0000) GS:ffff88203d840000(0000) knlGS:0000000000000000
[100722.966580] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[100722.966581] CR2: 0000000000f0b008 CR3: 000000201a622003 CR4: 00000000007606e0
[100722.966582] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[100722.966583] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[100722.966583] PKRU: 55555554
[100722.966583] Call Trace:
[100722.966588] __sync_dirty_buffer+0x6e/0xd0
[100722.966614] ext4_commit_super+0x1d8/0x290 [ext4]
[100722.966626] __ext4_std_error+0x78/0x100 [ext4]
[100722.966635] ? __ext4_journal_get_write_access+0xca/0x120 [ext4]
[100722.966646] ext4_reserve_inode_write+0x58/0xb0 [ext4]
[100722.966655] ? ext4_dirty_inode+0x48/0x70 [ext4]
[100722.966663] ext4_mark_inode_dirty+0x53/0x1e0 [ext4]
[100722.966671] ? __ext4_journal_start_sb+0x6d/0xf0 [ext4]
[100722.966679] ext4_dirty_inode+0x48/0x70 [ext4]
[100722.966682] __mark_inode_dirty+0x17f/0x350
[100722.966686] generic_update_time+0x87/0xd0
[100722.966687] touch_atime+0xa9/0xd0
[100722.966690] generic_file_read_iter+0xa09/0xcd0
[100722.966694] ? page_cache_tree_insert+0xb0/0xb0
[100722.966704] ext4_file_read_iter+0x4a/0x100 [ext4]
[100722.966707] ? __inode_security_revalidate+0x4f/0x60
[100722.966709] __vfs_read+0xec/0x160
[100722.966711] vfs_read+0x8c/0x130
[100722.966712] SyS_pread64+0x87/0xb0
[100722.966716] do_syscall_64+0x67/0x1b0
[100722.966719] entry_SYSCALL64_slow_path+0x25/0x25
To address this, add the check of 'buffer_mapped(bh)' to
__sync_dirty_buffer(). This also has the benefit of fixing this for
other file systems.
With this addition, we can drop the workaround in ext4_commit_supper().
[ Commit description rewritten by tytso. ]
Signed-off-by: Xianting Tian <xianting_tian@126.com>
Link: https://lore.kernel.org/r/1596211825-8750-1-git-send-email-xianting_tian@126.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If xfs_sysfs_init is called with parent_kobj == NULL, UBSAN
shows the following warning:
UBSAN: null-ptr-deref in ./fs/xfs/xfs_sysfs.h:37:23
member access within null pointer of type 'struct xfs_kobj'
Call Trace:
dump_stack+0x10e/0x195
ubsan_type_mismatch_common+0x241/0x280
__ubsan_handle_type_mismatch_v1+0x32/0x40
init_xfs_fs+0x12b/0x28f
do_one_initcall+0xdd/0x1d0
do_initcall_level+0x151/0x1b6
do_initcalls+0x50/0x8f
do_basic_setup+0x29/0x2b
kernel_init_freeable+0x19f/0x20b
kernel_init+0x11/0x1e0
ret_from_fork+0x22/0x30
Fix it by checking parent_kobj before the code accesses its member.
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: minor whitespace edits]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Merge misc updates from Andrew Morton:
- a few MM hotfixes
- kthread, tools, scripts, ntfs and ocfs2
- some of MM
Subsystems affected by this patch series: kthread, tools, scripts, ntfs,
ocfs2 and mm (hofixes, pagealloc, slab-generic, slab, slub, kcsan,
debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore,
sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
mm: vmscan: consistent update to pgrefill
mm/vmscan.c: fix typo
khugepaged: khugepaged_test_exit() check mmget_still_valid()
khugepaged: retract_page_tables() remember to test exit
khugepaged: collapse_pte_mapped_thp() protect the pmd lock
khugepaged: collapse_pte_mapped_thp() flush the right range
mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
mm: thp: replace HTTP links with HTTPS ones
mm/page_alloc: fix memalloc_nocma_{save/restore} APIs
mm/page_alloc.c: skip setting nodemask when we are in interrupt
mm/page_alloc: fallbacks at most has 3 elements
mm/page_alloc: silence a KASAN false positive
mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()
mm/page_alloc.c: simplify pageblock bitmap access
mm/page_alloc.c: extract the common part in pfn_to_bitidx()
mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits
mm/shuffle: remove dynamic reconfiguration
mm/memory_hotplug: document why shuffle_zone() is relevant
mm/page_alloc: remove nr_free_pagecache_pages()
mm: remove vm_total_pages
...
The current split between do_mmap() and do_mmap_pgoff() was introduced in
commit 1fcfd8db7f ("mm, mpx: add "vm_flags_t vm_flags" arg to
do_mmap_pgoff()") to support MPX.
The wrapper function do_mmap_pgoff() always passed 0 as the value of the
vm_flags argument to do_mmap(). However, MPX support has subsequently
been removed from the kernel and there were no more direct callers of
do_mmap(); all calls were going via do_mmap_pgoff().
Simplify the code by removing do_mmap_pgoff() and changing all callers to
directly call do_mmap(), which now no longer takes a vm_flags argument.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "make vm_committed_as_batch aware of vm overcommit policy", v6.
When checking a performance change for will-it-scale scalability mmap test
[1], we found very high lock contention for spinlock of percpu counter
'vm_committed_as':
94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
Actually this heavy lock contention is not always necessary. The
'vm_committed_as' needs to be very precise when the strict
OVERCOMMIT_NEVER policy is set, which requires a rather small batch number
for the percpu counter.
So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
enlarge it for not-so-strict OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS
policies.
Benchmark with the same testcase in [1] shows 53% improvement on a 8C/16T
desktop, and 2097%(20X) on a 4S/72C/144T server. And for that case,
whether it shows improvements depends on if the test mmap size is bigger
than the batch number computed.
We tested 10+ platforms in 0day (server, desktop and laptop). If we lift
it to 64X, 80%+ platforms show improvements, and for 16X lift, 1/3 of the
platforms will show improvements.
And generally it should help the mmap/unmap usage,as Michal Hocko
mentioned:
: I believe that there are non-synthetic worklaods which would benefit
: from a larger batch. E.g. large in memory databases which do large
: mmaps during startups from multiple threads.
Note: There are some style complain from checkpatch for patch 4, as sysctl
handler declaration follows the similar format of sibling functions
[1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
This patch (of 4):
Use the existing vm_memory_committed() instead, which is also convenient
for future change.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Qian Cai <cai@lca.pw>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: kernel test robot <rong.a.chen@intel.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1594389708-60781-1-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1594389708-60781-2-git-send-email-feng.tang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: cleanup usage of <asm/pgalloc.h>"
Most architectures have very similar versions of pXd_alloc_one() and
pXd_free_one() for intermediate levels of page table. These patches add
generic versions of these functions in <asm-generic/pgalloc.h> and enable
use of the generic functions where appropriate.
In addition, functions declared and defined in <asm/pgalloc.h> headers are
used mostly by core mm and early mm initialization in arch and there is no
actual reason to have the <asm/pgalloc.h> included all over the place.
The first patch in this series removes unneeded includes of
<asm/pgalloc.h>
In the end it didn't work out as neatly as I hoped and moving
pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
unnecessary changes to arches that have custom page table allocations, so
I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
to mm/.
This patch (of 8):
In most cases <asm/pgalloc.h> header is required only for allocations of
page table memory. Most of the .c files that include that header do not
use symbols declared in <asm/pgalloc.h> and do not require that header.
As for the other header files that used to include <asm/pgalloc.h>, it is
possible to move that include into the .c file that actually uses symbols
from <asm/pgalloc.h> and drop the include from the header file.
The process was somewhat automated using
sed -i -E '/[<"]asm\/pgalloc\.h/d' \
$(grep -L -w -f /tmp/xx \
$(git grep -E -l '[<"]asm/pgalloc\.h'))
where /tmp/xx contains all the symbols defined in
arch/*/include/asm/pgalloc.h.
[rppt@linux.ibm.com: fix powerpc warning]
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the kernel stack is being accounted per-zone. There is no need
to do that. In addition due to being per-zone, memcg has to keep a
separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
node_stat_item. In addition localize the kernel stack stats updates to
account_kernel_stack().
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to prepare for per-object slab memory accounting, convert
NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.
To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).
Internally global and per-node counters are stored in pages, however memcg
and lruvec counters are stored in bytes. This scheme may look weird, but
only for now. As soon as slab pages will be shared between multiple
cgroups, global and node counters will reflect the total number of slab
pages. However memcg and lruvec counters will be used for per-memcg slab
memory tracking, which will take separate kernel objects in the account.
Keeping global and node counters in pages helps to avoid additional
overhead.
The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it
will fit into atomic_long_t we use for vmstats.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The default is still set to inode32 for backwards compatibility, but
system administrators can opt in to the new 64-bit inode numbers by
either:
1. Passing inode64 on the command line when mounting, or
2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
The inode64 and inode32 names are used based on existing precedent from
XFS.
[hughd@google.com: Kconfig fixes]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008011928010.13320@eggly.anvils
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/8b23758d0c66b5e2263e08baf9c4b6a7565cbd8f.1594661218.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As said by Linus:
A symmetric naming is only helpful if it implies symmetries in use.
Otherwise it's actively misleading.
In "kzalloc()", the z is meaningful and an important part of what the
caller wants.
In "kzfree()", the z is actively detrimental, because maybe in the
future we really _might_ want to use that "memfill(0xdeadbeef)" or
something. The "zero" part of the interface isn't even _relevant_.
The main reason that kzfree() exists is to clear sensitive information
that should not be leaked to other future users of the same memory
objects.
Rename kzfree() to kfree_sensitive() to follow the example of the recently
added kvfree_sensitive() and make the intention of the API more explicit.
In addition, memzero_explicit() is used to clear the memory to make sure
that it won't get optimized away by the compiler.
The renaming is done by using the command sequence:
git grep -w --name-only kzfree |\
xargs sed -i 's/kzfree/kfree_sensitive/'
followed by some editing of the kfree_sensitive() kerneldoc and adding
a kzfree backward compatibility macro in slab.h.
[akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h]
[akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more]
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Joe Perches <joe@perches.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on what fails, function can return with nfs_sync_rwlock either
locked or unlocked. That can not be right.
Always return with lock unlocked on error.
Fixes: 4cd9973f9f ("ocfs2: avoid inode removal while nfsd is accessing it")
Signed-off-by: Pavel Machek (CIP) <pavel@denx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200724124443.GA28164@duo.ucw.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `xmlns`:
For each link, `http://[^# ]*(?:\w|/)`:
If neither `gnu\.org/license`, nor `mozilla\.org/MPL`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200713174456.36596-1-grandmaster@al2klimov.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Carpenter reported the following static checker warning.
fs/ocfs2/super.c:1269 ocfs2_parse_options() warn: '(-1)' 65535 can't fit into 32767 'mopt->slot'
fs/ocfs2/suballoc.c:859 ocfs2_init_inode_steal_slot() warn: '(-1)' 65535 can't fit into 32767 'osb->s_inode_steal_slot'
fs/ocfs2/suballoc.c:867 ocfs2_init_meta_steal_slot() warn: '(-1)' 65535 can't fit into 32767 'osb->s_meta_steal_slot'
That's because OCFS2_INVALID_SLOT is (u16)-1. Slot number in ocfs2 can be
never negative, so change s16 to u16.
Fixes: 9277f8334f ("ocfs2: fix value of OCFS2_INVALID_SLOT")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200627001259.19757-1-junxiao.bi@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Drop the repeated word "is" in a comment.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: http://lkml.kernel.org/r/20200720001421.28823-1-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When use setfacl command to change a file's acl, the user cannot get the
latest acl information from the file via getfacl command, until remounting
the file system.
e.g.
setfacl -m u:ivan:rw /ocfs2/ivan
getfacl /ocfs2/ivan
getfacl: Removing leading '/' from absolute path names
file: ocfs2/ivan
owner: root
group: root
user::rw-
group::r--
mask::r--
other::r--
The latest acl record("u:ivan:rw") cannot be returned via getfacl
command until remounting.
Signed-off-by: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200717023751.9922-1-ghe@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clang's Control Flow Integrity (CFI) is a security mechanism that can help
prevent JOP chains, deployed extensively in downstream kernels used in
Android.
Its deployment is hindered by mismatches in function signatures. For this
case, we make callbacks match their intended function signature, and cast
parameters within them rather than casting the callback when passed as a
parameter.
When running `mount -t ntfs ...` we observe the following trace:
Call trace:
__cfi_check_fail+0x1c/0x24
name_to_dev_t+0x0/0x404
iget5_locked+0x594/0x5e8
ntfs_fill_super+0xbfc/0x43ec
mount_bdev+0x30c/0x3cc
ntfs_mount+0x18/0x24
mount_fs+0x1b0/0x380
vfs_kern_mount+0x90/0x398
do_mount+0x5d8/0x1a10
SyS_mount+0x108/0x144
el0_svc_naked+0x34/0x38
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: freak07 <michalechner92@googlemail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Anton Altaparmakov <anton@tuxera.com>
Link: http://lkml.kernel.org/r/20200718112513.533800-1-luca.stefani.ge1@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When remounting filesystem fails late during remount handling and
block_validity mount option is also changed during the remount, we fail
to restore system zone information to a state matching the mount option.
This is mostly harmless, just the block validity checking will not match
the situation described by the mount option. Make sure these two are always
consistent.
Reported-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-7-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There's one place that fails to handle error from add_system_zone() call
and thus we can fail to protect superblock and group-descriptor blocks
properly in case of ENOMEM. Fix it.
Reported-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-6-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After the previous patch, ext4_data_block_valid_rcu() has a single
caller. Fold it into it.
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, system zones just track ranges of block, that are "important"
fs metadata (bitmaps, group descriptors, journal blocks, etc.). This
however complicates how extent tree (or indirect blocks) can be checked
for inodes that actually track such metadata - currently the journal
inode but arguably we should be treating quota files or resize inode
similarly. We cannot run __ext4_ext_check() on such metadata inodes when
loading their extents as that would immediately trigger the validity
checks and so we just hack around that and special-case the journal
inode. This however leads to a situation that a journal inode which has
extent tree of depth at least one can have invalid extent tree that gets
unnoticed until ext4_cache_extents() crashes.
To overcome this limitation, track inode number each system zone belongs
to (0 is used for zones not belonging to any inode). We can then verify
inode number matches the expected one when verifying extent tree and
thus avoid the false errors. With this there's no need to to
special-case journal inode during extent tree checking anymore so remove
it.
Fixes: 0a944e8a6c ("ext4: don't perform block validity checks on the journal inode")
Reported-by: Wolfgang Frisch <wolfgang.frisch@suse.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, add_system_zone() just silently merges two added system zones
that overlap. However the overlap should not happen and it generally
suggests that some unrelated metadata overlap which indicates the fs is
corrupted. We should have caught such problems earlier (e.g. in
ext4_check_descriptors()) but add this check as another line of defense.
In later patch we also use this for stricter checking of journal inode
extent tree.
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_setup_system_zone() can fail. Handle the failure in ext4_remount().
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This numbers can be analized by system automation similar to errors_count.
In ideal world it would be nice to have separate counters for different
log-levels, but this makes this patch too intrusive.
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Link: https://lore.kernel.org/r/20200725123313.4467-1-dmtrmonakhov@yandex-team.ru
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently there is a problem with mount options that can be both set by
vfs using mount flags or by a string parsing in ext4.
i_version/iversion options gets lost after remount, for example
$ mount -o i_version /dev/pmem0 /mnt
$ grep pmem0 /proc/self/mountinfo | grep i_version
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,seclabel,i_version
$ mount -o remount,ro /mnt
$ grep pmem0 /proc/self/mountinfo | grep i_version
nolazytime gets ignored by ext4 on remount, for example
$ mount -o lazytime /dev/pmem0 /mnt
$ grep pmem0 /proc/self/mountinfo | grep lazytime
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel
$ mount -o remount,nolazytime /mnt
$ grep pmem0 /proc/self/mountinfo | grep lazytime
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel
Fix it by applying the SB_LAZYTIME and SB_I_VERSION flags from *flags to
s_flags before we parse the option and use the resulting state of the
same flags in *flags at the end of successful remount.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200723150526.19931-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
For file systems where we can afford to keep the buddy bitmaps cached,
we can speed up initial writes to large file systems by starting to
load the block allocation bitmaps as soon as the file system is
mounted. This won't work well for _super_ large file systems, or
memory constrained systems, so we only enable this when it is
requested via a mount option.
Addresses-Google-Bug: 159488342
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Modify the ext4_read_block_bitmap_load tracepoint so that it tells us
whether a block bitmap is being prefetched.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Parameter gfp_mask in jbd2_journal_try_to_free_buffers() is no longer
used after commit <536fc240e7147> ("jbd2: clean up
jbd2_journal_try_to_free_buffers()"), so just remove it.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200620025427.1756360-6-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If we free a metadata buffer which has been failed to async write out
in the background, the jbd2 checkpoint procedure will not detect this
failure in jbd2_log_do_checkpoint(), so it may lead to filesystem
inconsistency after cleanup journal tail. This patch abort the journal
if free a buffer has write_io_error flag to prevent potential further
inconsistency.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200620025427.1756360-5-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There is a risk of filesystem inconsistency if we failed to async write
back metadata buffer in the background. Because of current buffer's end
io procedure is handled by end_buffer_async_write() in the block layer,
and it only clear the buffer's uptodate flag and mark the write_io_error
flag, so ext4 cannot detect such failure immediately. In most cases of
getting metadata buffer (e.g. ext4_read_inode_bitmap()), although the
buffer's data is actually uptodate, it may still read data from disk
because the buffer's uptodate flag has been cleared. Finally, it may
lead to on-disk filesystem inconsistency if reading old data from the
disk successfully and write them out again.
This patch detect bdev mapping->wb_err when getting journal's write
access and mark the filesystem error if bdev's mapping->wb_err was
increased, this could prevent further writing and potential
inconsistency.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200620025427.1756360-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
- Fix some btree block pingponging problems when swapping extents
- Redesign the reflink copy loop so that we only run one remapping
operation per transaction. This helps us avoid running out of block
reservation on highly deduped filesystems.
- Take the MMAPLOCK around filemap_map_pages.
- Make inode reclaim fully async so that we avoid stalling processes on
flushing inodes to disk.
- Reduce inode cluster buffer RMW cycles by attaching the buffer to
dirty inodes so we won't let go of the cluster buffer when we know
we're going to need it soon.
- Add some more checks to the realtime bitmap file scrubber.
- Don't trip false lockdep warnings in fs freeze.
- Remove various redundant lines of code.
- Remove unnecessary calls to xfs_perag_{get,put}.
- Preserve I_VERSION state across remounts.
- Fix an unmount hang due to AIL going to sleep with a non-empty delwri
buffer list.
- Fix an error in the inode allocation space reservation macro that
caused regressions in generic/531.
- Fix a potential livelock when dquot flush fails because the dquot
buffer is locked.
- Fix a miscalculation when reserving inode quota that could cause users
to exceed a hardlimit.
- Refactor struct xfs_dquot to use native types for incore fields
instead of abusing the ondisk struct for this purpose. This will
eventually enable proper y2038+ support, but for now it merely cleans
up the quota function declarations.
- Actually increment the quota softlimit warning counter so that soft
failures turn into hard(er) failures when they exceed the softlimit
warning counter limits set by the administrator.
- Split incore dquot state flags into their own field and namespace, to
avoid mixing them with quota type flags.
- Create a new quota type flags namespace so that we can make it obvious
when a quota function takes a quota type (user, group, project) as an
argument.
- Rename the ondisk dquot flags field to type, as that more accurately
represents what we store in it.
- Drop our bespoke memory allocation flags in favor of GFP_*.
- Rearrange the xattr functions so that we no longer mix metadata
updates and transaction management (e.g. rolling complex transactions)
in the same functions. This work will prepare us for atomic xattr
operations (itself a prerequisite for directory backrefs) in future
release cycles.
- Support FS_DAX_FL (aka FS_XFLAG_DAX) via GETFLAGS/SETFLAGS.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl8hmOsACgkQ+H93GTRK
tOvCbQ//ax1IR0Bdfz0G85ouNqOqjqpKmtjLJWl4tK0e99AtIFFyjyst0BmiPM61
M2ebqrQ4KtwGcnqPMVczxift4MRfsK1T2WmivuF6GpUTJjEcfo/qDjwPgFT7Gdfc
gVCKWozFnv7z8cOVmRxP3jQR+r32FMnc4Nf8ZZ4LO2gGAqfDySZKFJXjkywR5ETk
rE0BivsXKqldbSA0nibMwmxNIWn+tBE+Bv3rSDAd6ZWEKbBZkrxf5GW+GkBD/xon
HT+T8lKFG5F+9kmL+BRhtV2eZkcbAOmP6x6NTX0SZcXlX2BBT7ltBusjal0lK0uh
IizqZv5UG6S2cv0j69EbXm3gvXBs+okAGLRtIPAExfWXf/x0JZp6dNbsnwwI0ZSC
N1Uy3lAcyF1Wybuf/6w4bMu8zrVcty8wSOD5psL4GXnhvw9c8iASszwGOUnOUH84
jRZYJNE9jsdItjP/5hNANaidPBnapPxWY4nvqJN4H3FjqTQOakE4X2OSB0LREIDp
avAFISrlU2dl0AwMHxCDOlv56FkHsIZ8aJmCpGPQdYpGu8jjYWTUYKvUvcjH/NBW
aVL7pUgP5r3vIxawS9tJcy1t3t7JDZmC+w5oasmQ0Rsk1r5mTgNrP6xue6rZzaRg
wo6mWZvteoWtEZJnT+L4Glx7i1srMj++Dqu24cJ92o/omA+fmr4=
=nhPQ
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.9-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"There are quite a few changes in this release, the most notable of
which is that we've made inode flushing fully asynchronous, and we no
longer block memory reclaim on this.
Furthermore, we have fixed a long-standing bug in the quota code where
soft limit warnings and inode limits were never tracked properly.
Moving further down the line, the reflink control loops have been
redesigned to behave more efficiently; and numerous small bugs have
been fixed (see below). The xattr and quota code have been extensively
refactored in preparation for more new features coming down the line.
Finally, the behavior of DAX between ext4 and xfs has been stabilized,
which gets us a step closer to removing the experimental tag from that
feature.
We have a few new contributors this time around. Welcome, all!
I anticipate a second pull request next week for a few small bugfixes
that have been trickling in, but this is it for big changes.
Summary:
- Fix some btree block pingponging problems when swapping extents
- Redesign the reflink copy loop so that we only run one remapping
operation per transaction. This helps us avoid running out of block
reservation on highly deduped filesystems.
- Take the MMAPLOCK around filemap_map_pages.
- Make inode reclaim fully async so that we avoid stalling processes
on flushing inodes to disk.
- Reduce inode cluster buffer RMW cycles by attaching the buffer to
dirty inodes so we won't let go of the cluster buffer when we know
we're going to need it soon.
- Add some more checks to the realtime bitmap file scrubber.
- Don't trip false lockdep warnings in fs freeze.
- Remove various redundant lines of code.
- Remove unnecessary calls to xfs_perag_{get,put}.
- Preserve I_VERSION state across remounts.
- Fix an unmount hang due to AIL going to sleep with a non-empty
delwri buffer list.
- Fix an error in the inode allocation space reservation macro that
caused regressions in generic/531.
- Fix a potential livelock when dquot flush fails because the dquot
buffer is locked.
- Fix a miscalculation when reserving inode quota that could cause
users to exceed a hardlimit.
- Refactor struct xfs_dquot to use native types for incore fields
instead of abusing the ondisk struct for this purpose. This will
eventually enable proper y2038+ support, but for now it merely
cleans up the quota function declarations.
- Actually increment the quota softlimit warning counter so that soft
failures turn into hard(er) failures when they exceed the softlimit
warning counter limits set by the administrator.
- Split incore dquot state flags into their own field and namespace,
to avoid mixing them with quota type flags.
- Create a new quota type flags namespace so that we can make it
obvious when a quota function takes a quota type (user, group,
project) as an argument.
- Rename the ondisk dquot flags field to type, as that more
accurately represents what we store in it.
- Drop our bespoke memory allocation flags in favor of GFP_*.
- Rearrange the xattr functions so that we no longer mix metadata
updates and transaction management (e.g. rolling complex
transactions) in the same functions. This work will prepare us for
atomic xattr operations (itself a prerequisite for directory
backrefs) in future release cycles.
- Support FS_DAX_FL (aka FS_XFLAG_DAX) via GETFLAGS/SETFLAGS"
* tag 'xfs-5.9-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (117 commits)
fs/xfs: Support that ioctl(SETXFLAGS/GETXFLAGS) can set/get inode DAX on XFS.
xfs: Lift -ENOSPC handler from xfs_attr_leaf_addname
xfs: Simplify xfs_attr_node_addname
xfs: Simplify xfs_attr_leaf_addname
xfs: Add helper function xfs_attr_node_removename_rmt
xfs: Add helper function xfs_attr_node_removename_setup
xfs: Add remote block helper functions
xfs: Add helper function xfs_attr_leaf_mark_incomplete
xfs: Add helpers xfs_attr_is_shortform and xfs_attr_set_shortform
xfs: Remove xfs_trans_roll in xfs_attr_node_removename
xfs: Remove unneeded xfs_trans_roll_inode calls
xfs: Add helper function xfs_attr_node_shrink
xfs: Pull up xfs_attr_rmtval_invalidate
xfs: Refactor xfs_attr_rmtval_remove
xfs: Pull up trans roll in xfs_attr3_leaf_clearflag
xfs: Factor out xfs_attr_rmtval_invalidate
xfs: Pull up trans roll from xfs_attr3_leaf_setflag
xfs: Refactor xfs_attr_try_sf_addname
xfs: Split apart xfs_attr_leaf_addname
xfs: Pull up trans handling in xfs_attr3_leaf_flipflags
...
Pull init and set_fs() cleanups from Al Viro:
"Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"
* 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
init: add an init_dup helper
init: add an init_utimes helper
init: add an init_stat helper
init: add an init_mknod helper
init: add an init_mkdir helper
init: add an init_symlink helper
init: add an init_link helper
init: add an init_eaccess helper
init: add an init_chmod helper
init: add an init_chown helper
init: add an init_chroot helper
init: add an init_chdir helper
init: add an init_rmdir helper
init: add an init_unlink helper
init: add an init_umount helper
init: add an init_mount helper
init: mark create_dev as __init
init: mark console_on_rootfs as __init
init: initialize ramdisk_execute_command at compile time
devtmpfs: refactor devtmpfsd()
...
Pull ptrace regset updates from Al Viro:
"Internal regset API changes:
- regularize copy_regset_{to,from}_user() callers
- switch to saner calling conventions for ->get()
- kill user_regset_copyout()
The ->put() side of things will have to wait for the next cycle,
unfortunately.
The balance is about -1KLoC and replacements for ->get() instances are
a lot saner"
* 'work.regset' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
regset: kill user_regset_copyout{,_zero}()
regset(): kill ->get_size()
regset: kill ->get()
csky: switch to ->regset_get()
xtensa: switch to ->regset_get()
parisc: switch to ->regset_get()
nds32: switch to ->regset_get()
nios2: switch to ->regset_get()
hexagon: switch to ->regset_get()
h8300: switch to ->regset_get()
openrisc: switch to ->regset_get()
riscv: switch to ->regset_get()
c6x: switch to ->regset_get()
ia64: switch to ->regset_get()
arc: switch to ->regset_get()
arm: switch to ->regset_get()
sh: convert to ->regset_get()
arm64: switch to ->regset_get()
mips: switch to ->regset_get()
sparc: switch to ->regset_get()
...
Before this patch, if function gfs2_dirty_inode got an error when
trying to lock the inode glock, it complained, but it didn't say
what glock or inode had the problem.
In this case, it almost always means that dinode_in found an error
with the dinode in the file system. So it makes sense to dump the
glock, which tells us the location of the dinode in the file system.
That will allow us to analyze the corruption from the metadata.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, some functions started transactions then they called
gfs2_block_zero_range. However, gfs2_block_zero_range, like writes, can
start transactions, which results in a recursive transaction error.
For example:
do_shrink
trunc_start
gfs2_trans_begin <------------------------------------------------
gfs2_block_zero_range
iomap_zero_range(inode, from, length, NULL, &gfs2_iomap_ops);
iomap_apply ... iomap_zero_range_actor
iomap_begin
gfs2_iomap_begin
gfs2_iomap_begin_write
actor (iomap_zero_range_actor)
iomap_zero
iomap_write_begin
gfs2_iomap_page_prepare
gfs2_trans_begin <------------------------
This patch reorders the callers of gfs2_block_zero_range so that they
only start their transactions after the call. It also adds a BUG_ON to
ensure this doesn't happen again.
Fixes: 2257e468a6 ("gfs2: implement gfs2_block_zero_range using iomap_zero_range")
Cc: stable@vger.kernel.org # v5.5+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
If function gfs2_trans_begin is called with another transaction active
it BUGs out, but it doesn't give any details about the duplicate.
This patch moves function gfs2_print_trans and calls it when this
situation arises for better debugging.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The comment regarding journal flush thresholds is wrong. This patch fixes it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The error handling calls kfree(full_path) so we can't let it be a NULL
pointer. There used to be a NULL assignment here but we accidentally
deleted it. Add it back.
Fixes: 7efd081582 ("cifs: document and cleanup dfs mount")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
This set includes a some improvements to the dlm
networking layer: improving the ability to trace
dlm messages for debugging, and improved handling
of bad messages or disrupted connections.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJfLCPxAAoJEDgbc8f8gGmqz04P/2hvv/4rXo9AOgnnstvZV1Qy
Yo01Cy807vB1c3jhIJryM2gG61GNH22RAHc2NcfjJwy04HH/1IEr6P48Po3qYEnS
8fZ8B9msxpsujVOrRoeBuLN8elI1HftyNVWaVjH7xtD+fLCDLu9i10kv3aeS+DiB
T6f7yQQv7hgXS3xGvlMr2//aLwGD2ZdcRbkOEGo+k7yUjQbIDH/wdZWcPLh6y4yT
p20i2ulYKjEZFmXDMa17diONISeGO6iaDhee24XPDwNDp8qI1iPGJsmxltMmn8Qf
d2HPF1IDh4eM8lCwmqBtjYTnJd6rAW0v3+Ek1+wzQKVeXLFiz/MEyuOldtpsqmMO
8Og0vr6zfTCjFo8uvyj+cF7Fcj0yIPWg1yb7EauqqxreK8V9GBA1V2ZXYVd8xwea
thrAUaq8f+PYQ9uy1FsN3xaO3BFN1VpcvHu4/3gU3OudnZZt2Ae670RYHKC0bq8D
2tSsqaiDnlvniHgh4xvtNIvRANkDS1ZSbkUPZhMHL7DnRJn66oDIfCr7NMbZwvCa
AS0q6suUFyXFbAEJcY6XWxe3aQ3WuxIClT84MgzX/dAK2Qcl8ryWGGSVc0dp4Vl1
cd8MtmpnIWsnxqNRl4jn6cfolDheaxL8nouLtJ+3/dC9VkyDyfmrtnM+8aTZKHoa
3/xrBuVkEJAwkAAr8Pb8
=qgti
-----END PGP SIGNATURE-----
Merge tag 'dlm-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set includes a some improvements to the dlm networking layer:
improving the ability to trace dlm messages for debugging, and
improved handling of bad messages or disrupted connections"
* tag 'dlm-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
fs: dlm: implement tcp graceful shutdown
fs: dlm: change handling of reconnects
fs: dlm: don't close socket on invalid message
fs: dlm: set skb mark per peer socket
fs: dlm: set skb mark for listen socket
net: sock: add sock_set_mark
dlm: Fix kobject memleak
- Make sure we call ->iomap_end with a failure code if ->iomap_begin
failed in any way; some filesystems need to try to undo things.
- Don't invalidate the page cache during direct reads since we already
sync'd the cache with disk.
- Make direct writes fall back to the page cache if the pre-write
cache invalidation fails. This avoids a cache coherency problem.
- Fix some idiotic virus scanner warning bs in the previous tag.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl8q3aMACgkQ+H93GTRK
tOvh9BAAkUF11er5pSefKdM1t2WGlSjMgPRmMHELRFwLhJBPK1zyIIs+kCz9+k1/
lGe8mAEI7cA06jiUXYCbHZW1Cgno46VYZxWVnIE3i7c3xYt8pwApqwY+ATqH6X75
7peax9L0Dn8DK7mzw6ihcO6LCIH0iyfHeBpWyKN87APBhKU6nNtVah/I/3NGnbWJ
EbH6TSf4FWqzBvYJZKUQRqrGZRJWUinrRAqLnh2fWxVcjUDLVTnbjWxAuL0StgWB
H1AY3dof9ROYK3SKFNPqtur8nXcrHNCvnOSgQmB8F++ZkfsubR1MREWpndBJTHnd
/a5zNveiQGvA8drM1+2v/QLd30yp3I+LHSlM+BY5Bc/Xl8c2ZanwAhu+x40Ha6qq
rjsh31Hdn6E4qzP+ne+eVSWyPPHNZCK0i7gBWlTBodlJyHN70N1RBCfGBnO2VVbt
fZCxn6kxLYrfKEoQVQS+9QGu3cRSh7yYsLGjWoK5iynsVJCOvMjmTZ6uPL2EWAEY
9oz/QRxyTaVit1sgk0ypsrfZ4yFafI5QIDCLM9pHpxgj0QNddO2smAyKO2WItZ90
ERz/0UYg1LJoEl4lmBwHoYAI3aU37FyO9UhjgTIJSZeLZbnK1aba9uikwgrSmS/c
XLVy0WyPWd/JMBhA0EAAaQFBa1D6gTdTskSG8Djl1saiYNu6kGs=
=rjsZ
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.9-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"The most notable changes are:
- iomap no longer invalidates the page cache when performing a direct
read, since doing so is unnecessary and the old directio code
doesn't do that either.
- iomap embraced the use of returning ENOTBLK from a direct write to
trigger falling back to a buffered write since ext4 already did
this and btrfs wants it for their port.
- iomap falls back to buffered writes if we're doing a direct write
and the page cache invalidation after the flush fails; this was
necessary to handle a corner case in the btrfs port.
- Remove email virus scanner detritus that was accidentally included
in yesterday's pull request. Clearly I need(ed) to update my git
branch checker scripts. :("
* tag 'iomap-5.9-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: fall back to buffered writes for invalidation failures
xfs: use ENOTBLK for direct I/O to buffered I/O fallback
iomap: Only invalidate page cache pages on direct IO writes
iomap: Make sure iomap_end is called after iomap_begin
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl8qeCkACgkQnJ2qBz9k
QNlAGQf/YVruyVLZ7kCv6EMCHauXm3K1lEGpbXsTW04HpStxGx7mtLGN/Au+EYJR
VnRkCMt6TSMQGMBkNF83dUCwXHkeL1rd6frJBLVOErkg50nUuD4kjTVw9Lzw9itx
CPhKnPPlsRkDkZPxkg3WEdqPgzJREWBZUaB38QUPjYN46q7HfPYDANTh5wI1GiGs
27+PvzlttjhkQpQ14pYU/nu4xf/nmgmmHhgfsJArQP2EzYOrKxsWKhXS5uPdtNlf
mXiZMaqW2AlyDGlw3myOEySrrSuaR77M2bzDo7mjqffI9wSVTytKEhtg0i8OMWmv
pZ38OQobznnFoqzc1GL70IE0DEU48g==
=d81d
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
- fanotify fix for softlockups when there are many queued events
- performance improvement to reduce fsnotify overhead when not used
- Amir's implementation of fanotify events with names. With these you
can now efficiently monitor whole filesystem, eg to mirror changes to
another machine.
* tag 'fsnotify_for_v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (37 commits)
fanotify: compare fsid when merging name event
fsnotify: create method handle_inode_event() in fsnotify_operations
fanotify: report parent fid + child fid
fanotify: report parent fid + name + child fid
fanotify: add support for FAN_REPORT_NAME
fanotify: report events with parent dir fid to sb/mount/non-dir marks
fanotify: add basic support for FAN_REPORT_DIR_FID
fsnotify: remove check that source dentry is positive
fsnotify: send event with parent/name info to sb/mount/non-dir marks
audit: do not set FS_EVENT_ON_CHILD in audit marks mask
inotify: do not set FS_EVENT_ON_CHILD in non-dir mark mask
fsnotify: pass dir and inode arguments to fsnotify()
fsnotify: create helper fsnotify_inode()
fsnotify: send event to parent and child with single callback
inotify: report both events on parent and child with single callback
dnotify: report both events on parent and child with single callback
fanotify: no external fh buffer in fanotify_name_event
fanotify: use struct fanotify_info to parcel the variable size buffer
fsnotify: add object type "child" to object type iterator
fanotify: use FAN_EVENT_ON_CHILD as implicit flag on sb/mount/non-dir marks
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl8qdtkACgkQnJ2qBz9k
QNkNbQgAiLy3zzqBT9noZ5WEI8VzStsRDUyccbzaCIbSrqv7sBbf2ey+iaE9V5gR
HCNZtTSBChMyzpGt1j9l+1/a/0ntzcypb74+kRWi6eApqGh6X8tCggjqIKloy5Bg
jAkYHpvjz1Dpv1qdOWgcCI76XkF8Q+bID4HjsbvxKr4dEVaqlTictZhwtk2oonRN
paREsiwSvjdCEZ/3r2FO4kYAtxMD+x2KhImu/UHJKG92GsQiC4IY5zJmy9aV4gw+
16Z46PtYmzvYli59m2NQgCY5j95dL2VBmjtjFoxMOsUgb76PcqVAhfNeYVo0rmYU
vfs5ngYdxDjYFBCbg45Fu+zO3ploTQ==
=zoom
-----END PGP SIGNATURE-----
Merge tag 'for_v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2, udf, reiserfs, quota cleanups and minor fixes from Jan Kara:
"A few ext2 fixups and then several (mostly comment and documentation)
cleanups in ext2, udf, reiserfs, and quota"
* tag 'for_v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: delete duplicated words
udf: osta_udf.h: delete a duplicated word
reiserfs: reiserfs.h: delete a duplicated word
ext2: ext2.h: fix duplicated word + typos
udf: Replace HTTP links with HTTPS ones
quota: Fixup http links in quota doc
Replace HTTP links with HTTPS ones: DISKQUOTA
ext2: initialize quota info in ext2_xattr_set()
ext2: fix some incorrect comments in inode.c
ext2: remove nocheck option
ext2: fix missing percpu_counter_inc
ext2: ext2_find_entry() return -ENOENT if no entry found
ext2: propagate errors up to ext2_find_entry()'s callers
ext2: fix improper assignment for e_value_offs
- use HTTPS links instead of insecure HTTP ones;
- fix crossing page boundary on specific extended inodes;
- remove useless WQ_CPU_INTENSIVE flag for unbound wq;
- minor cleanup.
-----BEGIN PGP SIGNATURE-----
iIsEABYIADMWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXytmoBUcaHNpYW5na2Fv
QHJlZGhhdC5jb20ACgkQOTcx3B+15gSg8gEA/LwZy3e/Tnor9CP2Mc+QSMPmuhvX
ZwsxOyYqYGkVtlcBAMLKiBu96hqH+V3AOPHNfqS19N3fdjs34CEp/wbl1x8G
=I/Yp
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"This cycle mainly addresses an issue out of some extended inode with
designated location, which are not generated by current mkfs but need
to handled at runtime anyway. The others are quite trivial ones.
- use HTTPS links instead of insecure HTTP ones;
- fix crossing page boundary on specific extended inodes;
- remove useless WQ_CPU_INTENSIVE flag for unbound wq;
- minor cleanup"
* tag 'erofs-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: remove WQ_CPU_INTENSIVE flag from unbound wq's
erofs: fold in used-once helper erofs_workgroup_unfreeze_final()
erofs: fix extended inode could cross boundary
erofs: Replace HTTP links with HTTPS ones
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl8ojNUACgkQiiy9cAdy
T1Fkxwv/fwoI+RGQyIRYeFSr85yqnbDRG/tBFUdsiOxjvKwzzINg0ECGpmkxpuwb
npICFmjmhzApXiTnEj9Jm81Qk50W0L1sQtjgPe14ROuRdzbhooVdQM2Y7LCCUwjm
8TzVWRyNkYrwg95hhfYTlqXCIrH9aNYv6UbFlRgeoT4fTgqiWZysDOlnt9wVwXtu
O5SXy8DBn9SRehtzVvXkMXosirwgMJ1QjHwyGuyVpsiQRGEPy7jWe1s90fxhWUNA
CejrD/OURNCgVO8B2DlygCl0RqsE3yqQA8IM+0tFnEeOXkYW9GdRoEZlLHGRT2Nf
HqE/lxpJqip1ykmAtbzLH1g+nmEkrXjQB4+q7krZnuw1L132dbP5hl/x+NJR8wsU
ymjuou96dCedNVemwbqjRvNBSHA1IiqQP0B9bPJCqsBMgX6HAZgDK3gVrlDqR8gt
7M/f935YRu2I/aE810Bw4SOTYkEdLKJyrw5fQFQO+mZGDe6zbtabMvflpxW5jz5m
sRLjOY/X
=X8aQ
-----END PGP SIGNATURE-----
Merge tag '5.9-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
"16 cifs/smb3 fixes, about half DFS related, two fixes for stable.
Still working on and testing an additional set of fixes (including
updates to mount, and some fallocate scenario improvements) for later
in the merge window"
* tag '5.9-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6:
cifs: document and cleanup dfs mount
cifs: only update prefix path of DFS links in cifs_tree_connect()
cifs: fix double free error on share and prefix
cifs: handle RESP_GET_DFS_REFERRAL.PathConsumed in reconnect
cifs: handle empty list of targets in cifs_reconnect()
cifs: rename reconn_inval_dfs_target()
cifs: reduce number of referral requests in DFS link lookups
cifs: merge __{cifs,smb2}_reconnect[_tcon]() into cifs_tree_connect()
cifs: convert to use be32_add_cpu()
cifs: delete duplicated words in header files
cifs: Remove the superfluous break
cifs: smb1: Try failing back to SetFileInfo if SetPathInfo fails
cifs`: handle ERRBaduid for SMB1
cifs: remove unused variable 'server'
smb3: warn on confusing error scenario with sec=krb5
cifs: Fix leak when handling lease break for cached root fid
During my code inspection I saw there is no implementation of a graceful
shutdown for tcp. This patch will introduce a graceful shutdown for tcp
connections. The shutdown is implemented synchronized as
dlm_lowcomms_stop() is called to end all dlm communication. After shutdown
is done, a lot of flush and closing functionality will be called. However
I don't see a problem with that.
The waitqueue for synchronize the shutdown has a timeout of 10 seconds, if
timeout a force close will be exectued.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch changes the handling of reconnects. At first we only close
the connection related to the communication failure. If we get a new
connection for an already existing connection we close the existing
connection and take the new one.
This patch improves significantly the stability of tcp connections while
running "tcpkill -9 -i $IFACE port 21064" while generating a lot of dlm
messages e.g. on a gfs2 mount with many files. My test setup shows that a
deadlock is "more" unlikely. Before this patch I wasn't able to get
not a deadlock after 5 seconds. After this patch my observation is
that it's more likely to survive after 5 seconds and more, but still a
deadlock occurs after certain time. My guess is that there are still
"segments" inside the tcp writequeue or retransmit queue which get dropped
when receiving a tcp reset [1]. Hard to reproduce because the right message
need to be inside these queues, which might even be in the 5 first seconds
with this patch.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_input.c?h=v5.8-rc6#n4122
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch doesn't close sockets when there is an invalid dlm message
received. The connection will probably reconnect anyway so. To not
close the connection will reduce the number of possible failtures.
As we don't have a different strategy to react on such scenario
just keep going the connection and ignore the message.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch adds support to set the skb mark value for the DLM tcp and
sctp socket per peer. The mark value will be offered as per comm value
of configfs. At creation time of the peer socket it will be set as
socket option.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch adds support to set the skb mark value for the DLM listen
tcp and sctp sockets. The mark value will be offered as cluster
configuration. At creation time of the listen socket it will be set as
socket option.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Currently the error return path from kobject_init_and_add() is not
followed by a call to kobject_put() - which means we are leaking
the kobject.
Set do_unreg = 1 before kobject_init_and_add() to ensure that
kobject_put() can be called in its error patch.
Fixes: 901195ed7f ("Kobject: change GFS2 to use kobject_init_and_add")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The tear down path will always unaccount the memory, so ensure that we
have accounted it before hitting any of them.
Reported-by: Tomáš Chaloupka <chalucha@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we hit an earlier error path in io_uring_create(), then we will have
accounted memory, but not set ctx->{sq,cq}_entries yet. Then when the
ring is torn down in error, we use those values to unaccount the memory.
Ensure we set the ctx entries before we're able to hit a potential error
path.
Cc: stable@vger.kernel.org
Reported-by: Tomáš Chaloupka <chalucha@gmail.com>
Tested-by: Tomáš Chaloupka <chalucha@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cr=0 is supposed to be an optimization to save CPU cycles, but if
buddy data (in memory) is not initialized then all this makes no sense
as we have to do sync IO taking a lot of cycles. Also, at cr=0
mballoc doesn't choose any available chunk. cr=1 also skips groups
using heuristic based on avg. fragment size. It's more useful to skip
such groups and switch to cr=2 where groups will be scanned for
available chunks. However, we always read the first block group in a
flex_bg so metadata blocks will get read into the first flex_bg if
possible.
Using sparse image and dm-slow virtual device of 120TB was
simulated, then the image was formatted and filled using debugfs to
mark ~85% of available space as busy. mount process w/o the patch
couldn't complete in half an hour (according to vmstat it would take
~10-11 hours). With the patch applied mount took ~20 seconds.
Lustre-bug-id: https://jira.whamcloud.com/browse/LU-12988
Signed-off-by: Alex Zhuravlev <azhuravlev@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
This should significantly improve bitmap loading, especially for flex
groups as it tries to load all bitmaps within a flex.group instead of
one by one synchronously.
Prefetching is done in 8 * flex_bg groups, so it should be 8
read-ahead reads for a single allocating thread. At the end of
allocation the thread waits for read-ahead completion and initializes
buddy information so that read-aheads are not lost in case of memory
pressure.
At cr=0 the number of prefetching IOs is limited per allocation
context to prevent a situation when mballoc loads thousands of bitmaps
looking for a perfect group and ignoring groups with good chunks.
Together with the patch "ext4: limit scanning of uninitialized groups"
the mount time (which includes few tiny allocations) of a 1PB
filesystem is reduced significantly:
0% full 50%-full unpatched patched
mount time 33s 9279s 563s
[ Restructured by tytso; removed the state flags in the allocation
context, so it can be used to lazily prefetch the allocation bitmaps
immediately after the file system is mounted. Skip prefetching
block groups which are uninitialized. Finally pass in the
REQ_RAHEAD flag to the block layer while prefetching. ]
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Don't define EXT4_IOC_* aliases to ioctls that already have a generic
FS_IOC_* name. These aliases are unnecessary, and they make it unclear
which ioctls are ext4-specific and which are generic.
Exception: leave EXT4_IOC_GETVERSION_OLD and EXT4_IOC_SETVERSION_OLD
as-is for now, since renaming them to FS_IOC_GETVERSION and
FS_IOC_SETVERSION would probably make them more likely to be confused
with EXT4_IOC_GETVERSION and EXT4_IOC_SETVERSION which also exist.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200714230909.56349-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Define the EXT4_FL_USER_* constants by OR-ing together the appropriate
flags, rather than hard-coding a numeric value. This makes it much
easier to see which flags are listed.
No change in the actual values.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200713031012.192440-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
A customer has reported a BUG_ON in ext4_clear_journal_err() hitting
during an LTP testing. Either this has been caused by a test setup
issue where the filesystem was being overwritten while LTP was mounting
it or the journal replay has overwritten the superblock with invalid
data. In either case it is preferable we don't take the machine down
with a BUG_ON. So handle the situation of unexpectedly missing
has_journal feature more gracefully. We issue warning and fail the mount
in the cases where the race window is narrow and the failed check is
most likely a programming error. In cases where fs corruption is more
likely, we do full ext4_error() handling before failing mount / remount.
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200710140759.18031-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since commit 378f32bab3 ("ext4: introduce direct I/O write using iomap
infrastructure") we don't properly bail out of RWF_NOWAIT direct IO
write if underlying blocks are not allocated. Also
ext4_dio_write_checks() does not honor RWF_NOWAIT when re-acquiring
i_rwsem. Fix both issues.
Fixes: 378f32bab3 ("ext4: introduce direct I/O write using iomap infrastructure")
Cc: stable@kernel.org
Reported-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200708153516.9507-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Link: https://lore.kernel.org/r/20200706190339.20709-1-grandmaster@al2klimov.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If dquot_initialize() return non-zero and trace of ext4_unlink_enter/exit
enabled then the matching-pair of trace_exit will lost in log.
Signed-off-by: Yi Zhuang <zhuangyi1@huawei.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200629122621.129953-1-zhuangyi1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
It should call trace exit in all return path for ext4_truncate.
Signed-off-by: zhengliang <zhengliang6@huawei.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200701083027.45996-1-zhengliang6@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
jbd2_write_superblock() is under the buffer lock of journal superblock
before ending that superblock write, so add a missing unlock_buffer() in
in the error path before submitting buffer.
Fixes: 742b06b562 ("jbd2: check superblock mapped prior to committing")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20200620061948.2049579-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If for any reason a directory passed to do_split() does not have enough
active entries to exceed half the size of the block, we can end up
iterating over all "count" entries without finding a split point.
In this case, count == move, and split will be zero, and we will
attempt a negative index into map[].
Guard against this by detecting this case, and falling back to
split-to-half-of-count instead; in this case we will still have
plenty of space (> half blocksize) in each split block.
Fixes: ef2b02d3e6 ("ext34: ensure do_split leaves enough free space in both blocks")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/f53e246b-647c-64bb-16ec-135383c70ad7@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Callers of __jbd2_journal_unfile_buffer() and
__jbd2_journal_refile_buffer() assume that the b_transaction is set. In
fact if it's not, we can end up with journal_head refcounting errors
leading to crash much later that might be very hard to track down. Add
asserts to make sure that is the case.
We also make sure that b_next_transaction is NULL in
__jbd2_journal_unfile_buffer() since the callers expect that as well and
we should not get into that stage in this state anyway, leading to
problems later on if we do.
Tested with fstests.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200617092549.6712-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The brelse() function tests whether its argument is NULL
and then returns immediately.
Thus remove the tests which are not needed around the shown calls.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/0d713702-072f-a89c-20ec-ca70aa83a432@web.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull networking updates from David Miller:
1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan.
2) Support UDP segmentation in code TSO code, from Eric Dumazet.
3) Allow flashing different flash images in cxgb4 driver, from Vishal
Kulkarni.
4) Add drop frames counter and flow status to tc flower offloading,
from Po Liu.
5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.
6) Various new indirect call avoidance, from Eric Dumazet and Brian
Vazquez.
7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
Yonghong Song.
8) Support querying and setting hardware address of a port function via
devlink, use this in mlx5, from Parav Pandit.
9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.
10) Switch qca8k driver over to phylink, from Jonathan McDowell.
11) In bpftool, show list of processes holding BPF FD references to
maps, programs, links, and btf objects. From Andrii Nakryiko.
12) Several conversions over to generic power management, from Vaibhav
Gupta.
13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
Yakunin.
14) Various https url conversions, from Alexander A. Klimov.
15) Timestamping and PHC support for mscc PHY driver, from Antoine
Tenart.
16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.
17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.
18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.
19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
drivers. From Luc Van Oostenryck.
20) XDP support for xen-netfront, from Denis Kirjanov.
21) Support receive buffer autotuning in MPTCP, from Florian Westphal.
22) Support EF100 chip in sfc driver, from Edward Cree.
23) Add XDP support to mvpp2 driver, from Matteo Croce.
24) Support MPTCP in sock_diag, from Paolo Abeni.
25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
infrastructure, from Jakub Kicinski.
26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.
27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.
28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.
29) Refactor a lot of networking socket option handling code in order to
avoid set_fs() calls, from Christoph Hellwig.
30) Add rfc4884 support to icmp code, from Willem de Bruijn.
31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.
32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.
33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.
34) Support TCP syncookies in MPTCP, from Flowian Westphal.
35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano
Brivio.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
net: thunderx: initialize VF's mailbox mutex before first usage
usb: hso: remove bogus check for EINPROGRESS
usb: hso: no complaint about kmalloc failure
hso: fix bailout in error case of probe
ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
selftests/net: relax cpu affinity requirement in msg_zerocopy test
mptcp: be careful on subflow creation
selftests: rtnetlink: make kci_test_encap() return sub-test result
selftests: rtnetlink: correct the final return value for the test
net: dsa: sja1105: use detected device id instead of DT one on mismatch
tipc: set ub->ifindex for local ipv6 address
ipv6: add ipv6_dev_find()
net: openvswitch: silence suspicious RCU usage warning
Revert "vxlan: fix tos value before xmit"
ptp: only allow phase values lower than 1 period
farsync: switch from 'pci_' to 'dma_' API
wan: wanxl: switch from 'pci_' to 'dma_' API
hv_netvsc: do not use VF device if link is down
dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
net: macb: Properly handle phylink on at91sam9x
...
Here is the "big" set of changes to the driver core, and some drivers
using the changes, for 5.9-rc1.
"Biggest" thing in here is the device link exposure in sysfs, to help
to tame the madness that is SoC device tree representations and driver
interactions with it.
Other stuff in here that is interesting is:
- device probe log helper so that drivers can report problems in
a unified way easier.
- devres functions added
- DEVICE_ATTR_ADMIN_* macro added to make it harder to write
incorrect sysfs file permissions
- documentation cleanups
- ability for debugfs to be present in the kernel, yet not
exposed to userspace. Needed for systems that want it
enabled, but do not trust users, so they can still use some
kernel functions that were otherwise disabled.
- other minor fixes and cleanups
The patches outside of drivers/base/ all have acks from the respective
subsystem maintainers to go through this tree instead of theirs.
All of these have been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXylhOQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylGdACeKqxm8IIDZycj0QjLUlPiEwVIROgAnjpf5jAB
mb4jMvgEGsB6/FwxypPG
=RUss
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the "big" set of changes to the driver core, and some drivers
using the changes, for 5.9-rc1.
"Biggest" thing in here is the device link exposure in sysfs, to help
to tame the madness that is SoC device tree representations and driver
interactions with it.
Other stuff in here that is interesting is:
- device probe log helper so that drivers can report problems in a
unified way easier.
- devres functions added
- DEVICE_ATTR_ADMIN_* macro added to make it harder to write
incorrect sysfs file permissions
- documentation cleanups
- ability for debugfs to be present in the kernel, yet not exposed to
userspace. Needed for systems that want it enabled, but do not
trust users, so they can still use some kernel functions that were
otherwise disabled.
- other minor fixes and cleanups
The patches outside of drivers/base/ all have acks from the respective
subsystem maintainers to go through this tree instead of theirs.
All of these have been in linux-next with no reported issues"
* tag 'driver-core-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (39 commits)
drm/bridge: lvds-codec: simplify error handling
drm/bridge/sii8620: fix resource acquisition error handling
driver core: add deferring probe reason to devices_deferred property
driver core: add device probe log helper
driver core: Avoid binding drivers to dead devices
Revert "test_firmware: Test platform fw loading on non-EFI systems"
firmware_loader: EFI firmware loader must handle pre-allocated buffer
selftest/firmware: Add selftest timeout in settings
test_firmware: Test platform fw loading on non-EFI systems
driver core: Change delimiter in devlink device's name to "--"
debugfs: Add access restriction option
tracefs: Remove unnecessary debug_fs checks.
driver core: Fix probe_count imbalance in really_probe()
kobject: remove unused KOBJ_MAX action
driver core: Fix sleeping in invalid context during device link deletion
driver core: Add waiting_for_supplier sysfs file for devices
driver core: Add state_synced sysfs file for devices that support it
driver core: Expose device link details in sysfs
driver core: Drop mention of obsolete bus rwsem from kernel-doc
debugfs: file: Remove unnecessary cast in kfree()
...
Failing to invalid the page cache means data in incoherent, which is
a very bad state for the system. Always fall back to buffered I/O
through the page cache if we can't invalidate mappings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu> # for ext4
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> # for gfs2
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
This is what the classic fs/direct-io.c implementation and thuse other
file systems use.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The historic requirement for XFS to invalidate cached pages on
direct IO reads has been lost in the twisty pages of history - it was
inherited from Irix, which implemented page cache invalidation on
read as a method of working around problems synchronising page
cache state with uncached IO.
XFS has carried this ever since. In the initial linux ports it was
necessary to get mmap and DIO to play "ok" together and not
immediately corrupt data. This was the state of play until the linux
kernel had infrastructure to track unwritten extents and synchronise
page faults with allocations and unwritten extent conversions
(->page_mkwrite infrastructure). IOws, the page cache invalidation
on DIO read was necessary to prevent trivial data corruptions. This
didn't solve all the problems, though.
There were peformance problems if we didn't invalidate the entire
page cache over the file on read - we couldn't easily determine if
the cached pages were over the range of the IO, and invalidation
required taking a serialising lock (i_mutex) on the inode. This
serialising lock was an issue for XFS, as it was the only exclusive
lock in the direct Io read path.
Hence if there were any cached pages, we'd just invalidate the
entire file in one go so that subsequent IOs didn't need to take the
serialising lock. This was a problem that prevented ranged
invalidation from being particularly useful for avoiding the
remaining coherency issues. This was solved with the conversion of
i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
use i_rwsem. Hence we could now just do ranged invalidation and the
performance problem went away.
However, page cache invalidation was still needed to serialise
sub-page/sub-block zeroing via direct IO against buffered IO because
bufferhead state attached to the cached page could get out of whack
when direct IOs were issued. We've removed bufferheads from the
XFS code, and we don't carry any extent state on the cached pages
anymore, and so this problem has gone away, too.
IOWs, it would appear that we don't have any good reason to be
invalidating the page cache on DIO reads anymore. Hence remove the
invalidation on read because it is unnecessary overhead,
not needed to maintain coherency between mmap/buffered access and
direct IO anymore, and prevents anyone from using direct IO reads
from intentionally invalidating the page cache of a file.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Delete repeated words in fs/xfs/.
{we, that, the, a, to, fork}
Change "it it" to "it is" in one location.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
To: linux-fsdevel@vger.kernel.org
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Most session messages contain a feature mask, but the MDS will
routinely send a REJECT message with one that is zero-length.
Commit 0fa8263367 ("ceph: fix endianness bug when handling MDS
session feature bits") fixed the decoding of the feature mask,
but failed to account for the MDS sending a zero-length feature
mask. This causes REJECT message decoding to fail.
Skip trying to decode a feature mask if the word count is zero.
Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46823
Fixes: 0fa8263367 ("ceph: fix endianness bug when handling MDS session feature bits")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ensure we correctly report the stateid and status in the layoutreturn on
close tracepoint.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
while to come. Changes include:
- Some new Chinese translations
- Progress on the battle against double words words and non-HTTPS URLs
- Some block-mq documentation
- More RST conversions from Mauro. At this point, that task is
essentially complete, so we shouldn't see this kind of churn again for a
while. Unless we decide to switch to asciidoc or something...:)
- Lots of typo fixes, warning fixes, and more.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl8oVkwPHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5YoW8H/jJ/xnXFn7tkgVPQAlL3k5HCnK7A5nDP9RVR
cg1pTx1cEFdjzxPlJyExU6/v+AImOvtweHXC+JDK7YcJ6XFUNYXJI3LxL5KwUXbY
BL/xRFszDSXH2C7SJF5GECcFYp01e/FWSLN3yWAh+g+XwsKiTJ8q9+CoIDkHfPGO
7oQsHKFu6s36Af0LfSgxk4sVB7EJbo8e4psuPsP5SUrl+oXRO43Put0rXkR4yJoH
9oOaB51Do5fZp8I4JVAqGXvpXoExyLMO4yw0mASm6YSZ3KyjR8Fae+HD9Cq4ZuwY
0uzb9K+9NEhqbfwtyBsi99S64/6Zo/MonwKwevZuhtsDTK4l4iU=
=JQLZ
-----END PGP SIGNATURE-----
Merge tag 'docs-5.9' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"It's been a busy cycle for documentation - hopefully the busiest for a
while to come. Changes include:
- Some new Chinese translations
- Progress on the battle against double words words and non-HTTPS
URLs
- Some block-mq documentation
- More RST conversions from Mauro. At this point, that task is
essentially complete, so we shouldn't see this kind of churn again
for a while. Unless we decide to switch to asciidoc or
something...:)
- Lots of typo fixes, warning fixes, and more"
* tag 'docs-5.9' of git://git.lwn.net/linux: (195 commits)
scripts/kernel-doc: optionally treat warnings as errors
docs: ia64: correct typo
mailmap: add entry for <alobakin@marvell.com>
doc/zh_CN: add cpu-load Chinese version
Documentation/admin-guide: tainted-kernels: fix spelling mistake
MAINTAINERS: adjust kprobes.rst entry to new location
devices.txt: document rfkill allocation
PCI: correct flag name
docs: filesystems: vfs: correct flag name
docs: filesystems: vfs: correct sync_mode flag names
docs: path-lookup: markup fixes for emphasis
docs: path-lookup: more markup fixes
docs: path-lookup: fix HTML entity mojibake
CREDITS: Replace HTTP links with HTTPS ones
docs: process: Add an example for creating a fixes tag
doc/zh_CN: add Chinese translation prefer section
doc/zh_CN: add clearing-warn-once Chinese version
doc/zh_CN: add admin-guide index
doc:it_IT: process: coding-style.rst: Correct __maybe_unused compiler label
futex: MAINTAINERS: Re-add selftests directory
...
The NFS_CONTEXT_ERROR_WRITE flag (as well as the check of said flag) was
removed by commit 6fbda89b25. The absence of an error check allows
writes to be continually queued up for a server that may no longer be
able to handle them. Fix it by adding an error check using the generic
error reporting functions.
Fixes: 6fbda89b25 ("NFS: Replace custom error reporting mechanism with generic one")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add a simple helper to grab a reference to a file and install it at
the next available fd, and switch the early init code over to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXygcpgAKCRCRxhvAZXjc
ogPeAQDv1ncqtNroFAC4pJ4tQhH7JSjW0OltiMk/AocY/J2SdQD9GJ15luYJ0/om
697q/Z68sndRynhdoZlMuf3oYuBlHQw=
=3ZhE
-----END PGP SIGNATURE-----
Merge tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull close_range() implementation from Christian Brauner:
"This adds the close_range() syscall. It allows to efficiently close a
range of file descriptors up to all file descriptors of a calling
task.
This is coordinated with the FreeBSD folks which have copied our
version of this syscall and in the meantime have already merged it in
April 2019:
https://reviews.freebsd.org/D21627https://svnweb.freebsd.org/base?view=revision&revision=359836
The syscall originally came up in a discussion around the new mount
API and making new file descriptor types cloexec by default. During
this discussion, Al suggested the close_range() syscall.
First, it helps to close all file descriptors of an exec()ing task.
This can be done safely via (quoting Al's example from [1] verbatim):
/* that exec is sensitive */
unshare(CLONE_FILES);
/* we don't want anything past stderr here */
close_range(3, ~0U);
execve(....);
The code snippet above is one way of working around the problem that
file descriptors are not cloexec by default. This is aggravated by the
fact that we can't just switch them over without massively regressing
userspace. For a whole class of programs having an in-kernel method of
closing all file descriptors is very helpful (e.g. demons, service
managers, programming language standard libraries, container managers
etc.).
Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on
each file descriptor and other hacks. From looking at various
large(ish) userspace code bases this or similar patterns are very
common in service managers, container runtimes, and programming
language runtimes/standard libraries such as Python or Rust.
In addition, the syscall will also work for tasks that do not have
procfs mounted and on kernels that do not have procfs support compiled
in. In such situations the only way to make sure that all file
descriptors are closed is to call close() on each file descriptor up
to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery.
Based on Linus' suggestion close_range() also comes with a new flag
CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping
right before exec. This would usually be expressed in the sequence:
unshare(CLONE_FILES);
close_range(3, ~0U);
as pointed out by Linus it might be desirable to have this be a part
of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which
gets especially handy when we're closing all file descriptors above a
certain threshold.
Test-suite as always included"
* tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
tests: add CLOSE_RANGE_UNSHARE tests
close_range: add CLOSE_RANGE_UNSHARE
tests: add close_range() tests
arch: wire-up close_range()
open: add close_range()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXygegQAKCRCRxhvAZXjc
olWZAQCMPbhI/20LA3OYJ6s+BgBEnm89PymvlHcym6Z4AvTungD+KqZonIYuxWgi
6Ttlv/fzgFFbXgJgbuass5mwFVoN5wM=
=oK7d
-----END PGP SIGNATURE-----
Merge tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull checkpoint-restore updates from Christian Brauner:
"This enables unprivileged checkpoint/restore of processes.
Given that this work has been going on for quite some time the first
sentence in this summary is hopefully more exciting than the actual
final code changes required. Unprivileged checkpoint/restore has seen
a frequent increase in interest over the last two years and has thus
been one of the main topics for the combined containers &
checkpoint/restore microconference since at least 2018 (cf. [1]).
Here are just the three most frequent use-cases that were brought forward:
- The JVM developers are integrating checkpoint/restore into a Java
VM to significantly decrease the startup time.
- In high-performance computing environment a resource manager will
typically be distributing jobs where users are always running as
non-root. Long-running and "large" processes with significant
startup times are supposed to be checkpointed and restored with
CRIU.
- Container migration as a non-root user.
In all of these scenarios it is either desirable or required to run
without CAP_SYS_ADMIN. The userspace implementation of
checkpoint/restore CRIU already has the pull request for supporting
unprivileged checkpoint/restore up (cf. [2]).
To enable unprivileged checkpoint/restore a new dedicated capability
CAP_CHECKPOINT_RESTORE is introduced. This solution has last been
discussed in 2019 in a talk by Google at Linux Plumbers (cf. [1]
"Update on Task Migration at Google Using CRIU") with Adrian and
Nicolas providing the implementation now over the last months. In
essence, this allows the CRIU binary to be installed with the
CAP_CHECKPOINT_RESTORE vfs capability set thereby enabling
unprivileged users to restore processes.
To make this possible the following permissions are altered:
- Selecting a specific PID via clone3() set_tid relaxed from userns
CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.
- Selecting a specific PID via /proc/sys/kernel/ns_last_pid relaxed
from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.
- Accessing /proc/pid/map_files relaxed from init userns
CAP_SYS_ADMIN to init userns CAP_CHECKPOINT_RESTORE.
- Changing /proc/self/exe from userns CAP_SYS_ADMIN to userns
CAP_CHECKPOINT_RESTORE.
Of these four changes the /proc/self/exe change deserves a few words
because the reasoning behind even restricting /proc/self/exe changes
in the first place is just full of historical quirks and tracking this
down was a questionable version of fun that I'd like to spare others.
In short, it is trivial to change /proc/self/exe as an unprivileged
user, i.e. without userns CAP_SYS_ADMIN right now. Either via ptrace()
or by simply intercepting the elf loader in userspace during exec.
Nicolas was nice enough to even provide a POC for the latter (cf. [3])
to illustrate this fact.
The original patchset which introduced PR_SET_MM_MAP had no
permissions around changing the exe link. They too argued that it is
trivial to spoof the exe link already which is true. The argument
brought up against this was that the Tomoyo LSM uses the exe link in
tomoyo_manager() to detect whether the calling process is a policy
manager. This caused changing the exe links to be guarded by userns
CAP_SYS_ADMIN.
All in all this rather seems like a "better guard it with something
rather than nothing" argument which imho doesn't qualify as a great
security policy. Again, because spoofing the exe link is possible for
the calling process so even if this were security relevant it was
broken back then and would be broken today. So technically, dropping
all permissions around changing the exe link would probably be
possible and would send a clearer message to any userspace that relies
on /proc/self/exe for security reasons that they should stop doing
this but for now we're only relaxing the exe link permissions from
userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.
There's a final uapi change in here. Changing the exe link used to
accidently return EINVAL when the caller lacked the necessary
permissions instead of the more correct EPERM. This pr contains a
commit fixing this. I assume that userspace won't notice or care and
if they do I will revert this commit. But since we are changing the
permissions anyway it seems like a good opportunity to try this fix.
With these changes merged unprivileged checkpoint/restore will be
possible and has already been tested by various users"
[1] LPC 2018
1. "Task Migration at Google Using CRIU"
https://www.youtube.com/watch?v=yI_1cuhoDgA&t=12095
2. "Securely Migrating Untrusted Workloads with CRIU"
https://www.youtube.com/watch?v=yI_1cuhoDgA&t=14400
LPC 2019
1. "CRIU and the PID dance"
https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2m48s
2. "Update on Task Migration at Google Using CRIU"
https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h2m8s
[2] https://github.com/checkpoint-restore/criu/pull/1155
[3] https://github.com/nviennot/run_as_exe
* tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests: add clone3() CAP_CHECKPOINT_RESTORE test
prctl: exe link permission error changed from -EINVAL to -EPERM
prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
pid: use checkpoint_restore_ns_capable() for set_tid
capabilities: Introduce CAP_CHECKPOINT_RESTORE
Pull execve updates from Eric Biederman:
"During the development of v5.7 I ran into bugs and quality of
implementation issues related to exec that could not be easily fixed
because of the way exec is implemented. So I have been diggin into
exec and cleaning up what I can.
This cycle I have been looking at different ideas and different
implementations to see what is possible to improve exec, and cleaning
the way exec interfaces with in kernel users. Only cleaning up the
interfaces of exec with rest of the kernel has managed to stabalize
and make it through review in time for v5.9-rc1 resulting in 2 sets of
changes this cycle.
- Implement kernel_execve
- Make the user mode driver code a better citizen
With kernel_execve the code size got a little larger as the copying of
parameters from userspace and copying of parameters from userspace is
now separate. The good news is kernel threads no longer need to play
games with set_fs to use exec. Which when combined with the rest of
Christophs set_fs changes should security bugs with set_fs much more
difficult"
* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
exec: Implement kernel_execve
exec: Factor bprm_stack_limits out of prepare_arg_pages
exec: Factor bprm_execve out of do_execve_common
exec: Move bprm_mm_init into alloc_bprm
exec: Move initialization of bprm->filename into alloc_bprm
exec: Factor out alloc_bprm
exec: Remove unnecessary spaces from binfmts.h
umd: Stop using split_argv
umd: Remove exit_umh
bpfilter: Take advantage of the facilities of struct pid
exit: Factor thread_group_exited out of pidfd_poll
umd: Track user space drivers with struct pid
bpfilter: Move bpfilter_umh back into init data
exec: Remove do_execve_file
umh: Stop calling do_execve_file
umd: Transform fork_usermode_blob into fork_usermode_driver
umd: Rename umd_info.cmdline umd_info.driver_name
umd: For clarity rename umh_info umd_info
umh: Separate the user mode driver and the user mode helper support
umh: Remove call_usermodehelper_setup_file.
...
- Improved selftest coverage, timeouts, and reporting
- Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)
- Refactor __scm_install_fd() into __receive_fd() and fix buggy callers
- Introduce "addfd" command for SECCOMP_RET_USER_NOTIF (Sargun Dhillon)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oZcQWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJomDD/4x3j7eXREcXDsHOmlgEaHWGx4l
JldHFQhV5GjmD7gOkPcoZSG7NfG7F6VpwAJg7ZoR3qUkem7K8DFucxqgo1RldCot
nigleeLX6JeMS0Z+iwjAVZd+5t4xG4J/7GGDHIIMiG5qvwJ0Yf64o1bkjaB2Q/Bv
tluBg0WF32kFMG/ZwyY/V2QDbbue97CFPflybOh1o2nWbVzmUlFEEum3UUvZsxc8
smMsattJyuAV7kcEKzKrs8b010NdFZqwdbub5Np9W3XEXGBYMdIPoNsOQGmB9wby
j2ui0lzboXRG997jM7TCd1l/XZAv8aAwvPplw3FJRybzkOGs9NDyLMoz87yJpR1T
xp511vnMyMbyKIGdungkt7cIyzaictHwaYzznsmuNdCPEjTaIQJr1ctsa4GEgtqf
pnkktZ9YbMCcHU0CtZ8GlOVqA9wE+FUm0/u0zgikzJQsB+HcNItiARTTTHRyco7p
VJCqK8o4Zx4ELV7QNkSH4nhFkVgRopvrvBiPAGro/qwGOofBg8W8wM8O1+V/MDmp
zSU22v4SncT1Xb7dtmdJqDEeHfDikhaCAb4Je2hsGQWzbdAqwHGlpa7vpk9x3Q5r
L+XyP+Z+rPHlXYyypJwUvvOQhXOmP0zYxcEHxByqIBfXiwy+3dN4tDDfatWbccwl
uTlTDM8kmQn6QzSztA==
=yb55
-----END PGP SIGNATURE-----
Merge tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
"There are a bunch of clean ups and selftest improvements along with
two major updates to the SECCOMP_RET_USER_NOTIF filter return:
EPOLLHUP support to more easily detect the death of a monitored
process, and being able to inject fds when intercepting syscalls that
expect an fd-opening side-effect (needed by both container folks and
Chrome). The latter continued the refactoring of __scm_install_fd()
started by Christoph, and in the process found and fixed a handful of
bugs in various callers.
- Improved selftest coverage, timeouts, and reporting
- Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)
- Refactor __scm_install_fd() into __receive_fd() and fix buggy
callers
- Introduce 'addfd' command for SECCOMP_RET_USER_NOTIF (Sargun
Dhillon)"
* tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits)
selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
seccomp: Introduce addfd ioctl to seccomp user notifier
fs: Expand __receive_fd() to accept existing fd
pidfd: Replace open-coded receive_fd()
fs: Add receive_fd() wrapper for __receive_fd()
fs: Move __scm_install_fd() to __receive_fd()
net/scm: Regularize compat handling of scm_detach_fds()
pidfd: Add missing sock updates for pidfd_getfd()
net/compat: Add missing sock updates for SCM_RIGHTS
selftests/seccomp: Check ENOSYS under tracing
selftests/seccomp: Refactor to use fixture variants
selftests/harness: Clean up kern-doc for fixtures
seccomp: Use -1 marker for end of mode 1 syscall list
seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
selftests/seccomp: Make kcmp() less required
seccomp: Use pr_fmt
selftests/seccomp: Improve calibration loop
selftests/seccomp: use 90s as timeout
selftests/seccomp: Expand benchmark to per-filter measurements
...
- Fix linking when crypto API disabled (Matteo Croce)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oWy4WHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJtKnD/9k74pkcsl/xeB4KR4XRxmdKxwq
1wCfhrd7PRoUu1it7rrhGeAxFIyWfu+WTZKRqce5OwKgqx8BF4T1dpCuyOqHSMSz
O5UyRHiPl43EJHs7jvOVR7V5cjQx57SeaHxxtV/PGofNUTFqLVa0w9Pxh5Ma4nfT
R8j71qXOceHiwU/roHY+52vvwIMiixrgKmFfQb5klmoAQsUGXMiZYPkoelA7P4+S
M7OUL/GfLBFLH2IYbCEB4YBhX127PJIL74jIOpdvT/KsAFep4PCOD0a2qPH+FrF5
DVlF2BbGpbPg+uvFWu6gU6AZC/S+D7ZnV4cDVvg4ZNknJS8XEbZNIM1EgLPbKy0C
GzblUZdlA6KyEwIh9oAMiL9zjL3dianLq/mlSi8kKdiFmI2zNmwhQzW+xFgoEbRS
fgGG9wcAejM5X9YqHs2BO5TLvJ7qBLHzaQceEL2Z6ZNIm7rU4grX6HD7MD93SMOu
BA9O6tliYEDApSBNFLUKAlE8CGlwKjFdwgzbw7malm254uVQrCeICWVdo+caKFZr
JXjEgls2gYNM7oAME1MUksy5xzwqLjXmSWJXVCud+CjYaKoyAM5Po97sQr4aYXcu
F8zUdFGoy148IFi59xYZZKczLlyItCEr9OpCDA78V6MNiup0in4LhAERSPC8Ljw+
5LpiI0IIJFshcGhriA==
=gGdN
-----END PGP SIGNATURE-----
Merge tag 'pstore-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore update from Kees Cook:
"A tiny pstore update which fixes a very corner-case build failure:
- Fix linking when crypto API disabled (Matteo Croce)"
* tag 'pstore-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore: Fix linking when crypto API disabled
The variable ret is guaranteed to be 0 in this if (). So we can remove
this assignement.
Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
When doing some tests with multiple mds, we were seeing many mds
forwarding requests between them, causing clients to resend.
If the request is a modification operation and the mode is set to
USE_AUTH_MDS, then the auth mds should be selected to handle the
request. If auth mds for frag is already set, then it should be returned
directly without further processing.
The current logic is wrong because it only returns directly if
mode is USE_AUTH_MDS, but we want to do that for all modes. If we don't,
then when the frag's mds is not equal to cap session's mds, the request
will get sent to the wrong MDS needlessly.
Drop the mode check in this condition.
Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When doing some testing recently, I hit some page allocation failures
on mount, when creating the wb_pagevec_pool for the mount. That
requires 128k (32 contiguous pages), and after thrashing the memory
during an xfstests run, sometimes that would fail.
128k for each mount seems like a lot to hold in reserve for a rainy
day, so let's change this to a global mempool that gets allocated
when the module is plugged in.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Symlink inodes should have the security context set in their xattrs on
creation. We already set the context on creation, but we don't attach
the pagelist. The effect is that symlink inodes don't get an SELinux
context set on them at creation, so they end up unlabeled instead of
inheriting the proper context. Make it do so.
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
- Make the Energy Model cover non-CPU devices (Lukasz Luba).
- Add Ice Lake server idle states table to the intel_idle driver
and eliminate a redundant static variable from it (Chen Yu,
Rafael Wysocki).
- Eliminate all W=1 build warnings from cpufreq (Lee Jones).
- Add support for Sapphire Rapids and for Power Limit 4 to the
Intel RAPL power capping driver (Sumeet Pawnikar, Zhang Rui).
- Fix function name in kerneldoc comments in the idle_inject power
capping driver (Yangtao Li).
- Fix locking issues with cpufreq governors and drop a redundant
"weak" function definition from cpufreq (Viresh Kumar).
- Rearrange cpufreq to register non-modular governors at the
core_initcall level and allow the default cpufreq governor to
be specified in the kernel command line (Quentin Perret).
- Extend, fix and clean up the intel_pstate driver (Srinivas
Pandruvada, Rafael Wysocki):
* Add a new sysfs attribute for disabling/enabling CPU
energy-efficiency optimizations in the processor.
* Make the driver avoid enabling HWP if EPP is not supported.
* Allow the driver to handle numeric EPP values in the sysfs
interface and fix the setting of EPP via sysfs in the active
mode.
* Eliminate a static checker warning and clean up a kerneldoc
comment.
- Clean up some variable declarations in the powernv cpufreq
driver (Wei Yongjun).
- Fix up the ->enter_s2idle callback definition to cover the case
when it points to the same function as ->idle correctly (Neal
Liu).
- Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson).
- Make the PM core emit "changed" uevent when adding/removing the
"wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi).
- Add a helper macro for declaring PM callbacks and use it in the
MMC jz4740 driver (Paul Cercueil).
- Fix white space in some places in the hibernate code and make the
system-wide PM code use "const char *" where appropriate (Xiang
Chen, Alexey Dobriyan).
- Add one more "unsafe" helper macro to the freezer to cover the NFS
use case (He Zhe).
- Change the language in the generic PM domains framework to use
parent/child terminology and clean up a typo and some comment
fromatting in that code (Kees Cook, Geert Uytterhoeven).
- Update the operating performance points OPP framework (Lukasz
Luba, Andrew-sh.Cheng, Valdis Kletnieks):
* Refactor dev_pm_opp_of_register_em() and update related drivers.
* Add a missing function export.
* Allow disabled OPPs in dev_pm_opp_get_freq().
- Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric
Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier):
* Add support for delayed timers to the devfreq core and make the
Samsung exynos5422-dmc driver use it.
* Unify sysfs interface to use "df-" as a prefix in instance names
consistently.
* Fix devfreq_summary debugfs node indentation.
* Add the rockchip,pmu phandle to the rk3399_dmc driver DT
bindings.
* List Dmitry Osipenko as the Tegra devfreq driver maintainer.
* Fix typos in the core devfreq code.
- Update the pm-graph utility to version 5.7 including a number of
fixes related to suspend-to-idle (Todd Brandt).
- Fix coccicheck errors and warnings in the cpupower utility (Shuah
Khan).
- Replace HTTP links with HTTPs ones in multiple places (Alexander
A. Klimov).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl8oO24SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRx7ZQP/0lQ0yABnASnwomdOH6+K/m7rvc+e9FE
zx5pTDQswhU5tM7SQAIKqe0uSI+okF2UrBrT5onA16F+JUbnrbexJLazBPfVTTGF
AKpKEQ7Wh69Wz+Y6cQZjm1dTuRL+dlBJuBrzR2tLSnONPMMHuFcO3xd7lgE9UAxC
oGEf393taA6OqcUNRQIa2gqbq+k1qhKjeDucGkbOaoJ6CL0ZyWI+Tfw1WWaBBGv0
/2wBd6V513OH8WtQCW6H3YpHmhYW6OwL8w19KyGcjPRGJaeaIP4W/Ng7mkvgL5ZB
vZqg3XiufFV9uTe8W1NQaVv/NjlN256OteuK809aosTVjD0dhFkhBYg5TLu6HbQq
C/NciZ+78oLedWLT73EUfw3NyS+V0jk6X2EIlBUwNi0Qw1B1pCifGOCKzWFFe5cr
ci4xr4FG7dBkxScOxwFAU2s5TdPHLOkGkQtg4jZr0OYDrzkyLEdsnZEUjLPORo+0
6EBXGfTOSy2CBHcYswRtzJr/1pUTzj7oejhTAMCCuYW2r3VyQtnYcVjlehtp20if
6BfmGisk8nmtxlSm+/Y2FqKa4bNnSTMmr0UJQ+Rjp0tHs47QeucI0ORfZ5nPaBac
+ptvIjWmn3xejT/+oAehpH9066Iuy66vzHdnj7x5+WAsmYS8n8OFtlBFkYELmLJB
3xI5hIl7WtGo
=8cUO
-----END PGP SIGNATURE-----
Merge tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"The most significant change here is the extension of the Energy Model
to cover non-CPU devices (as well as CPUs) from Lukasz Luba.
There is also some new hardware support (Ice Lake server idle states
table for intel_idle, Sapphire Rapids and Power Limit 4 support in the
RAPL driver), some new functionality in the existing drivers (eg. a
new switch to disable/enable CPU energy-efficiency optimizations in
intel_pstate, delayed timers in devfreq), some assorted fixes (cpufreq
core, intel_pstate, intel_idle) and cleanups (eg. cpuidle-psci,
devfreq), including the elimination of W=1 build warnings from cpufreq
done by Lee Jones.
Specifics:
- Make the Energy Model cover non-CPU devices (Lukasz Luba).
- Add Ice Lake server idle states table to the intel_idle driver and
eliminate a redundant static variable from it (Chen Yu, Rafael
Wysocki).
- Eliminate all W=1 build warnings from cpufreq (Lee Jones).
- Add support for Sapphire Rapids and for Power Limit 4 to the Intel
RAPL power capping driver (Sumeet Pawnikar, Zhang Rui).
- Fix function name in kerneldoc comments in the idle_inject power
capping driver (Yangtao Li).
- Fix locking issues with cpufreq governors and drop a redundant
"weak" function definition from cpufreq (Viresh Kumar).
- Rearrange cpufreq to register non-modular governors at the
core_initcall level and allow the default cpufreq governor to be
specified in the kernel command line (Quentin Perret).
- Extend, fix and clean up the intel_pstate driver (Srinivas
Pandruvada, Rafael Wysocki):
* Add a new sysfs attribute for disabling/enabling CPU
energy-efficiency optimizations in the processor.
* Make the driver avoid enabling HWP if EPP is not supported.
* Allow the driver to handle numeric EPP values in the sysfs
interface and fix the setting of EPP via sysfs in the active
mode.
* Eliminate a static checker warning and clean up a kerneldoc
comment.
- Clean up some variable declarations in the powernv cpufreq driver
(Wei Yongjun).
- Fix up the ->enter_s2idle callback definition to cover the case
when it points to the same function as ->idle correctly (Neal Liu).
- Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson).
- Make the PM core emit "changed" uevent when adding/removing the
"wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi).
- Add a helper macro for declaring PM callbacks and use it in the MMC
jz4740 driver (Paul Cercueil).
- Fix white space in some places in the hibernate code and make the
system-wide PM code use "const char *" where appropriate (Xiang
Chen, Alexey Dobriyan).
- Add one more "unsafe" helper macro to the freezer to cover the NFS
use case (He Zhe).
- Change the language in the generic PM domains framework to use
parent/child terminology and clean up a typo and some comment
fromatting in that code (Kees Cook, Geert Uytterhoeven).
- Update the operating performance points OPP framework (Lukasz Luba,
Andrew-sh.Cheng, Valdis Kletnieks):
* Refactor dev_pm_opp_of_register_em() and update related drivers.
* Add a missing function export.
* Allow disabled OPPs in dev_pm_opp_get_freq().
- Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric
Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier):
* Add support for delayed timers to the devfreq core and make the
Samsung exynos5422-dmc driver use it.
* Unify sysfs interface to use "df-" as a prefix in instance
names consistently.
* Fix devfreq_summary debugfs node indentation.
* Add the rockchip,pmu phandle to the rk3399_dmc driver DT
bindings.
* List Dmitry Osipenko as the Tegra devfreq driver maintainer.
* Fix typos in the core devfreq code.
- Update the pm-graph utility to version 5.7 including a number of
fixes related to suspend-to-idle (Todd Brandt).
- Fix coccicheck errors and warnings in the cpupower utility (Shuah
Khan).
- Replace HTTP links with HTTPs ones in multiple places (Alexander A.
Klimov)"
* tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (71 commits)
cpuidle: ACPI: fix 'return' with no value build warning
cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
cpufreq: intel_pstate: Rearrange the storing of new EPP values
intel_idle: Customize IceLake server support
PM / devfreq: Fix the wrong end with semicolon
PM / devfreq: Fix indentaion of devfreq_summary debugfs node
PM / devfreq: Clean up the devfreq instance name in sysfs attr
memory: samsung: exynos5422-dmc: Add module param to control IRQ mode
memory: samsung: exynos5422-dmc: Adjust polling interval and uptreshold
memory: samsung: exynos5422-dmc: Use delayed timer as default
PM / devfreq: Add support delayed timer for polling mode
dt-bindings: devfreq: rk3399_dmc: Add rockchip,pmu phandle
PM / devfreq: tegra: Add Dmitry as a maintainer
PM / devfreq: event: Fix trivial spelling
PM / devfreq: rk3399_dmc: Fix kernel oops when rockchip,pmu is absent
cpuidle: change enter_s2idle() prototype
cpuidle: psci: Prevent domain idlestates until consumers are ready
cpuidle: psci: Convert PM domain to platform driver
cpuidle: psci: Fix error path via converting to a platform driver
cpuidle: psci: Fail cpuidle registration if set OSI mode failed
...
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-08-04
The following pull-request contains BPF updates for your *net-next* tree.
We've added 73 non-merge commits during the last 9 day(s) which contain
a total of 135 files changed, 4603 insertions(+), 1013 deletions(-).
The main changes are:
1) Implement bpf_link support for XDP. Also add LINK_DETACH operation for the BPF
syscall allowing processes with BPF link FD to force-detach, from Andrii Nakryiko.
2) Add BPF iterator for map elements and to iterate all BPF programs for efficient
in-kernel inspection, from Yonghong Song and Alexei Starovoitov.
3) Separate bpf_get_{stack,stackid}() helpers for perf events in BPF to avoid
unwinder errors, from Song Liu.
4) Allow cgroup local storage map to be shared between programs on the same
cgroup. Also extend BPF selftests with coverage, from YiFei Zhu.
5) Add BPF exception tables to ARM64 JIT in order to be able to JIT BPF_PROBE_MEM
load instructions, from Jean-Philippe Brucker.
6) Follow-up fixes on BPF socket lookup in combination with reuseport group
handling. Also add related BPF selftests, from Jakub Sitnicki.
7) Allow to use socket storage in BPF_PROG_TYPE_CGROUP_SOCK-typed programs for
socket create/release as well as bind functions, from Stanislav Fomichev.
8) Fix an info leak in xsk_getsockopt() when retrieving XDP stats via old struct
xdp_statistics, from Peilin Ye.
9) Fix PT_REGS_RC{,_CORE}() macros in libbpf for MIPS arch, from Jerry Crunchtime.
10) Extend BPF kernel test infra with skb->family and skb->{local,remote}_ip{4,6}
fields and allow user space to specify skb->dev via ifindex, from Dmitry Yakunin.
11) Fix a bpftool segfault due to missing program type name and make it more robust
to prevent them in future gaps, from Quentin Monnet.
12) Consolidate cgroup helper functions across selftests and fix a v6 localhost
resolver issue, from John Fastabend.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Current judgment condition of f2fs_bug_on in function update_sit_entry():
new_vblocks >> (sizeof(unsigned short) << 3) ||
new_vblocks > sbi->blocks_per_seg
which equivalents to:
new_vblocks < 0 || new_vblocks > sbi->blocks_per_seg
The latter is more intuitive.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reported-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since set/clear_inode_flag() don't need to return value to show
if flag is set, we can just call set/clear_bit() here.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When we use F2FS_IOC_RELEASE_COMPRESS_BLOCKS ioctl, if we can't find
any compressed blocks in the file even with large file size, the
ioctl just ends up without changing the file's status as immutable.
It makes the user, who expects that the file is immutable when it
returns successfully, confused.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The retry based logic here isn't easy to follow unless you're already
familiar with how io_uring does task_work based retries. Add some
comments explaining the flow a little better.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since we don't do exclusive waits or wakeups, we know that the bit is
always going to be set. Kill the test. Also see commit:
2a9127fcf2 ("mm: rewrite wait_on_page_bit_common() logic")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7asQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplrCD/0S17kio+k4cOJDGwl88WoJw+QiYmM5019k
decZ1JymQvV1HXRmlcZiEAu0hHDD0FoovSRrw7II3gw3GouETmYQM62f6ZTpDeMD
CED/fidnfULAkPaI6h+bj3jyI0cEuujG/R47rGSQEkIIr3RttqKZUzVkB9KN+KMw
+OBuXZtMIoFFEVJ91qwC2dm2qHLqOn1/5MlT59knso/xbPOYOXsFQpGiACJqF97x
6qSSI8uGE+HZqvL2OLWPDBbLEJhrq+dzCgxln5VlvLele4UcRhOdonUb7nUwEKCe
zwvtXzz16u1D1b8bJL4Kg5bGqyUAQUCSShsfBJJxh6vTTULiHyCX5sQaai1OEB16
4dpBL9E+nOUUix4wo9XBY0/KIYaPWg5L1CoEwkAXqkXPhFvNUucsC0u6KvmzZR3V
1OogVTjl6GhS8uEVQjTKNshkTIC9QHEMXDUOHtINDCb/sLU+ANXU5UpvsuzZ9+kt
KGc4mdyCwaKBq4YW9sVwhhq/RHLD4AUtWZiUVfOE+0cltCLJUNMbQsJ+XrcYaQnm
W4zz22Rep+SJuQNVcCW/w7N2zN3yB6gC1qeroSLvzw4b5el2TdFp+BcgVlLHK+uh
xjsGNCq++fyzNk7vvMZ5hVq4JGXYjza7AiP5HlQ8nqdiPUKUPatWCBqUm9i9Cz/B
n+0dlYbRwQ==
=2vmy
-----END PGP SIGNATURE-----
Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"Lots of cleanups in here, hardening the code and/or making it easier
to read and fixing bugs, but a core feature/change too adding support
for real async buffered reads. With the latter in place, we just need
buffered write async support and we're done relying on kthreads for
the fast path. In detail:
- Cleanup how memory accounting is done on ring setup/free (Bijan)
- sq array offset calculation fixup (Dmitry)
- Consistently handle blocking off O_DIRECT submission path (me)
- Support proper async buffered reads, instead of relying on kthread
offload for that. This uses the page waitqueue to drive retries
from task_work, like we handle poll based retry. (me)
- IO completion optimizations (me)
- Fix race with accounting and ring fd install (me)
- Support EPOLLEXCLUSIVE (Jiufei)
- Get rid of the io_kiocb unionizing, made possible by shrinking
other bits (Pavel)
- Completion side cleanups (Pavel)
- Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)
- Request environment grabbing cleanups (Pavel)
- File and socket read/write cleanups (Pavel)
- Improve kiocb_set_rw_flags() (Pavel)
- Tons of fixes and cleanups (Pavel)
- IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"
* tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
io_uring: flip if handling after io_setup_async_rw
fs: optimise kiocb_set_rw_flags()
io_uring: don't touch 'ctx' after installing file descriptor
io_uring: get rid of atomic FAA for cq_timeouts
io_uring: consolidate *_check_overflow accounting
io_uring: fix stalled deferred requests
io_uring: fix racy overflow count reporting
io_uring: deduplicate __io_complete_rw()
io_uring: de-unionise io_kiocb
io-wq: update hash bits
io_uring: fix missing io_queue_linked_timeout()
io_uring: mark ->work uninitialised after cleanup
io_uring: deduplicate io_grab_files() calls
io_uring: don't do opcode prep twice
io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
io_uring: batch put_task_struct()
tasks: add put_task_struct_many()
io_uring: return locked and pinned page accounting
io_uring: don't miscount pinned memory
io_uring: don't open-code recv kbuf managment
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7YwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpt+dEAC7a0HYuX2OrkyawBnsgd1QQR/soC7surec
yDDa7SMM8cOq3935bfzcYHV9FWJszEGIknchiGb9R3/T+vmSohbvDsM5zgwya9u/
FHUIuTq324I6JWXKl30k4rwjiX9wQeMt+WZ5gC8KJYCWA296i2IpJwd0A45aaKuS
x4bTjxqknE+fD4gQiMUSt+bmuOUAp81fEku3EPapCRYDPAj8f5uoY7R2arT/POwB
b+s+AtXqzBymIqx1z0sZ/XcdZKmDuhdurGCWu7BfJFIzw5kQ2Qe3W8rUmrQ3pGut
8a21YfilhUFiBv+B4wptfrzJuzU6Ps0BXHCnBsQjzvXwq5uFcZH495mM/4E4OJvh
SbjL2K4iFj+O1ngFkukG/F8tdEM1zKBYy2ZEkGoWKUpyQanbAaGI6QKKJA+DCdBi
yPEb7yRAa5KfLqMiocm1qCEO1I56HRiNHaJVMqCPOZxLmpXj19Fs71yIRplP1Trv
GGXdWZsccjuY6OljoXWdEfnxAr5zBsO3Yf2yFT95AD+egtGsU1oOzlqAaU1mtflw
ABo452pvh6FFpxGXqz6oK4VqY4Et7WgXOiljA4yIGoPpG/08L1Yle4eVc2EE01Jb
+BL49xNJVeUhGFrvUjPGl9kVMeLmubPFbmgrtipW+VRg9W8+Yirw7DPP6K+gbPAR
RzAUdZFbWw==
=abJG
-----END PGP SIGNATURE-----
Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
"Good amount of cleanups and tech debt removals in here, and as a
result, the diffstat shows a nice net reduction in code.
- Softirq completion cleanups (Christoph)
- Stop using ->queuedata (Christoph)
- Cleanup bd claiming (Christoph)
- Use check_events, moving away from the legacy media change
(Christoph)
- Use inode i_blkbits consistently (Christoph)
- Remove old unused writeback congestion bits (Christoph)
- Cleanup/unify submission path (Christoph)
- Use bio_uninit consistently, instead of bio_disassociate_blkg
(Christoph)
- sbitmap cleared bits handling (John)
- Request merging blktrace event addition (Jan)
- sysfs add/remove race fixes (Luis)
- blk-mq tag fixes/optimizations (Ming)
- Duplicate words in comments (Randy)
- Flush deferral cleanup (Yufen)
- IO context locking/retry fixes (John)
- struct_size() usage (Gustavo)
- blk-iocost fixes (Chengming)
- blk-cgroup IO stats fixes (Boris)
- Various little fixes"
* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
block: blk-timeout: delete duplicated word
block: blk-mq-sched: delete duplicated word
block: blk-mq: delete duplicated word
block: genhd: delete duplicated words
block: elevator: delete duplicated word and fix typos
block: bio: delete duplicated words
block: bfq-iosched: fix duplicated word
iocost_monitor: start from the oldest usage index
iocost: Fix check condition of iocg abs_vdebt
block: Remove callback typedefs for blk_mq_ops
block: Use non _rcu version of list functions for tag_set_list
blk-cgroup: show global disk stats in root cgroup io.stat
blk-cgroup: make iostat functions visible to stat printing
block: improve discard bio alignment in __blkdev_issue_discard()
block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
block: defer flush request no matter whether we have elevator
block: make blk_timeout_init() static
block: remove retry loop in ioc_release_fn()
block: remove unnecessary ioc nested locking
block: integrate bd_start_claiming into __blkdev_get
...
Instead of waiting in a loop for the userfaultfd condition to become
true, just wait once and return VM_FAULT_RETRY.
We've already dropped the mmap lock, we know we can't really
successfully handle the fault at this point and the caller will have to
retry anyway. So there's no point in making the wait any more
complicated than it needs to be - just schedule away.
And once you don't have that complexity with explicit looping, you can
also just lose all the 'userfaultfd_signal_pending()' complexity,
because once we've set the correct process sleeping state, and don't
loop, the act of scheduling itself will be checking if there are any
pending signals before going to sleep.
We can also drop the VM_FAULT_MAJOR games, since we'll be treating all
retried faults as major soon anyway (series to regularize and share more
of fault handling across architectures in a separate series by Peter Xu,
and in the meantime we won't worry about the possible minor - I'll be
here all week, try the veal - accounting difference).
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAl8oDgYTHGpsYXl0b25A
a2VybmVsLm9yZwAKCRAADmhBGVaCFaA9D/9HzjmL8/17DdCiFFucl9fgyIUUIlqZ
mSM9RslHQuaOAM5c5RbtbifRZbh5H/pIm930at+JxFcZBN51iwB7xAc8MYEelxIy
9i3hwZJP2mmqum3GTD4QtUcoirzjmYvGffThq9Cb/XuUaXd6S/PZZPZVVk4bChIA
TDwday9Us+5Qz+NddnDPtkZbjv/edYS+gXh5NItODiV/B38yCiRVW36vazdWhZf9
UMRz7YpUT4xijjFd06rQZb6otJSAnP9BEi/4ihYAjsPuf8aot85vLfKD9CzkdLpd
+LbBkaXfoM6pb7C2QFx1PlBB4DeTkYzR7n89kp9poy/F35SyAEvj3zf12AceVG1a
4AbyVhFz6tNea5PLKBhswvGT0Kq0LfDJh6SnH03dqgcU7LQm20OMBT7ImWb3I1/3
1TMe44auGy4Ap1XgkPNq6xMNteX/XIUJIvKJ1g0sYyLppc2jLRnyH+n+aJCFyFQo
ghDKFRUYlmsYZJmzzV17rZjfnqewrlyHf6BcA1aq7C7GbdSJ8eMmxH+UaU3AgRES
Jy693Vd7XTOFPUwOGzHRKRxQ9cFQloTQxSKF6xcigBcKZE1xVZGarR8s4mRlsIU9
oqx50d37nVRVbLtC0OK2ZwD6hvtt9z4v0xM8ahF9n0XDkxnAwi7Hs3XhAvArUPnF
QLPVFaBbWDxwMQ==
=7CeF
-----END PGP SIGNATURE-----
Merge tag 'filelock-v5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking fix from Jeff Layton:
"Just a single, one-line patch to fix an inefficiency in the posix
locking code that can lead to it doing more wakeups than necessary"
* tag 'filelock-v5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: add locks_move_blocks in posix_lock_inode
If CONFIG_F2FS_FS_COMPRESSION is off, don't allow to configure or
show compression related mount option.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_read_multi_pages(), we don't have to check cluster's type
again, since overwrite or partial truncation need page lock in
cluster which has already been held by reader, so cluster's type
is stable, let's change check condition to sanity check.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Because fsverity_descriptor_location.version is constant,
so use macro for better reading.
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Function parameter mode could be TRANS_DIR_INO.
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
One fix for fs/verity/ to strengthen a memory barrier which might be too
weak. This mirrors a similar fix in fs/crypto/.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXyezbRQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK3geAQCT35f0xoQkOGLZVqHqlymI1otozKGP
N+arximQuWK2WAD/cKgth+/mJUBE2Ygcfef7hnFYD3maK2P6pzW1Q+GREAc=
=FeLN
-----END PGP SIGNATURE-----
Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fsverity update from Eric Biggers:
"One fix for fs/verity/ to strengthen a memory barrier which might be
too weak. This mirrors a similar fix in fs/crypto/"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fs-verity: use smp_load_acquire() for ->i_verity_info
This release, we add support for inline encryption via the blk-crypto
framework which was added in 5.8. Now when an ext4 or f2fs filesystem
is mounted with '-o inlinecrypt', the contents of encrypted files will
be encrypted/decrypted via blk-crypto, instead of directly using the
crypto API. This model allows taking advantage of the inline encryption
hardware that is integrated into the UFS or eMMC host controllers on
most mobile SoCs. Note that this is just an alternate implementation;
the ciphertext written to disk stays the same.
(This pull request does *not* include support for direct I/O on
encrypted files, which blk-crypto makes possible, since that part is
still being discussed.)
Besides the above feature update, there are also a few fixes and
cleanups, e.g. strengthening some memory barriers that may be too weak.
All these patches have been in linux-next with no reported issues. I've
also tested them with the fscrypt xfstests, as usual. It's also been
tested that the inline encryption support works with the support for
Qualcomm and Mediatek inline encryption hardware that will be in the
scsi pull request for 5.9. Also, several SoC vendors are already using
a previous, functionally equivalent version of these patches.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXye2EBQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK0veAQCKEnwvy+M6s2/QWhC9vo01rABMtt7h
VRAAKPiFzLNH3AD/dCnZNsFUzk3x0ZyiU1YRW3FvlxFOaEO7Ea0Pt/pyyQ0=
=g9FK
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt updates from Eric Biggers:
"This release, we add support for inline encryption via the blk-crypto
framework which was added in 5.8.
Now when an ext4 or f2fs filesystem is mounted with '-o inlinecrypt',
the contents of encrypted files will be encrypted/decrypted via
blk-crypto, instead of directly using the crypto API. This model
allows taking advantage of the inline encryption hardware that is
integrated into the UFS or eMMC host controllers on most mobile SoCs.
Note that this is just an alternate implementation; the ciphertext
written to disk stays the same.
(This pull request does *not* include support for direct I/O on
encrypted files, which blk-crypto makes possible, since that part is
still being discussed.)
Besides the above feature update, there are also a few fixes and
cleanups, e.g. strengthening some memory barriers that may be too
weak.
All these patches have been in linux-next with no reported issues.
I've also tested them with the fscrypt xfstests, as usual. It's also
been tested that the inline encryption support works with the support
for Qualcomm and Mediatek inline encryption hardware that will be in
the scsi pull request for 5.9. Also, several SoC vendors are already
using a previous, functionally equivalent version of these patches"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: don't load ->i_crypt_info before it's known to be valid
fscrypt: document inline encryption support
fscrypt: use smp_load_acquire() for ->i_crypt_info
fscrypt: use smp_load_acquire() for ->s_master_keys
fscrypt: use smp_load_acquire() for fscrypt_prepared_key
fscrypt: switch fscrypt_do_sha256() to use the SHA-256 library
fscrypt: restrict IV_INO_LBLK_* to AES-256-XTS
fscrypt: rename FS_KEY_DERIVATION_NONCE_SIZE
fscrypt: add comments that describe the HKDF info strings
ext4: add inline encryption support
f2fs: add inline encryption support
fscrypt: add inline encryption support
fs: introduce SB_INLINECRYPT
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl8jAKwACgkQxWXV+ddt
WDtFvQ/7BIMM6cn+k/LoiK6cTTpq9DKTMoK64XzXJsOiY4ey6pXE0iSyVyn3rC6k
C+wAafdd7UPGPnI5z7L1lJOI7cE/X3PmADmAWB6WhARp19B2SmKfkF+jFAr+T4dE
OZw5lNqHSGv/aByBq8qegrAhWjpRR3VZtCCGW5KvN/strx7MC7t9wFZAB0zIsdKX
aK37VKYhoc+MOF1ikUDn4lRSIjqQYJetjvgC6Yt9dLfx+5oLOK8tpm1XkifN/1xs
HrRR9EpDTKlfJFDee1O+0gof6cKWTqFsbup1EFTrDbkA11zx8r6itBGY5G8P3zMh
JCsVOOJeDLecp1cz1ZWFpyBgrEAN7uHTY0hZbCZgN/dKbSKmv51iujdXB+dDOtxF
cSPywc0NxmftvBbweInwBfsA54BHI0XxCCA0U1yA8xgxPmBE15t81b7F56zmCRke
mSJxAP1dcX8gmL3mzEOUUuKkVbFJ0lIMi2YVkM1lud8Vn4xaWU9HzXlzEvkh7At0
tqlb+LHzaxxVU2m6/6W/KEuiXW1S7/q4nX87wvyMLnylHAaSlA+UtAp3t1q92rdJ
3VGzyvbgBRT2H+22DgCkrPTRlhOifeeuXT3nOwehY4AVkENYQrENb7FmqvppCEtl
v7yTBxxe4zPEjc8dm7o9RBYaVESVFXVQtpCHwz0D+p+adzIYmVM=
=HNGC
-----END PGP SIGNATURE-----
Merge tag 'for-5.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"We don't have any big feature updates this time, there are lots of
small enhacements or fixes. A highlight perhaps is the parallel fsync
performance improvements, numbers below.
Regarding the dio/iomap that was reverted last time, the required API
changes are likely to land in the upcoming cycle, the btrfs part will
be updated afterwards.
User visible changes:
- new mount option rescue= to group all recovery-related mount
options so we don't have many specific options, currently
introducing only aliases for existing options, future extensions
are in development to allow read-only mount with partially damaged
structures:
- usebackuproot is an alias for rescue=usebackuproot
- nologreplay is an alias for rescue=nologreplay
- start deprecation of mount option inode_cache, removal scheduled to
v5.11
- removed deprecated mount options alloc_start and subvolrootid
- device stats corruption counter gets incremented when a checksum
mismatch is found
- qgroup information exported in /sys/fs/btrfs/<UUID>/qgroups/<id>
using sysfs
- add link /sys/fs/btrfs/<UUID>/bdi pointing to the associated
backing dev info
- FS_INFO ioctl enhancements:
- add flags to request/describe newly added items
- new item: numeric checksum type and checksum size
- new item: generation
- new item: metadata_uuid
- seed device: with one new read-write device added, print the new
device information in /proc/mounts
- balance: detect cancellation by Ctrl-C in existing cancellation
points
Performance improvements:
- optimized versions of various helpers on little-endian
architectures, where we don't have to do LE/BE conversion from
on-disk format
- tree-log/fsync optimizations leading to lower max latency reported
by dbench, reduced by about 12%
- all chunk tree leaves are prefetched at mount time, can improve
mount time on large (terabyte-sized) filesystems
- speed up parallel fsync of files with reflinked/deduped extents,
with jobs 16 to 1024 the throughput gets improved roughly by 50% on
average and runtime decreased roughly by 30% on average, notable
outlier is 128 jobs with +121.2% on throughput and -54.6% runtime
- another speed up of parallel fsync, reduce number of checksum tree
lookups and contention, the improvements start to show up with 2
tasks with +20% throughput and -16% runtime up to 64 with +200%
throughput and -66% runtime
Core:
- umount-time qgroup leak checker
- qgroups
- add a way to unreserve partial range after failure, avoiding
some EDQUOT errors
- improved flushing logic when EDQUOT is hit
- possible EINTR interruption caused by failed reservations after
transaction start is better handled and documented
- transaction abort errors are unified to EROFS in case it's not the
original reason of abort or we don't have other way to determine
the reason
Fixes:
- make truncate succeed on a NOCOW file even if data space is
exhausted
- fix cancelling balance on filesystem with exhausted metadata space
- anon block device:
- preallocate anon bdev when subvolume is created to report
failure early
- shorten time the anon bdev id is allocated
- don't allocate anon bdev for internal roots
- minor memory leak in ref-verify
- refuse invalid combinations of compression and NOCOW file flags
- lockdep fixes, updating the device locks
- remove obsolete fallback logic for block group profile adjustments
when switching from 1 to more devices, causing allocation of
unwanted block groups
Other cleanups, refactoring, simplifications:
- conversions from struct inode to struct btrfs_inode in internal
functions
- removal of unused struct members"
* tag 'for-5.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (151 commits)
btrfs: do not set the full sync flag on the inode during page release
btrfs: release old extent maps during page release
btrfs: fix race between page release and a fast fsync
btrfs: open-code remount flag setting in btrfs_remount
btrfs: if we're restriping, use the target restripe profile
btrfs: don't adjust bg flags and use default allocation profiles
btrfs: fix lockdep splat from btrfs_dump_space_info
btrfs: move the chunk_mutex in btrfs_read_chunk_tree
btrfs: open device without device_list_mutex
btrfs: sysfs: use NOFS for device creation
btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases
btrfs: document special case error codes for fs errors
btrfs: don't WARN if we abort a transaction with EROFS
btrfs: reduce contention on log trees when logging checksums
btrfs: remove done label in writepage_delalloc
btrfs: add comments for btrfs_reserve_flush_enum
btrfs: relocation: review the call sites which can be interrupted by signal
btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree
btrfs: relocation: allow signal to cancel balance
btrfs: raid56: remove out label in __raid56_parity_recover
...
It's expected that erofs_workgroup_unfreeze_final() won't
be used in other places. Let's fold it to simplify the code.
Link: https://lore.kernel.org/r/20200729180235.25443-1-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Each ondisk inode should be aligned with inode slot boundary
(32-byte alignment) because of nid calculation formula, so all
compact inodes (32 byte) cannot across page boundary. However,
extended inode is now 64-byte form, which can across page boundary
in principle if the location is specified on purpose, although
it's hard to be generated by mkfs due to the allocation policy
and rarely used by Android use case now mainly for > 4GiB files.
For now, only two fields `i_ctime_nsec` and `i_nlink' couldn't
be read from disk properly and cause out-of-bound memory read
with random value.
Let's fix now.
Fixes: 431339ba90 ("staging: erofs: add inode operations")
Cc: <stable@vger.kernel.org> # 4.19+
Link: https://lore.kernel.org/r/20200729175801.GA23973@xiangao.remote.csb
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Link: https://lore.kernel.org/r/20200713130944.34419-1-grandmaster@al2klimov.de
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
In gfs2_glock_poke, make sure gfs2_holder_uninit is called on the local
glock holder. Without that, we're leaking a glock and a pid reference.
Fixes: 9e8990dea9 ("gfs2: Smarter iopen glock waiting")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Pass a pointer to the existing glock holder from
gfs2_file_{read,write}_iter to gfs2_file_direct_{read,write}
to save some stack space.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, three flags were not represented in the glock output.
This patch adds them in:
c - GLF_INODE_CREATING
P - GLF_PENDING_DELETE
x - GLF_FREEING (both f and F are already used)
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
* pm-sleep:
PM: sleep: spread "const char *" correctness
PM: hibernate: fix white space in a few places
freezer: Add unsafe version of freezable_schedule_timeout_interruptible() for NFS
PM: sleep: core: Emit changed uevent on wakeup_sysfs_add/remove
* pm-domains:
PM: domains: Restore comment indentation for generic_pm_domain.child_links
PM: domains: Fix up terminology with parent/child
* powercap:
powercap: Add Power Limit4 support
powercap: idle_inject: Replace play_idle() with play_idle_precise() in comments
powercap: intel_rapl: add support for Sapphire Rapids
* pm-tools:
pm-graph v5.7 - important s2idle fixes
cpupower: Replace HTTP links with HTTPS ones
cpupower: Fix NULL but dereferenced coccicheck errors
cpupower: Fix comparing pointer to 0 coccicheck warns
The variable mds is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the ceph_mdsc_init() fails, it will free the mdsc already.
Reported-by: syzbot+b57f46d8d6ea51960b8c@syzkaller.appspotmail.com
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fix build warnings:
fs/ceph/mdsmap.c: In function ‘ceph_mdsmap_decode’:
fs/ceph/mdsmap.c:192:7: warning: variable ‘info_cv’ set but not used [-Wunused-but-set-variable]
fs/ceph/mdsmap.c:177:7: warning: variable ‘state_seq’ set but not used [-Wunused-but-set-variable]
fs/ceph/mdsmap.c:123:15: warning: variable ‘mdsmap_cv’ set but not used [-Wunused-but-set-variable]
Note that p is increased in ceph_decode_*.
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Drop duplicated words "down" and "the" in fs/ceph/.
[ idryomov: merge into a single patch ]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Send metric flags to the MDS, indicating what metrics the client
supports. Currently that consists of cap statistics, and read, write and
metadata latencies.
URL: https://tracker.ceph.com/issues/43435
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This will send the caps/read/write/metadata metrics to any available MDS
once per second, which will be the same as the userland client. It will
skip the MDS sessions which don't support the metric collection, as the
MDSs will close socket connections when they get an unknown type
message.
We can disable the metric sending via the disable_send_metrics module
parameter.
[ jlayton: fix up endianness bug in ceph_mdsc_send_metrics() ]
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the session is already in closed state, we should skip it.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
[ idryomov: Do the same for the CRUSH paper and replace
ceph.newdream.net with ceph.io. ]
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Remove unnecassary casts in the argument to kfree.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In aio case, if the completion comes very fast just before the
ceph_read_iter() returns to fs/aio.c, the kiocb will be freed in
the completion callback, then if ceph_read_iter() access again
we will potentially hit the use-after-free bug.
[ jlayton: initialize direct_lock early, and use it everywhere ]
URL: https://tracker.ceph.com/issues/45649
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Make this loop look a bit more sane. Also optimize away the spinlock
release/reacquire if we can't get an inode reference.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Make sure the delayed work stopped before releasing the resources.
cancel_delayed_work_sync() will only guarantee that the work finishes
executing if the work is already in the ->worklist. That means after
the cancel_delayed_work_sync() returns, it will leave the work requeued
if it was rearmed at the end. That can lead to a use after free once the
work struct is freed.
Fix it by flushing the delayed work instead of trying to cancel it, and
ensure that the work doesn't rearm if the mdsc is stopping.
URL: https://tracker.ceph.com/issues/46293
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
...and let the errnos bubble up to the callers.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This will help to reduce using the global mdsc->mutex lock in many
places.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
And remove the unsed mdsc parameter to simplify the code.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
cifs_mount() for DFS mounts is for a long time way too complex to
follow, mostly because it lacks some documentation, does a lot of
operations like resolving DFS roots and links, checking for path
components, perform failover, crap code, etc.
Besides adding some documentation to it, do some cleanup and ensure
that the following is implemented and supported:
* non-DFS mounts
* DFS failover
* DFS root mounts
- tcon and cifs_sb must contain DFS path (NOT including prefix)
- if prefix path, then save it in cifs_sb and it must not be
changed
* DFS link mounts
- tcon and cifs_sb must contain DFS path (including prefix)
- if prefix path, then save it in cifs_sb and it may be changed
* prevent recursion on broken link referrals (MAX_NESTED_LINKS)
* check every path component of the currently resolved
target (including prefix), and chase them accordingly
* make sure that DFS referrals go through newly resolved root
servers
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
For DFS root mounts that contain a prefix path, do not change them
after failover.
E.g., if the user mounts
//srvA/root/dir1
and then lost connection to srvA, it will reconnect to
//srvB/root/dir1
In case of DFS links, which may resolve to different prefix paths
depending on their list of targets, the following must be supported:
- mount //srvA/root/link/bar
- connect to //srvA/share
- set prefix path to "bar"
- lost connection to srvA
- reconnect to next target: //srvB/share/foo
- set new prefix path to "foo/bar"
In cifs_tree_connect(), check the server_type field of the cached DFS
referral to determine whether or not prefix path should be updated.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently if the call dfs_cache_get_tgt_share fails we cannot
fully guarantee that share and prefix are set to NULL and the
next iteration of the loop can end up potentially double freeing
these pointers. Since the semantics of dfs_cache_get_tgt_share
are ambiguous for failure cases with the setting of share and
prefix (currently now and the possibly the future), it seems
prudent to set the pointers to NULL when the objects are
free'd to avoid any double frees.
Addresses-Coverity: ("Double free")
Fixes: 96296c946a2a ("cifs: handle RESP_GET_DFS_REFERRAL.PathConsumed in reconnect")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Use PathConsumed field when parsing prefixes of referral paths that
either match a cache entry or are a complete prefix path of an
existing entry.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In case there were no cached DFS referrals in
reconn_setup_dfs_targets(), set cifs_sb to NULL prior to calling
reconn_set_next_dfs_target() so it would not try to access an empty
tgt_list.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This function has nothing to do with *invalidation* but setting up the
next target server from a cached referral.
Rename it to reconn_set_next_dfs_target(). While at it, get rid of
some meaningless checks.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When looking up the DFS cache with a referral path that has more than
two path components, and is a complete prefix of an existing cache
entry, do not request another referral and just return the matched
entry as specified in MS-DFSC 3.2.5.5 Receiving a Root Referral
Request or Link Referral Request.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
They were identical execpt to CIFSTCon() vs. SMB2_tcon().
These are also available via ops->tree_connect().
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Convert cpu_to_be32(be32_to_cpu(E1) + E2) to use be32_add_cpu().
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Drop repeated words in multiple comments.
(be, use, the, See)
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Steve French <sfrench@samba.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Remove the superfuous break, as there is a 'return' before it.
Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>
RHBZ 1145308
Some very old server may not support SetPathInfo to adjust the timestamps
of directories. For these servers, try to open the directory and use SetFileInfo.
Minor correction to patch included that was
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Tested-by: Kenneth D'souza <kdsouza@redhat.com>
If server returns ERRBaduid but does not reset transport connection,
we'll keep sending command with a non-valid UID for the server as long
as transport is healthy, without actually recovering. This have been
observed on the field.
This patch adds ERRBaduid handling so that we set CifsNeedReconnect.
map_and_check_smb_error() can be modified to extend use cases.
Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Fix build warning by removing unused variable 'server':
fs/cifs/inode.c:1089:26: warning:
variable server set but not used [-Wunused-but-set-variable]
1089 | struct TCP_Server_Info *server;
| ^~~~~~
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
When mounting with Kerberos, users have been confused about the
default error returned in scenarios in which either keyutils is
not installed or the user did not properly acquire a krb5 ticket.
Log a warning message in the case that "ENOKEY" is returned
from the get_spnego_key upcall so that users can better understand
why mount failed in those two cases.
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The log of UAF problem is listed below.
BUG: KASAN: use-after-free in jffs2_rmdir+0xa4/0x1cc [jffs2] at addr c1f165fc
Read of size 4 by task rm/8283
=============================================================================
BUG kmalloc-32 (Tainted: P B O ): kasan: bad access detected
-----------------------------------------------------------------------------
INFO: Allocated in 0xbbbbbbbb age=3054364 cpu=0 pid=0
0xb0bba6ef
jffs2_write_dirent+0x11c/0x9c8 [jffs2]
__slab_alloc.isra.21.constprop.25+0x2c/0x44
__kmalloc+0x1dc/0x370
jffs2_write_dirent+0x11c/0x9c8 [jffs2]
jffs2_do_unlink+0x328/0x5fc [jffs2]
jffs2_rmdir+0x110/0x1cc [jffs2]
vfs_rmdir+0x180/0x268
do_rmdir+0x2cc/0x300
ret_from_syscall+0x0/0x3c
INFO: Freed in 0x205b age=3054364 cpu=0 pid=0
0x2e9173
jffs2_add_fd_to_list+0x138/0x1dc [jffs2]
jffs2_add_fd_to_list+0x138/0x1dc [jffs2]
jffs2_garbage_collect_dirent.isra.3+0x21c/0x288 [jffs2]
jffs2_garbage_collect_live+0x16bc/0x1800 [jffs2]
jffs2_garbage_collect_pass+0x678/0x11d4 [jffs2]
jffs2_garbage_collect_thread+0x1e8/0x3b0 [jffs2]
kthread+0x1a8/0x1b0
ret_from_kernel_thread+0x5c/0x64
Call Trace:
[c17ddd20] [c02452d4] kasan_report.part.0+0x298/0x72c (unreliable)
[c17ddda0] [d2509680] jffs2_rmdir+0xa4/0x1cc [jffs2]
[c17dddd0] [c026da04] vfs_rmdir+0x180/0x268
[c17dde00] [c026f4e4] do_rmdir+0x2cc/0x300
[c17ddf40] [c001a658] ret_from_syscall+0x0/0x3c
The root cause is that we don't get "jffs2_inode_info.sem" before
we scan list "jffs2_inode_info.dents" in function jffs2_rmdir.
This patch add codes to get "jffs2_inode_info.sem" before we scan
"jffs2_inode_info.dents" to slove the UAF problem.
Signed-off-by: Zhe Li <lizhe67@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Thanks for the advice mentioned in the email.
This is my v3 patch for this problem.
Mounting jffs2 on nand flash will get message "failed: I/O error"
with the steps listed below.
1.umount jffs2
2.erase nand flash
3.mount jffs2 on it (this mounting operation will be successful)
4.do chown or chmod to the mount point directory
5.umount jffs2
6.mount jffs2 on nand flash
After step 6, we will get message "mount ... failed: I/O error".
Typical image of this problem is like:
Empty space found from 0x00000000 to 0x008a0000
Inode node at xx, totlen 0x00000044, #ino 1, version 1, isize 0...
The reason for this mounting failure is that at the end of function
jffs2_scan_medium(), jffs2 will check the used_size and some info
of nr_blocks.If conditions are met, it will return -EIO.
The detail is that, in the steps listed above, step 4 will write
jffs2_raw_inode into flash without jffs2_raw_dirent, which will
cause that there are some jffs2_raw_inode but no jffs2_raw_dirent
on flash. This will meet the condition at the end of function
jffs2_scan_medium() and return -EIO if we umount jffs2 and mount it
again.
We notice that jffs2 add the value of c->unchecked_size if we find
an inode node while mounting. And jffs2 will never add the value of
c->unchecked_size in other situations. So this patch add one more
condition about c->unchecked_size of the judgement to fix this problem.
Signed-off-by: Zhe Li <lizhe67@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
There a wrong orphan node deleting in error handling path in
ubifs_jnl_update() and ubifs_jnl_rename(), which may cause
following error msg:
UBIFS error (ubi0:0 pid 1522): ubifs_delete_orphan [ubifs]:
missing orphan ino 65
Fix this by checking whether the node has been operated for
adding to orphan list before being deleted,
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Fixes: 823838a486 ("ubifs: Add hashes to the tree node cache")
Signed-off-by: Richard Weinberger <richard@nod.at>
Drop the repeated word "as" in a comment.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Richard Weinberger <richard@nod.at>
Instead of creating ubifs file systems with UBIFS_FORMAT_VERSION
by default, add a module parameter ubifs.default_version to allow
the user to specify the desired version. Valid values are 4 to
UBIFS_FORMAT_VERSION (currently 5).
This way, one can for example create a file system with version 4
on kernel 4.19 which can still be mounted rw when downgrading to
kernel 4.9.
Signed-off-by: Martin Kaistra <martin.kaistra@linutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
nfs_wb_all() calls filemap_write_and_wait(), which uses
filemap_check_errors() to determine the error to return.
filemap_check_errors() only looks at the mapping->flags and will
therefore only return either -ENOSPC or -EIO. To ensure that the
correct error is returned on close(), nfs{,4}_file_flush() should call
filemap_check_wb_err() which looks at the errseq value in
mapping->wb_err without consuming it.
Fixes: 6fbda89b25 ("NFS: Replace custom error reporting mechanism with
generic one")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
As recently done with with send/recv, flip the if after
rw_verify_aread() in io_{read,write}() and tabulise left bits left.
This removes mispredicted by a compiler jump on the success/fast path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As soon as we install the file descriptor, we have to assume that it
can get arbitrarily closed. We currently account memory (and note that
we did) after installing the ring fd, which means that it could be a
potential use-after-free condition if the fd is closed right after
being installed, but before we fiddle with the ctx.
In fact, syzbot reported this exact scenario:
BUG: KASAN: use-after-free in io_account_mem fs/io_uring.c:7397 [inline]
BUG: KASAN: use-after-free in io_uring_create fs/io_uring.c:8369 [inline]
BUG: KASAN: use-after-free in io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400
Read of size 1 at addr ffff888087a41044 by task syz-executor.5/18145
CPU: 0 PID: 18145 Comm: syz-executor.5 Not tainted 5.8.0-rc7-next-20200729-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x18f/0x20d lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
io_account_mem fs/io_uring.c:7397 [inline]
io_uring_create fs/io_uring.c:8369 [inline]
io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45c429
Code: 8d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 5b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f8f121d0c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000008540 RCX: 000000000045c429
RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000196
RBP: 000000000078bf38 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000078bf0c
R13: 00007fff86698cff R14: 00007f8f121d19c0 R15: 000000000078bf0c
Move the accounting of the ring used locked memory before we get and
install the ring file descriptor.
Cc: stable@vger.kernel.org
Reported-by: syzbot+9d46305e76057f30c74e@syzkaller.appspotmail.com
Fixes: 309758254e ("io_uring: report pinned memory usage")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a simple helper to set timestamps with a kernel space file name and
switch the early init code over to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to mknod with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_mknod.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to mkdir with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_mkdir.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to symlink with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_symlink.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to link with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_link.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to check if a file exists based on kernel space file
name and switch the early init code over to it. Note that this
theoretically changes behavior as it always is based on the effective
permissions. But during early init that doesn't make a difference.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to chroot with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_chroot.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to chdir with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_chdir.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to rmdir with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_rmdir.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to unlink with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_unlink.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Like ksys_umount, but takes a kernel pointer for the destination path.
Switch over the umount in the init code, which just happen to work due to
the implicit set_fs(KERNEL_DS) during early init right now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Like do_mount, but takes a kernel pointer for the destination path.
Switch over the mounts in the init code and devtmpfs to it, which
just happen to work due to the implicit set_fs(KERNEL_DS) during early
init right now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
This mirrors do_unlinkat and will make life a little easier for
the init code to reuse the whole function with a kernel filename.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Factor out a path_umount helper that takes a struct path * instead of the
actual file name. This will allow to convert the init and devtmpfs code
to properly mount based on a kernel pointer instead of relying on the
implicit set_fs(KERNEL_DS) during early init.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Factor out a path_mount helper that takes a struct path * instead of the
actual file name. This will allow to convert the init and devtmpfs code
to properly mount based on a kernel pointer instead of relying on the
implicit set_fs(KERNEL_DS) during early init.
Signed-off-by: Christoph Hellwig <hch@lst.de>