Commit Graph

77903 Commits

Author SHA1 Message Date
Stefan Roesch
18e419f6e8 iomap: Return -EAGAIN from iomap_write_iter()
If iomap_write_iter() encounters -EAGAIN, return -EAGAIN to the caller.

Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20220623175157.1715274-7-shr@fb.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: make the suggested ternary edit]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:31 -06:00
Stefan Roesch
cae2de6978 iomap: Add async buffered write support
This adds async buffered write support to iomap.

This replaces the call to balance_dirty_pages_ratelimited() with a
call to balance_dirty_pages_ratelimited_flags(), which allows the caller
to specify whether the write request is async.

In addition this also moves the above function call to the beginning of
the function. If the function call is at the end of the function and the
decision is made to throttle writes, then there is no request that
io-uring can wait on. By moving it to the beginning of the function, the
write request is not issued, but returns -EAGAIN instead. io-uring will
punt the request and process it in the io-worker.

By moving the function call to the beginning of the function, the write
throttling will happen one page later.

Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20220623175157.1715274-6-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:31 -06:00
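The effect of checking the throttle at the start of the write loop, as described in the two iomap entries above, can be modelled roughly as follows. This is a Python sketch rather than kernel code: `buffered_write`, its parameters, and the "dirty counter reaches a limit" rule are all assumptions made for illustration, and the final `written if written else status` line only mirrors the ternary mentioned in the commit message.

```python
import errno


def buffered_write(chunks, dirty, limit, nowait):
    """Toy model of a buffered write loop (hypothetical, not the iomap code).

    chunks: list of bytes objects to write
    dirty:  current count of dirty pages (assumed throttle input)
    limit:  dirty count at which throttling kicks in (assumed rule)
    nowait: True for an async (non-blocking) request
    Returns bytes written, or a negative errno if nothing was written.
    """
    written = 0
    status = 0
    for chunk in chunks:
        # Throttle check comes first, before any data for this chunk is issued.
        if dirty >= limit:
            if nowait:
                # Async caller must not block: stop and report -EAGAIN.
                status = -errno.EAGAIN
                break
            # Blocking caller: pretend we slept until writeback drained.
            dirty = 0
        written += len(chunk)
        dirty += 1
    # Partial progress wins over the error code.
    return written if written else status
```

In this model an async request that hits the throttle before writing anything gets -EAGAIN back, so the caller can retry it from a context that is allowed to block.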
Stefan Roesch
9753b868fd iomap: Add flags parameter to iomap_page_create()
Add the kiocb flags parameter to the function iomap_page_create().
Depending on the value of the flags parameter it enables different gfp
flags.

No intended functional changes in this patch.

Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20220623175157.1715274-5-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:31 -06:00
Jens Axboe
ed29b0b4fd io_uring: move to separate directory
In preparation for splitting io_uring up a bit, move it into its own
top level directory. It didn't really belong in fs/ anyway, as it's
not a file system only API.

This adds io_uring/ and moves the core files in there, and updates the
MAINTAINERS file for the new location.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:10 -06:00
Jens Axboe
0702e5364f io_uring: define a 'prep' and 'issue' handler for each opcode
Rather than have two giant switches for doing request preparation and
then for doing request issue, add a prep and issue handler for each
of them in the io_op_defs[] request definition.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:10 -06:00
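The idea of replacing two large switches with per-opcode prep and issue handlers can be sketched as a dispatch table. The Python below is only an illustration: `OpDef`, `OP_DEFS`, `submit`, and the two toy opcodes are hypothetical names, not the contents of io_op_defs[].

```python
import errno
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class OpDef:
    """One table entry: how to prepare and how to issue a request."""
    prep: Callable[[dict], int]
    issue: Callable[[dict], int]


def nop_prep(req: dict) -> int:
    return 0


def nop_issue(req: dict) -> int:
    req["result"] = 0
    return 0


def read_prep(req: dict) -> int:
    # Validate arguments once, before the request is issued.
    if req.get("len", 0) < 0:
        return -errno.EINVAL
    return 0


def read_issue(req: dict) -> int:
    # Toy behaviour: pretend the whole length was read.
    req["result"] = req["len"]
    return 0


# Hypothetical opcode table; each opcode carries both of its handlers.
OP_DEFS: Dict[str, OpDef] = {
    "NOP": OpDef(prep=nop_prep, issue=nop_issue),
    "READ": OpDef(prep=read_prep, issue=read_issue),
}


def submit(opcode: str, req: dict) -> int:
    """Look up the opcode and run prep, then issue; no switch statement."""
    op = OP_DEFS.get(opcode)
    if op is None:
        return -errno.EINVAL
    ret = op.prep(req)
    if ret:
        return ret
    return op.issue(req)
```

Adding an opcode in this layout means adding one table entry instead of touching two separate switch statements.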
Namjae Jeon
1c90b54718 ksmbd: remove unused ksmbd_share_configs_cleanup function
Remove the unused ksmbd_share_configs_cleanup() function.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-24 15:30:16 -05:00
Anna Schumaker
d3b00a802c NFS: Replace the READ_PLUS decoding code
We now take a 2-step process that allows us to place data and hole
segments directly at their final position in the xdr_stream without
needing to do a bunch of redundant copies to expand holes. Due to the
variable lengths of each segment, the xdr metadata might cross page
boundaries which I account for by setting a small scratch buffer so
xdr_inline_decode() won't fail.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-23 15:38:29 -04:00
Chuck Lever
33ce83ef0b NFS: Replace fs_context-related dprintk() call sites with tracepoints
Contributed as part of the long patch series that converts NFS from
using dprintk to tracepoints for observability.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-23 15:34:40 -04:00
Jeff Layton
69d966510d nfs: only issue commit in DIO codepath if we have uncommitted data
Currently, we try to determine whether to issue a commit based on
nfs_write_need_commit which looks at the current verifier. In the case
where we got a short write and then tried to follow it up with one that
failed, the verifier can't be trusted.

What we really want to know is whether the pgio request had any
successful writes that came back as UNSTABLE. Add a new flag to the pgio
request, and use that to indicate that we've had a successful unstable
write. Only issue a commit if that flag is set.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-23 15:28:59 -04:00
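The flag-based decision described in the entry above can be illustrated with a small model. This is a Python sketch under assumptions: `PgioRequest`, `saw_unstable_write`, and the helper names are hypothetical and do not correspond to the actual NFS structures or flag names.

```python
from dataclasses import dataclass


@dataclass
class PgioRequest:
    # Hypothetical flag: set once any write succeeded and came back UNSTABLE.
    saw_unstable_write: bool = False


def record_write_result(req: PgioRequest, nbytes: int, stable: bool) -> None:
    """Record one write reply; nbytes < 0 models a failed write."""
    if nbytes > 0 and not stable:
        req.saw_unstable_write = True


def needs_commit(req: PgioRequest) -> bool:
    """Commit only if there is uncommitted (unstable) data on the server."""
    return req.saw_unstable_write
```

In this model a short unstable write followed by a failed retry still requests a commit, while a request whose writes all failed does not, regardless of what the verifier looks like.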
Jeff Layton
55051c0ced nfs: always check dreq->error after a commit
When the client gets back a short DIO write, it will then attempt to
issue another write to finish the DIO request. If that write then fails
(as is often the case in an -ENOSPC situation), then we still may need
to issue a COMMIT if the earlier short write was unstable. If that COMMIT
then succeeds, then we don't want the client to reschedule the write
requests, and to instead just return a short write. Otherwise, we can
end up looping over the same DIO write forever.

Always consult dreq->error after a successful RPC, even when the flag
state is not NFS_ODIRECT_DONE.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=2028370
Reported-by: Boyang Xue <bxue@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-23 15:28:59 -04:00
Jeff Layton
8efc4bbe84 nfs: add new nfs_direct_req tracepoint events
Add some new tracepoints to the DIO write code.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-23 15:28:59 -04:00
Linus Torvalds
a5235996e1 Merge tag 'io_uring-5.19-2022-07-21' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
 "Fix for a bad kfree() introduced in this cycle, and a quick fix for
  disabling buffer recycling for IORING_OP_READV.

  The latter will get reworked for 5.20, but it gets the job done for
  5.19"

* tag 'io_uring-5.19-2022-07-21' of git://git.kernel.dk/linux-block:
  io_uring: do not recycle buffer in READV
  io_uring: fix free of unallocated buffer list
2022-07-22 12:47:09 -07:00
Christoph Hellwig
478af190cb iomap: remove iomap_writepage
Unused now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-22 10:59:17 -07:00
Christoph Hellwig
7b86e8a5ba zonefs: remove ->writepage
->writepage is only used for single page writeback from memory reclaim,
and not called at all for cgroup writeback.  Follow the lead of XFS
and remove ->writepage and rely entirely on ->writepages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-22 10:59:16 -07:00
Christoph Hellwig
d3d71901b1 gfs2: remove ->writepage
->writepage is only used for single page writeback from memory reclaim,
and not called at all for cgroup writeback.  Follow the lead of XFS
and remove ->writepage and rely entirely on ->writepages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-22 10:59:16 -07:00
Christoph Hellwig
b2b0a5e978 gfs2: stop using generic_writepages in gfs2_ail1_start_one
Use filemap_fdatawrite_wbc instead of generic_writepages in
gfs2_ail1_start_one so that the function can also cope with address_space
operations that only implement ->writepages and to properly account
for cgroup writeback.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-22 10:59:15 -07:00
Slark Xiao
4869b6e84a xfs: Fix typo 'the the' in comment
Replace 'the the' with 'the' in the comment.

Signed-off-by: Slark Xiao <slark_xiao@163.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-22 10:58:39 -07:00
Xin Gao
29d286d0ce xfs: Fix comment typo
The word `the' is duplicated in line 552; remove one.

Signed-off-by: Xin Gao <gaoxin@cdjrlc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-22 10:58:39 -07:00
Gao Xiang
cc2a171372 erofs: get rid of the leftover PAGE_SIZE in dir.c
Convert the last hardcoded PAGE_SIZEs of uncompressed cases.

Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220619150940.121005-1-hsiangkao@linux.alibaba.com
2022-07-22 21:46:26 +08:00
Gao Xiang
de8a801ab6 erofs: get rid of erofs_prepare_dio() helper
Fold in erofs_prepare_dio() in order to simplify the code.

Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220720082229.12172-1-hsiangkao@linux.alibaba.com
2022-07-22 21:46:03 +08:00
Gao Xiang
267f2492c8 erofs: introduce multi-reference pclusters (fully-referenced)
Let's introduce multi-reference pclusters at runtime. In detail,
if one pcluster is requested by multiple extents at almost the same
time (even if they belong to different files), the longest extent will be
decompressed as representative and the other extents are actually
copied from the longest one in one round.

After this patch, fully-referenced extents can be correctly handled
and the full decoding check needs to be bypassed for
partial-referenced extents.

Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-17-hsiangkao@linux.alibaba.com
2022-07-22 21:44:27 +08:00
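The "decompress the longest extent once, copy the rest" idea from the entry above can be sketched as below. This Python model makes strong simplifying assumptions: zlib stands in for the real compressor, each requester is assumed to need a prefix of the decompressed data, and `serve_requests` is a hypothetical name.

```python
import zlib


def serve_requests(compressed: bytes, lengths: list) -> list:
    """Serve several requests against one compressed cluster.

    lengths: decompressed length wanted by each requester (assumed to be
             a prefix of the cluster's decompressed data).
    The cluster is decompressed once, only as far as the longest request
    needs; every other request is satisfied by copying from that buffer.
    """
    longest = max(lengths)
    # Decompress at most `longest` bytes in a single pass.
    representative = zlib.decompressobj().decompress(compressed, longest)
    # Shorter requests are plain copies out of the representative buffer.
    return [representative[:n] for n in lengths]
```

The point of the model is only the cost shape: one decompression per round, however many extents reference the same compressed data.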
Gao Xiang
2bfab9c0ed erofs: record the longest decompressed size in this round
Currently, `pcl->length' records the longest decompressed length
as long as the pcluster itself isn't reclaimed.  However, such a
number is unneeded for the general cases since it doesn't indicate
the exact decompressed size in this round.

Instead, let's record the decompressed size for this round,
thus `pcl->nr_pages' can be completely dropped and pageofs_out is
also designed to be kept in sync with `pcl->length'.

Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-16-hsiangkao@linux.alibaba.com
2022-07-21 22:55:44 +08:00
Gao Xiang
3fe96ee0f9 erofs: introduce z_erofs_do_decompressed_bvec()
Both out_bvecs and in_bvecs share the common logic for decompressed
buffers. So let's make a helper for this.

Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-15-hsiangkao@linux.alibaba.com
2022-07-21 22:55:37 +08:00
Gao Xiang
fe3e5914e6 erofs: try to leave (de)compressed_pages on stack if possible
In most cases, small pclusters can be decompressed with page
arrays on stack.

Try to leave both (de)compressed_pages on stack if possible as before.

Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-14-hsiangkao@linux.alibaba.com
2022-07-21 22:55:30 +08:00
Gao Xiang
4f05687fd7 erofs: introduce struct z_erofs_decompress_backend
Let's introduce struct z_erofs_decompress_backend in order to pass
the decompression backend context between helper functions more
easily and avoid too many arguments.

Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-13-hsiangkao@linux.alibaba.com
2022-07-21 22:55:22 +08:00
Gao Xiang
e73681877d erofs: get rid of `z_pagemap_global'
In order to introduce multi-reference pclusters for compressed data
deduplication, let's get rid of the global page array for now since
it needs to be re-designed then at least.

Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-12-hsiangkao@linux.alibaba.com
2022-07-21 22:55:15 +08:00
Gao Xiang
db166fc202 erofs: clean up `enum z_erofs_collectmode'
`enum z_erofs_collectmode' is really ambiguous, but I'm not quite
sure if there is a better name; basically it's used to judge whether
inplace I/O can be used due to the current status of pclusters in
the chain.

Rename it as `enum z_erofs_pclustermode' instead.

Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-11-hsiangkao@linux.alibaba.com
2022-07-21 22:55:07 +08:00
Gao Xiang
5b220b204c erofs: get rid of `enum z_erofs_page_type'
Remove it since pagevec[] is no longer used.

Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-10-hsiangkao@linux.alibaba.com
2022-07-21 22:54:54 +08:00
Gao Xiang
671485516e erofs: rework online page handling
Since all decompressed offsets have been integrated to bvecs[], this
patch avoids all sub-indexes so that page->private only includes a
part count and an eio flag, thus in the future folio->private can have
the same meaning.

In addition, PG_error will not be used anymore after this patch and
we're heading to use page->private (later folio->private) and
page->mapping  (later folio->mapping) only.

Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-9-hsiangkao@linux.alibaba.com
2022-07-21 22:54:46 +08:00
Gao Xiang
ed722fbcca erofs: switch compressed_pages[] to bufvec
Convert compressed_pages[] to bufvec in order to avoid using
page->private to keep onlinepage_index (decompressed offset)
for inplace I/O pages.

In the future, we only rely on folio->private to keep a countdown
to unlock folios and set folio_uptodate.

Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-8-hsiangkao@linux.alibaba.com
2022-07-21 22:54:37 +08:00
Gao Xiang
67139e36d9 erofs: introduce `z_erofs_parse_in_bvecs'
`z_erofs_decompress_pcluster()' is too long, therefore it'd be better
to introduce another helper to parse compressed pages (or later,
compressed bvecs.)

BTW, since `compressed_bvecs' is too long as a part of the function
name, `in_bvecs' is used here instead.

Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-7-hsiangkao@linux.alibaba.com
2022-07-21 22:54:29 +08:00
Gao Xiang
387bab8716 erofs: drop the old pagevec approach
Remove the old pagevec approach but keep z_erofs_page_type for now.
It will be reworked in the following commits as well.

Also rename Z_EROFS_NR_INLINE_PAGEVECS as Z_EROFS_INLINE_BVECS with
the new value 2 since it's actually enough to bootstrap.

Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-6-hsiangkao@linux.alibaba.com
2022-07-21 22:54:20 +08:00
Gao Xiang
06a304cd9c erofs: introduce bufvec to store decompressed buffers
For each pcluster, the total number of compressed buffers is determined
in advance, yet the number of decompressed buffers actually varies.  Too
many decompressed pages can be recorded if one pcluster is highly
compressed or its pcluster size is large.  That takes extra memory
footprint compared to uncompressed filesystems, especially with a lot of
I/O in flight on low-end devices.

Therefore, similar to inplace I/O, pagevec was introduced to reuse
page cache to store these pointers in the time-sharing way since
these pages are actually unused before decompressing.

In order to make it more flexible, a cleaner bufvec is used to
replace the old pagevec stuff so that

 - Decompressed offsets can be stored inline, thus it can be used
   for the upcoming feature like compressed data deduplication.
   It's calculated by `page_offset(page) - map->m_la';

 - Towards supporting large folios for compressed inodes since
   our final goal is to completely avoid page->private but use
   folio->private only for all page cache pages.

Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-5-hsiangkao@linux.alibaba.com
2022-07-21 22:54:10 +08:00
Gao Xiang
42fec235f1 erofs: introduce `z_erofs_parse_out_bvecs()'
`z_erofs_decompress_pcluster()' is too long, therefore it'd be better
to introduce another helper to parse decompressed pages (or later,
decompressed bvecs.)

BTW, since `decompressed_bvecs' is too long as a part of the function
name, `out_bvecs' is used instead.

Reviewed-by: Yue Hu <huyue2@coolpad.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-4-hsiangkao@linux.alibaba.com
2022-07-21 22:53:53 +08:00
Gao Xiang
0d823b424f erofs: clean up z_erofs_collector_begin()
Rearrange the code and get rid of all gotos.

Reviewed-by: Yue Hu <huyue2@coolpad.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-3-hsiangkao@linux.alibaba.com
2022-07-21 22:53:43 +08:00
Gao Xiang
83a386c0a5 erofs: get rid of unneeded `inode', `map' and `sb'
Since commit 5c6dcc57e2 ("erofs: get rid of
`struct z_erofs_collector'"), these arguments can be dropped as well.

No logic changes.

Reviewed-by: Yue Hu <huyue2@coolpad.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-2-hsiangkao@linux.alibaba.com
2022-07-21 22:53:33 +08:00
Dylan Yudaken
934447a603 io_uring: do not recycle buffer in READV
READV cannot recycle buffers as it would lose some of the data required to
reimport that buffer.

Reported-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Fixes: b66e65f414 ("io_uring: never call io_buffer_select() for a buffer re-select")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220721131325.624788-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 08:31:31 -06:00
Dylan Yudaken
ec8516f3b7 io_uring: fix free of unallocated buffer list
In the error path of io_register_pbuf_ring, only free bl if it was
allocated.

Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/all/CANX2M5bXKw1NaHdHNVqssUUaBCs8aBpmzRNVEYEvV0n44P7ioA@mail.gmail.com/
Link: https://lore.kernel.org/all/CANX2M5YiZBXU3L6iwnaLs-HHJXRvrxM8mhPDiMDF9Y9sAvOHUA@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 08:29:01 -06:00
Deming Wang
1e5b9e048c virtiofs: delete unused parameter for virtio_fs_cleanup_vqs
The fs parameter is not used, so delete it.

Signed-off-by: Deming Wang <wangdeming@inspur.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:06:19 +02:00
Dave Marchevsky
9ccf47b26b fuse: Add module param for CAP_SYS_ADMIN access bypassing allow_other
Since commit 73f03c2b4b ("fuse: Restrict allow_other to the superblock's
namespace or a descendant"), access to allow_other FUSE filesystems has
been limited to users in the mounting user namespace or descendants. This
prevents a process that is privileged in its userns - but not its parent
namespaces - from mounting a FUSE fs w/ allow_other that is accessible to
processes in parent namespaces.

While this restriction makes sense overall it breaks a legitimate usecase:
I have a tracing daemon which needs to peek into process' open files in
order to symbolicate - similar to 'perf'. The daemon is a privileged
process in the root userns, but is unable to peek into FUSE filesystems
mounted by processes in child namespaces.

This patch adds a module param, allow_sys_admin_access, to act as an escape
hatch for this descendant userns logic and for the allow_other mount option
in general. Setting allow_sys_admin_access allows processes with
CAP_SYS_ADMIN in the initial userns to access FUSE filesystems irrespective
of the mounting userns or whether allow_other was set. A sysadmin setting
this param must trust FUSEs on the host to not DoS processes as described
in 73f03c2b4b.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:06:19 +02:00
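The access rule described in the entry above can be summarised as a small decision function. This is a Python sketch of the policy as stated in the commit message, not the FUSE implementation; the function name and its boolean parameters are hypothetical, and the final "mount owner only" branch reflects the usual FUSE behaviour when allow_other is not set.

```python
def may_access_fuse_fs(
    allow_sys_admin_access: bool,
    has_cap_sys_admin_in_init_userns: bool,
    allow_other: bool,
    in_mount_userns_or_descendant: bool,
    is_mount_owner: bool,
) -> bool:
    """Model of who may access a FUSE filesystem (illustrative only)."""
    # Escape hatch: module param set and caller has CAP_SYS_ADMIN in the
    # initial user namespace, irrespective of userns or allow_other.
    if allow_sys_admin_access and has_cap_sys_admin_in_init_userns:
        return True
    # allow_other: limited to the mounting userns or its descendants.
    if allow_other:
        return in_mount_userns_or_descendant
    # Otherwise only the user who mounted the filesystem.
    return is_mount_owner
```

With the parameter off, a privileged daemon in the root userns is still refused access to a filesystem mounted in a child namespace, which is the case the commit message describes.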
Xie Yongji
c64797809a fuse: Remove the control interface for virtio-fs
The commit 15c8e72e88 ("fuse: allow skipping control interface and forced
unmount") tries to remove the control interface for virtio-fs since it does
not support aborting requests which are being processed. But it doesn't
work now.

This patch fixes it by skipping creating the control interface if
fuse_conn->no_control is set.

Fixes: 15c8e72e88 ("fuse: allow skipping control interface and forced unmount")
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:06:19 +02:00
Miklos Szeredi
02c0cab8e7 fuse: ioctl: translate ENOSYS
Overlayfs may fail to complete updates when a filesystem lacks
fileattr/xattr syscall support and responds with an ENOSYS error code,
resulting in an unexpected "Function not implemented" error.

This bug may occur with FUSE filesystems, such as davfs2.

Steps to reproduce:

  # install davfs2, e.g., apk add davfs2
  mkdir /test /test/lower /test/upper /test/work /test/mnt
  yes '' | mount -t davfs -o ro http://some-web-dav-server/path \
    /test/lower
  mount -t overlay -o upperdir=/test/upper,lowerdir=/test/lower \
    -o workdir=/test/work overlay /test/mnt

  # when "some-file" exists in the lowerdir, this fails with "Function
  # not implemented", with dmesg showing "overlayfs: failed to retrieve
  # lower fileattr (/some-file, err=-38)"
  touch /test/mnt/some-file

The underlying cause of this regression is actually in FUSE, which fails to
translate the ENOSYS error code returned by the userspace filesystem (which
means that the ioctl operation is not supported) to ENOTTY.

Reported-by: Christian Kohlschütter <christian@kohlschutter.com>
Fixes: 72db82115d ("ovl: copy up sync/noatime fileattr flags")
Fixes: 59efec7b90 ("fuse: implement ioctl support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:06:18 +02:00
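The error translation described in the entry above amounts to a one-line mapping. The Python below is a sketch only; `translate_ioctl_error` is a hypothetical name, and error codes are modelled as negative errno values in the kernel's convention.

```python
import errno


def translate_ioctl_error(err: int) -> int:
    """Map a userspace filesystem's ioctl reply to what callers expect.

    ENOSYS from the userspace filesystem means "ioctl not supported",
    which callers expect to see as ENOTTY. Everything else passes through.
    """
    if err == -errno.ENOSYS:
        return -errno.ENOTTY
    return err
```

A caller such as overlayfs can then treat ENOTTY as "operation not supported here" instead of failing with "Function not implemented".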
Miklos Szeredi
47912eaa06 fuse: limit nsec
Limit nanoseconds to 0..999999999.

Fixes: d8a5ba4545 ("[PATCH] FUSE - core")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:06:18 +02:00
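The range limit named in the entry above can be written as a simple clamp. This Python sketch is illustrative; `clamp_nsec` is a hypothetical helper, and the lower bound is included only because Python integers are signed.

```python
NSEC_PER_SEC = 1_000_000_000


def clamp_nsec(nsec: int) -> int:
    """Force a nanosecond field into the valid range 0..999999999."""
    return min(max(nsec, 0), NSEC_PER_SEC - 1)
```

A timestamp with an out-of-range nanosecond part supplied by a userspace filesystem is thereby pinned to the nearest valid value rather than propagated.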
Jeffle Xu
47e301491c fuse: avoid unnecessary spinlock bump
Move dmap free worker kicker inside the critical region, so that extra
spinlock lock/unlock could be avoided.

Suggested-by: Liu Jiang <gerry@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:02:45 +02:00
Miklos Szeredi
2fdbb8dd01 fuse: fix deadlock between atomic O_TRUNC and page invalidation
fuse_finish_open() will be called with FUSE_NOWRITE set in case of atomic
O_TRUNC open(), so commit 76224355db ("fuse: truncate pagecache on
atomic_o_trunc") replaced invalidate_inode_pages2() by truncate_pagecache()
in such a case to avoid the A-A deadlock. However, we found another A-B-B-A
deadlock related to the case above, which causes the xfstests
generic/464 test case to hang in our virtio-fs test environment.

For example, consider two processes concurrently opening the same file, one
with O_TRUNC and another without O_TRUNC. The deadlock case is described
below: if open(O_TRUNC) has already called set_nowrite (acquired A) and is
trying to lock a page (acquiring B), while open() already holds the page
lock (acquired B) and is waiting on the page writeback (acquiring A), the
two processes deadlock.

open(O_TRUNC)
----------------------------------------------------------------
fuse_open_common
  inode_lock            [C acquire]
  fuse_set_nowrite      [A acquire]

  fuse_finish_open
    truncate_pagecache
      lock_page         [B acquire]
      truncate_inode_page
      unlock_page       [B release]

  fuse_release_nowrite  [A release]
  inode_unlock          [C release]
----------------------------------------------------------------

open()
----------------------------------------------------------------
fuse_open_common
  fuse_finish_open
    invalidate_inode_pages2
      lock_page         [B acquire]
        fuse_launder_page
          fuse_wait_on_page_writeback [A acquire & release]
      unlock_page       [B release]
----------------------------------------------------------------

Besides this case, all calls of invalidate_inode_pages2() and
invalidate_inode_pages2_range() in fuse code also can deadlock with
open(O_TRUNC).

Fix by moving the truncate_pagecache() call outside the nowrite protected
region.  The nowrite protection is only for delayed writeback
(writeback_cache) case, where inode lock does not protect against
truncation racing with writes on the server.  Write syscalls racing with
page cache truncation still get the inode lock protection.

This patch also changes the order of filemap_invalidate_lock()
vs. fuse_set_nowrite() in fuse_open_common().  This new order matches the
order found in fuse_file_fallocate() and fuse_do_setattr().

Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Fixes: e4648309b8 ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:02:45 +02:00
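The A-B-B-A pattern in the two call traces above can be checked mechanically by comparing lock acquisition orders. The Python below is a minimal sketch: `find_lock_order_conflicts` is a hypothetical helper, the letters A, B, and C are the labels used in the commit message, and the "after" sequences are an assumed simplification of the fix (A is no longer held while B is taken in the O_TRUNC path).

```python
def find_lock_order_conflicts(sequences):
    """Find pairs of locks that are taken in opposite orders.

    sequences: list of lists; each inner list names locks in the order
               they are acquired while all earlier ones are still held.
    Returns a set of frozensets, one per conflicting pair.
    """
    seen = set()
    for locks in sequences:
        for i, first in enumerate(locks):
            for second in locks[i + 1:]:
                # `second` is acquired while `first` is held.
                seen.add((first, second))
    # A pair is a conflict if both orders were observed somewhere.
    return {frozenset(pair) for pair in seen if (pair[1], pair[0]) in seen}


# Orders taken from the traces: open(O_TRUNC) takes C, A, B; open() takes B, A.
before_fix = [["C", "A", "B"], ["B", "A"]]
# Assumed model of the fix: truncation (B) happens outside the nowrite region (A).
after_fix = [["C", "A"], ["C", "B"], ["B", "A"]]
```

Running the check on `before_fix` reports the A/B pair, while `after_fix` reports nothing, which matches the reasoning in the commit message.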
Miklos Szeredi
035ff33cf4 fuse: write inode in fuse_release()
A race between write(2) and close(2) allows pages to be dirtied after
fuse_flush -> write_inode_now().  If these pages are not flushed from
fuse_release(), then there might not be a writable open file later.  So any
remaining dirty pages must be written back before the file is released.

This is a partial revert of the blamed commit.

Reported-by: syzbot+6e1efbd8efaaa6860e91@syzkaller.appspotmail.com
Fixes: 36ea23374d ("fuse: write inode in fuse_vma_close() instead of fuse_release()")
Cc: <stable@vger.kernel.org> # v5.16
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:02:45 +02:00
Yang Xu
5fadbd9929 ceph: rely on vfs for setgid stripping
Now that we finished moving setgid stripping for regular files in setgid
directories into the vfs, individual filesystems don't need to manually
strip the setgid bit anymore. Drop the now unneeded code from ceph.

Link: https://lore.kernel.org/r/1657779088-2242-4-git-send-email-xuyang2018.jy@fujitsu.com
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-07-21 11:34:16 +02:00
Yang Xu
1639a49ccd fs: move S_ISGID stripping into the vfs_*() helpers
Move setgid handling out of individual filesystems and into the VFS
itself to stop the proliferation of setgid inheritance bugs.

Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.

When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If either of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.
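
The stripping rule in the preceding paragraphs can be sketched as a pure function over mode bits. This is a Python model, not the kernel helper: `strip_sgid_if_needed` and its boolean parameters are hypothetical, and the two privilege checks are reduced to flags supplied by the caller.

```python
import stat


def strip_sgid_if_needed(mode, dir_mode, in_group, has_cap_fsetid):
    """Return `mode` with S_ISGID removed when the caller may not keep it.

    mode:           requested mode of the new file (including file type)
    dir_mode:       mode of the parent directory
    in_group:       caller is in the group of the new inode
    has_cap_fsetid: caller has CAP_FSETID and is privileged over the parent
    """
    both = stat.S_ISGID | stat.S_IXGRP
    # Only files with both S_ISGID and S_IXGRP raised are affected.
    if (mode & both) != both:
        return mode
    # Directories inherit S_ISGID; stripping does not apply to them.
    if stat.S_ISDIR(mode):
        return mode
    # The rule only concerns creation inside setgid directories.
    if not (dir_mode & stat.S_ISGID):
        return mode
    # Either condition lets the caller keep the bit.
    if in_group or has_cap_fsetid:
        return mode
    return mode & ~stat.S_ISGID
```

For example, an unprivileged caller outside the group asking for mode 02770 in a setgid directory ends up with 0770 in this model.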

However, there are several key issues with the current implementation:

* S_ISGID stripping logic is entangled with umask stripping.

  If a filesystem doesn't support or enable POSIX ACLs then umask
  stripping is done directly in the vfs before calling into the
  filesystem.
  If the filesystem does support POSIX ACLs then umask stripping may be
  done in the filesystem itself when calling posix_acl_create().

  Since umask stripping has an effect on S_ISGID inheritance, e.g., by
  stripping the S_IXGRP bit from the file to be created, and all relevant
  filesystems have to call posix_acl_create() before inode_init_owner()
  where we currently take care of S_ISGID handling, S_ISGID handling is
  order dependent. IOW, whether or not you get a setgid bit depends on
  POSIX ACLs and umask and in what order they are called.

  Note that technically filesystems are free to impose their own
  ordering between posix_acl_create() and inode_init_owner() meaning
  that there are additional ordering issues that influence S_ISGID
  inheritance.

* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
  stripping logic.

  While that may be intentional (e.g. network filesystems might just
  defer setgid stripping to a server) it is often just a security issue.

This is not just ugly, it's unsustainably messy, especially since we do
still have bugs in this area years after the initial round of setgid
bugfixes.

So the current state is quite messy and, while we won't be able to make
it completely clean as posix_acl_create() is still a filesystem specific
call, we can improve the S_ISGID stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.

We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
in the VFS before umask handling. This means S_ISGID handling is
unaffected by whether umask stripping is done by the VFS
itself (if no POSIX ACLs are supported or enabled) or in the filesystem
in posix_acl_create() (if POSIX ACLs are supported).

The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
create new filesystem objects. We need to move them into there to make
sure that filesystems like overlayfs that have callchains like:

sys_mknod()
-> do_mknodat(mode)
   -> .mknod = ovl_mknod(mode)
      -> ovl_create(mode)
         -> vfs_mknod(mode)

get S_ISGID stripping done when calling into lower filesystems via
vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
vfs_mknod() takes care of that. This is in any case semantically cleaner
because S_ISGID stripping is a VFS security requirement.

Security hooks so far have seen the mode with the umask applied but
without S_ISGID handling done. The relevant hooks are called outside of
vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
helpers the security hooks would now see the mode without umask
stripping applied. For now we fix this by passing the mode with umask
settings applied to not risk any regressions for LSM hooks. IOW, nothing
changes for LSM hooks. It is worth pointing out that security hooks
never saw the mode that is seen by the filesystem when actually creating
the file. They have always been completely misplaced for that to work.

The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.

All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the ->mkdir(), ->mknod(),
->create(), ->tmpfile(), ->rename() inode operations.

Since directories always inherit the S_ISGID bit, with the exception of
xfs when irix_sgid_inherit mode is turned on, S_ISGID stripping doesn't
apply. The ->symlink() and ->link() inode operations trivially inherit
the mode from the target and the ->rename() inode operation inherits the
mode from the source inode. All other creation inode operations will get
S_ISGID handling via vfs_prepare_mode() when called from their relevant
vfs_*() helpers.

In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therefore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in its ioctl()
callpath will be affected by moving this logic into the vfs. We audited
those filesystems:

* btrfs allows the creation of filesystem objects through various
  ioctl()s. Snapshot creation literally takes a snapshot and so the mode
  is fully preserved and S_ISGID stripping doesn't apply.

  Creating a new subvolume relies on inode_init_owner() in
  btrfs_new_subvol_inode() but only creates directories and doesn't
  raise S_ISGID.

* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
  xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
  the actual extents ocfs2 uses a separate ioctl() that also creates the
  target file.

  IOW, ocfs2 circumvents the vfs entirely here and did indeed rely on
  inode_init_owner() to strip the S_ISGID bit. This is the only place
  where a filesystem needs to call mode_strip_sgid() directly but this
  is self-inflicted pain.

* spufs doesn't go through the vfs at all and doesn't use ioctl()s
  either. Instead it has a dedicated system call spufs_create() which
  allows the creation of filesystem objects. But spufs only creates
  directories and doesn't allow S_ISGID bits, i.e. it specifically only
  allows 0777 bits.

* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.

The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
these filesystems are mounted with their respective GRPID option then
newly created files inherit the parent directory's group
unconditionally. In these cases none of the filesystems call
inode_init_owner() and thus never stripped the S_ISGID bit for newly
created files. Moving this logic into the VFS means that they now get
the S_ISGID bit stripped. This is a user-visible change. If this leads
to regressions we will either need to figure out a better way or we need
to revert. However, given the various setgid bugs that we found just in
the last two years this is a regression risk we should take.

Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.

Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
Link: e014f37db1 ("xfs: use setattr_copy to set vfs inode attributes") [2]
Link: 01ea173e10 ("xfs: fix up non-directory creation in SGID directories") [3]
Link: fd84bfdddd ("ceph: fix up non-directory creation in SGID directories") [4]
Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.com
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
[<brauner@kernel.org>: rewrote commit message]
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-07-21 11:34:16 +02:00
Darrick J. Wong
c78c2d0903 xfs: don't leak memory when attr fork loading fails
I observed the following evidence of a memory leak while running xfs/399
from the xfs fsck test suite (edited for brevity):

XFS (sde): Metadata corruption detected at xfs_attr_shortform_verify_struct.part.0+0x7b/0xb0 [xfs], inode 0x1172 attr fork
XFS: Assertion failed: ip->i_af.if_u1.if_data == NULL, file: fs/xfs/libxfs/xfs_inode_fork.c, line: 315
------------[ cut here ]------------
WARNING: CPU: 2 PID: 91635 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
CPU: 2 PID: 91635 Comm: xfs_scrub Tainted: G        W         5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:assfail+0x46/0x4a [xfs]
Call Trace:
 <TASK>
 xfs_ifork_zap_attr+0x7c/0xb0
 xfs_iformat_attr_fork+0x86/0x110
 xfs_inode_from_disk+0x41d/0x480
 xfs_iget+0x389/0xd70
 xfs_bulkstat_one_int+0x5b/0x540
 xfs_bulkstat_iwalk+0x1e/0x30
 xfs_iwalk_ag_recs+0xd1/0x160
 xfs_iwalk_run_callbacks+0xb9/0x180
 xfs_iwalk_ag+0x1d8/0x2e0
 xfs_iwalk+0x141/0x220
 xfs_bulkstat+0x105/0x180
 xfs_ioc_bulkstat.constprop.0.isra.0+0xc5/0x130
 xfs_file_ioctl+0xa5f/0xef0
 __x64_sys_ioctl+0x82/0xa0
 do_syscall_64+0x2b/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

This newly-added assertion checks that there aren't any incore data
structures hanging off the incore fork when we're trying to reset its
contents.  From the call trace, it is evident that iget was trying to
construct an incore inode from the ondisk inode, but the attr fork
verifier failed and we were trying to undo all the memory allocations
that we had done earlier.

The three assertions in xfs_ifork_zap_attr check that the caller has
already called xfs_idestroy_fork, which clearly has not been done here.
As the zap function then zeroes the pointers, we've effectively leaked
the memory.

The shortest change would have been to insert an extra call to
xfs_idestroy_fork, but it makes more sense to bundle the _idestroy_fork
call into _zap_attr, since all other callsites call _idestroy_fork
immediately prior to calling _zap_attr.  IOWs, it eliminates one way to
fail.
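
The idea of bundling the destroy step into the zap step can be
sketched in userspace C. This is only an illustration under simplified
assumptions: struct sketch_fork, sketch_fork_load(),
sketch_fork_destroy(), sketch_fork_zap() and the sketch_live_allocs
counter are hypothetical and do not mirror the real xfs structures.

```c
#include <stdlib.h>

/* Hypothetical incore fork with a single heap-allocated payload. */
struct sketch_fork {
	void *data;
};

/* Counts outstanding payload allocations so a leak is observable. */
static int sketch_live_allocs;

/* Allocate the payload, as fork loading would. Returns 0 on success. */
static int sketch_fork_load(struct sketch_fork *f, size_t len)
{
	f->data = malloc(len);
	if (!f->data)
		return -1;
	sketch_live_allocs++;
	return 0;
}

/* Free whatever hangs off the fork. */
static void sketch_fork_destroy(struct sketch_fork *f)
{
	if (f->data) {
		free(f->data);
		sketch_live_allocs--;
	}
	f->data = NULL;
}

/* Reset the fork. Destroying first means a caller that forgot to do so
 * can no longer leak the payload when the pointer is zeroed. */
static void sketch_fork_zap(struct sketch_fork *f)
{
	sketch_fork_destroy(f);
	f->data = NULL;
}
```

With the destroy call inside the zap function, an error path that only
calls sketch_fork_zap() after a failed verification still frees the
payload.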

Note: This change only applies cleanly to 2ed5b09b3e, since we just
reworked the attr fork lifetime.  However, I think this memory leak has
existed since 0f45a1b20c, since the chain xfs_iformat_attr_fork ->
xfs_iformat_local -> xfs_init_local_fork will allocate
ifp->if_u1.if_data, but if xfs_ifork_verify_local_attr fails,
xfs_iformat_attr_fork will free i_afp without freeing any of the stuff
hanging off i_afp.  The solution for older kernels I think is to add the
missing call to xfs_idestroy_fork just prior to calling kmem_cache_free.

Found by fuzzing a.sfattr.hdr.totsize = lastbit in xfs/399.

Fixes: 2ed5b09b3e ("xfs: make inode attribute forks a permanent part of struct xfs_inode")
Probably-Fixes: 0f45a1b20c ("xfs: improve local fork verification")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-20 16:40:39 -07:00
sunliming
1a53d3d426 xfs: fix for variable set but not used warning
Fix the kernel warning below:

fs/xfs/scrub/repair.c:539:19: warning: variable 'agno' set but not used [-Wunused-but-set-variable]

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: sunliming <sunliming@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-20 16:40:39 -07:00