Commit Graph

74668 Commits

Author SHA1 Message Date
Darrick J. Wong
9e253954ac xfs: compact deferred intent item structures
Rearrange these structs to reduce the amount of unused padding bytes.
This saves eight bytes for each of the three structs changed here, which
means they're now all (rmap/bmap are 64 bytes, refc is 32 bytes) even
powers of two.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:36 -07:00
Darrick J. Wong
182696fb02 xfs: rename _zone variables to _cache
Now that we've gotten rid of the kmem_zone_t typedef, rename the
variables to _cache since that's what they are.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:20 -07:00
Darrick J. Wong
e7720afad0 xfs: remove kmem_zone typedef
Remove these typedefs by referencing kmem_cache directly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:00:31 -07:00
Linus Torvalds
5ab2ed0a8d Merge tag 'fuse-fixes-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "Syzbot discovered a race in case of reusing the fuse sb (introduced in
  this cycle).

  Fix it by doing the s_fs_info initialization at the proper place"

* tag 'fuse-fixes-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: clean up error exits in fuse_fill_super()
  fuse: always initialize sb->s_fs_info
  fuse: clean up fuse_mount destruction
  fuse: get rid of fuse_put_super()
  fuse: check s_root when destroying sb
2021-10-22 10:39:47 -10:00
Miklos Szeredi
cefd1b8327 fuse: decrement nlink on overwriting rename
Rename didn't decrement/clear nlink on overwritten target inode.

Create a common helper fuse_entry_unlinked() that handles this for unlink,
rmdir and rename.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:02 +02:00
Miklos Szeredi
84840efc3c fuse: simplify __fuse_write_file_get()
Use list_first_entry_or_null() instead of list_empty() + list_entry().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:02 +02:00
Miklos Szeredi
371e8fd029 fuse: move fuse_invalidate_attr() into fuse_update_ctime()
Logically it belongs there since attributes are invalidated due to the
updated ctime.  This is a cleanup and should not change behavior.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:01 +02:00
Peng Hao
b5d9758297 fuse: delete redundant code
'ia->io=io' has been set in fuse_io_alloc.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:01 +02:00
Peng Hao
5fe0fc9f1d fuse: use kmap_local_page()
Due to the introduction of kmap_local_*, the storage of slots used for
short-term mapping has changed from per-CPU to per-thread.  kmap_atomic()
disable preemption, while kmap_local_*() only disable migration.

There is no need to disable preemption in several kamp_atomic places used
in fuse.

Link: https://lwn.net/Articles/836144/
Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:01 +02:00
Miklos Szeredi
bda9a71980 fuse: annotate lock in fuse_reverse_inval_entry()
Add missing inode lock annotatation; found by syzbot.

Reported-and-tested-by: syzbot+9f747458f5990eaa8d43@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:01 +02:00
Miklos Szeredi
36ea23374d fuse: write inode in fuse_vma_close() instead of fuse_release()
Fuse ->release() is otherwise asynchronous for the reason that it can
happen in contexts unrelated to close/munmap.

Inode is already written back from fuse_flush().  Add it to
fuse_vma_close() as well to make sure inode dirtying from mmaps also get
written out before the file is released.

Also add error handling.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:01 +02:00
Miklos Szeredi
5c791fe1e2 fuse: make sure reclaim doesn't write the inode
In writeback cache mode mtime/ctime updates are cached, and flushed to the
server using the ->write_inode() callback.

Closing the file will result in a dirty inode being immediately written,
but in other cases the inode can remain dirty after all references are
dropped.  This result in the inode being written back from reclaim, which
can deadlock on a regular allocation while the request is being served.

The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because
serving a request involves unrelated userspace process(es).

Instead do the same as for dirty pages: make sure the inode is written
before the last reference is gone.

 - fallocate(2)/copy_file_range(2): these call file_update_time() or
   file_modified(), so flush the inode before returning from the call

 - unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so
   flush the ctime directly from this helper

Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:01 +02:00
Christoph Hellwig
1e03a36bdf block: simplify the block device syncing code
Get rid of the indirections and just provide a sync_bdevs
helper for the generic sync code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062530.2174626-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig
680e667bc2 ntfs3: use sync_blockdev_nowait
Use sync_blockdev_nowait instead of opencoding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig
cb9568ee75 fat: use sync_blockdev_nowait
Use sync_blockdev_nowait instead of opencoding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig
1226dfff57 btrfs: use sync_blockdev
Use sync_blockdev instead of opencoding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig
70164eb6cc block: remove __sync_blockdev
Instead offer a new sync_blockdev_nowait helper for the !wait case.
This new helper is exported as it will grow modular callers in a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062530.2174626-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig
9a208ba5c9 fs: remove __sync_filesystem
There is no clear benefit in having this helper vs just open coding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig
8c6aabd1c7 nfsd/blocklayout: use ->get_unique_id instead of sending SCSI commands
Call the ->get_unique_id method to query the SCSI identifiers.  This can
use the cached VPD page in the sd driver instead of sending a command
on every LAYOUTGET.  It will also allow to support NVMe based volumes
if the draft for that ever takes off.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20211021060607.264371-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:33:57 -06:00
Trond Myklebust
4cd27df88a NFS: Remove redundant call to __set_page_dirty_nobuffers
Remove a redundant call in nfs_updatepage(). nfs_writepage_setup() will
have already called nfs_mark_request_dirty() on success.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-21 17:11:41 -04:00
Pavel Begunkov
b22fa62a35 io_uring: apply worker limits to previous users
Another change to the API io-wq worker limitation API added in 5.15,
apply the limit to all prior users that already registered a tctx. It
may be confusing as it's now, in particular the change covers the
following 2 cases:

TASK1                   | TASK2
_________________________________________________
ring = create()         |
                        | limit_iowq_workers()
*not limited*           |

TASK1                   | TASK2
_________________________________________________
ring = create()         |
                        | issue_requests()
limit_iowq_workers()    |
                        | *not limited*

A note on locking, it's safe to traverse ->tctx_list as we hold
->uring_lock, but do that after dropping sqd->lock to avoid possible
problems. It's also safe to access tctx->io_wq there because tasks
kill it only after removing themselves from tctx_list, see
io_uring_cancel_generic() -> io_uring_clean_tctx()

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6e09ecc3545e4dc56e43c906ee3d71b7ae21bed.1634818641.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 11:19:38 -06:00
Miklos Szeredi
964d32e512 fuse: clean up error exits in fuse_fill_super()
Instead of "goto err", return error directly, since there's no error
cleanup to do now.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:39 +02:00
Miklos Szeredi
80019f1138 fuse: always initialize sb->s_fs_info
Syzkaller reports a null pointer dereference in fuse_test_super() that is
caused by sb->s_fs_info being NULL.

This is due to the fact that fuse_fill_super() is initializing s_fs_info,
which is too late, it's already on the fs_supers list.  The initialization
needs to be done in sget_fc() with the sb_lock held.

Move allocation of fuse_mount and fuse_conn from fuse_fill_super() into
fuse_get_tree().

After this ->kill_sb() will always be called with non-NULL ->s_fs_info,
hence fuse_mount_destroy() can drop the test for non-NULL "fm".

Reported-by: syzbot+74a15f02ccb51f398601@syzkaller.appspotmail.com
Fixes: 5d5b74aa9c ("fuse: allow sharing existing sb")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:39 +02:00
Miklos Szeredi
c191cd07ee fuse: clean up fuse_mount destruction
1. call fuse_mount_destroy() for open coded variants

2. before deactivate_locked_super() don't need fuse_mount destruction since
that will now be done (if ->s_fs_info is not cleared)

3. rearrange fuse_mount setup in fuse_get_tree_submount() so that the
regular pattern can be used

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:39 +02:00
Miklos Szeredi
a27c061a49 fuse: get rid of fuse_put_super()
The ->put_super callback is called from generic_shutdown_super() in case of
a fully initialized sb.  This is called from kill_***_super(), which is
called from ->kill_sb instances.

Fuse uses ->put_super to destroy the fs specific fuse_mount and drop the
reference to the fuse_conn, while it does the same on each error case
during sb setup.

This patch moves the destruction from fuse_put_super() to
fuse_mount_destroy(), called at the end of all ->kill_sb instances.  A
follup patch will clean up the error paths.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:38 +02:00
Miklos Szeredi
d534d31d6a fuse: check s_root when destroying sb
Checking "fm" works because currently sb->s_fs_info is cleared on error
paths; however, sb->s_root is what generic_shutdown_super() checks to
determine whether the sb was fully initialized or not.

This change will allow cleanup of sb setup error paths.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:38 +02:00
Ian Kent
25f54d08f1 autofs: fix wait name hash calculation in autofs_wait()
There's a mistake in commit 2be7828c9f ("get rid of autofs_getpath()")
that affects kernels from v5.13.0, basically missed because of me not
fully testing the change for Al.

The problem is that the hash calculation for the wait name qstr hasn't
been updated to account for the change to use dentry_path_raw(). This
prevents the correct matching an existing wait resulting in multiple
notifications being sent to the daemon for the same mount which must
not occur.

The problem wasn't discovered earlier because it only occurs when
multiple processes trigger a request for the same mount concurrently
so it only shows up in more aggressive testing.

Fixes: 2be7828c9f ("get rid of autofs_getpath()")
Cc: stable@vger.kernel.org
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-10-20 21:09:02 -04:00
Len Baker
6446c4fb12 aio: Prefer struct_size over open coded arithmetic
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.

So, use the struct_size() helper to do the arithmetic instead of the
argument "size + size * count" in the kzalloc() function.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments

Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-10-20 18:24:31 -05:00
Len Baker
98b160c828 writeback: prefer struct_size over open coded arithmetic
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.

In this case these are not actually dynamic sizes: all the operands
involved in the calculation are constant values. However it is better to
refactor them anyway, just to keep the open-coded math idiom out of
code.

So, use the struct_size() helper to do the arithmetic instead of the
argument "size + count * size" in the kzalloc() functions.

This code was detected with the help of Coccinelle and audited and fixed
manually.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments

Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-10-20 18:20:28 -05:00
Gustavo A. R. Silva
c2e4e3b756 xfs: Use kvcalloc() instead of kvzalloc()
Use 2-factor argument multiplication form kvcalloc() instead of
kvzalloc().

Link: https://github.com/KSPP/linux/issues/162
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-10-20 18:14:12 -05:00
Anna Schumaker
5fe1210d25 NFS: Unexport nfs_probe_fsinfo()
All the callers are now in client.c so we can remove the
EXPORT_SYMBOL_GPL() and make it static.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:55 -04:00
Anna Schumaker
1301ba603c NFS: Call nfs_probe_server() during a fscontext-reconfigure event
This lets us update the server's attributes when the user does a "mount
-o remount" on the filesystem.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:55 -04:00
Anna Schumaker
4d4cf8d2d6 NFS: Replace calls to nfs_probe_fsinfo() with nfs_probe_server()
Clean up. There are a few places where we want to probe the server, but
don't actually care about the fsinfo result. Change these to use
nfs_probe_server(), which handles the fattr allocation for us.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Anna Schumaker
e5731131fb NFS: Move nfs_probe_destination() into the generic client
And rename it to nfs_probe_server(). I also change it to take the nfs_fh
as an argument so callers can choose what filehandle to probe.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Anna Schumaker
01dde76e47 NFS: Create an nfs4_server_set_init_caps() function
And call it before doing an FSINFO probe to reset to the baseline
capabilities before probing.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Chuck Lever
86882c7546 NFS: Remove --> and <-- dprintk call sites
dprintk call sites that display no other information than the
function name can be replaced with use of the trace "function" or
"function_graph" plug-ins.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Chuck Lever
b40887e10d SUNRPC: Trace calls to .rpc_call_done
Introduce a single tracepoint that can replace simple dprintk call
sites in upper layer "rpc_call_done" callbacks. Example:

   kworker/u24:2-1254  [001]   771.026677: rpc_stats_latency:    task:00000001@00000002 xid=0x16a6f3c0 rpcbindv2 GETPORT backlog=446 rtt=101 execute=555
   kworker/u24:2-1254  [001]   771.026677: rpc_task_call_done:   task:00000001@00000002 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpcb_getport_done
   kworker/u24:2-1254  [001]   771.026678: rpcb_setport:         task:00000001@00000002 status=0 port=20048

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Chuck Lever
d9f877433e NFS: Replace dprintk callsites in nfs_readpage(s)
These new events report slightly different information for readpage
and readpages/readahead.

For readpage:
             fsx-1387  [006]   380.761896: nfs_aop_readpage:    fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355910932437 offset=131072
             fsx-1387  [006]   380.761900: nfs_aop_readpage_done: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355910932437 offset=131072 ret=0

The index of a synchronous single-page read is reported.

For readpages:

             fsx-1387  [006]   380.760847: nfs_aop_readahead:   fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355909932456 nr_pages=3
             fsx-1387  [006]   380.760853: nfs_aop_readahead_done: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355909932456 nr_pages=3 ret=0

The count of pages requested is reported. nfs_readpages does not
wait for the READ requests to complete.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Chuck Lever
b4776a341e SUNRPC: Tracepoints should display tk_pid and cl_clid as a fixed-size field
For certain special cases, RPC-related tracepoints record a -1 as
the task ID or the client ID. It's ugly for a trace event to display
4 billion in these cases.

To help keep SUNRPC tracepoints consistent, create a macro that
defines the print format specifiers for tk_pid and cl_clid. At some
point in the future we might try tk_pid with a wider range of values
than 0..64K so this makes it easier to make that change.

RPC tracepoints now look like this:

<...>-1276  [009]   149.720358: rpc_clnt_new:         client=00000005 peer=[192.168.2.55]:20049 program=nfs server=klimt.ib

<...>-1342  [004]   149.921234: rpc_xdr_recvfrom:     task:0000001a@00000005 head=[0xff1242d9ab6dc01c,144] page=0 tail=[(nil),0] len=144
<...>-1342  [004]   149.921235: xprt_release_cong:    task:0000001a@00000005 snd_task:ffffffff cong=256 cwnd=16384
<...>-1342  [004]   149.921235: xprt_put_cong:        task:0000001a@00000005 snd_task:ffffffff cong=0 cwnd=16384

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Alexey Gladkov
d5f458a979 Fix user namespace leak
Fixes: 61ca2c4afd ("NFS: Only reference user namespace from nfs4idmap struct instead of cred")
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Trond Myklebust
e591b298d7 NFS: Save some space in the inode
Save some space in the nfs_inode by setting up an anonymous union with
the fields that are peculiar to a specific type of filesystem object.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Trond Myklebust
6e176d4716 NFSv4: Fixes for nfs4_inode_return_delegation()
We mustn't call nfs_wb_all() on anything other than a regular file.
Furthermore, we can exit early when we don't hold a delegation.

Reported-by: David Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:53 -04:00
Trond Myklebust
f0caea8882 NFS: Fix an Oops in pnfs_mark_request_commit()
Olga reports seeing the following Oops when doing O_DIRECT writes to a
pNFS flexfiles server:

Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 234186 Comm: kworker/u8:1 Not tainted 5.15.0-rc4+ #4
Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
Workqueue: nfsiod rpc_async_release [sunrpc]
RIP: 0010:nfs_mark_request_commit+0x12/0x30 [nfs]
Code: ff ff be 03 00 00 00 e8 ac 34 83 eb e9 29 ff ff
ff e8 22 bc d7 eb 66 90 0f 1f 44 00 00 48 85 f6 74 16 48 8b 42 10 48
8b 40 18 <48> 8b 40 18 48 85 c0 74 05 e9 70 fc 15 ec 48 89 d6 e9 68 ed
ff ff
RSP: 0018:ffffa82f0159fe00 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8f3393141880 RCX: 0000000000000000
RDX: ffffa82f0159fe08 RSI: ffff8f3381252500 RDI: ffff8f3393141880
RBP: ffff8f33ac317c00 R08: 0000000000000000 R09: ffff8f3487724cb0
R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000001
R13: ffff8f3485bccee0 R14: ffff8f33ac317c10 R15: ffff8f33ac317cd8
FS:  0000000000000000(0000) GS:ffff8f34fbc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000018 CR3: 0000000122120006 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 nfs_direct_write_completion+0x13b/0x250 [nfs]
 rpc_free_task+0x39/0x60 [sunrpc]
 rpc_async_release+0x29/0x40 [sunrpc]
 process_one_work+0x1ce/0x370
 worker_thread+0x30/0x380
 ? process_one_work+0x370/0x370
 kthread+0x11a/0x140
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30

Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes: 9c455a8c1e ("NFS/pNFS: Clean up pNFS commit operations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:53 -04:00
Trond Myklebust
133a48abf6 NFS: Fix up commit deadlocks
If O_DIRECT bumps the commit_info rpcs_out field, then that could lead
to fsync() hangs. The fix is to ensure that O_DIRECT calls
nfs_commit_end().

Fixes: 723c921e7d ("sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:45 -04:00
Linus Torvalds
2f111a6fd5 Merge tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
 "Two important filesystem fixes, marked for stable.

  The blocklisted superblocks issue was particularly annoying because
  for unexperienced users it essentially exacted a reboot to establish a
  new functional mount in that scenario"

* tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-client:
  ceph: fix handling of "meta" errors
  ceph: skip existing superblocks that are blocklisted or shut down when mounting
2021-10-20 10:23:05 -10:00
Andreas Gruenbacher
1b223f7065 gfs2: Eliminate ip->i_gh
Now that gfs2_file_buffered_write is the only remaining user of
ip->i_gh, we can move the glock holder to the stack (or rather, use the
one we already have on the stack); there is no need for keeping the
holder in the inode anymore.

This is slightly complicated by the fact that we're using ip->i_gh for
the statfs inode in gfs2_file_buffered_write as well.  Writing to the
statfs inode isn't very common, so allocate the statfs holder
dynamically when needed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-20 19:33:09 +02:00
Andreas Gruenbacher
b924bdab74 gfs2: Move the inode glock locking to gfs2_file_buffered_write
So far, for buffered writes, we were taking the inode glock in
gfs2_iomap_begin and dropping it in gfs2_iomap_end with the intention of
not holding the inode glock while iomap_write_actor faults in user
pages.  It turns out that iomap_write_actor is called inside iomap_begin
... iomap_end, so the user pages were still faulted in while holding the
inode glock and the locking code in iomap_begin / iomap_end was
completely pointless.

Move the locking into gfs2_file_buffered_write instead.  We'll take care
of the potential deadlocks due to faulting in user pages while holding a
glock in a subsequent patch.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-20 19:33:09 +02:00
Bob Peterson
dc732906c2 gfs2: Introduce flag for glock holder auto-demotion
This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure that
will allow glocks to be demoted automatically on locking conflicts.
When a locking request comes in that isn't compatible with the locking
state of an active holder and that holder has the HIF_MAY_DEMOTE flag
set, the holder will be demoted before the incoming locking request is
granted.

Note that this mechanism demotes active holders (with the HIF_HOLDER
flag set), while before we were only demoting glocks without any active
holders.  This allows processes to keep hold of locks that may form a
cyclic locking dependency; the core glock logic will then break those
dependencies in case a conflicting locking request occurs.  We'll use
this to avoid giving up the inode glock proactively before faulting in
pages.

Processes that allow a glock holder to be taken away indicate this by
calling gfs2_holder_allow_demote(), which sets the HIF_MAY_DEMOTE flag.
Later, they call gfs2_holder_disallow_demote() to clear the flag again,
and then they check if their holder is still queued: if it is, they are
still holding the glock; if it isn't, they can re-acquire the glock (or
abort).

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-20 19:33:08 +02:00
Andreas Gruenbacher
6144464937 gfs2: Clean up function may_grant
Pass the first current glock holder into function may_grant and
deobfuscate the logic there.

While at it, switch from BUG_ON to GLOCK_BUG_ON in may_grant.  To make
that build cleanly, de-constify the may_grant arguments.

We're now using function find_first_holder in do_promote, so move the
function's definition above do_promote.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-20 19:33:08 +02:00
Andreas Gruenbacher
2eb7509a05 gfs2: Add wrapper for iomap_file_buffered_write
Add a wrapper around iomap_file_buffered_write.  We'll add code for when
the operation needs to be retried here later.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-20 19:33:08 +02:00