Commit Graph

74668 Commits

Author SHA1 Message Date
Pavel Begunkov
8d4ad41e3e io_uring: prolong tctx_task_work() with flushing
io_submit_flush_completions() may enqueue linked requests for task_work
execution, so don't leave tctx_task_work() right after the tw list is
exhausted, but try to flush and then retry.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0755d4c2c36301447c63bdd4146c10477cea4249.1630539342.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03 06:16:15 -06:00
Pavel Begunkov
636378535a io_uring: don't disable kiocb_done() CQE batching
Not passing issue_flags from kiocb_done() into __io_complete_rw() means
that completion batching for this case is disabled, e.g. for most of
buffered reads.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2689462835c3ee28a5999ef4f9a581e24be04a2.1630539342.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03 06:16:14 -06:00
Jens Axboe
fa84693b3c io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
SQPOLL has a different thread doing submissions, we need to check for
that and use the right task context when updating the worker values.
Just hold the sqd->lock across the operation, this ensures that the
thread cannot go away while we poke at ->io_uring.

Link: https://github.com/axboe/liburing/issues/420
Fixes: 2e480058dd ("io-wq: provide a way to limit max number of workers")
Reported-by: Johannes Lundberg <johalun0@gmail.com>
Tested-by: Johannes Lundberg <johalun0@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03 06:16:11 -06:00
Colin Ian King
05a444d3f9 ceph: fix dereference of null pointer cf
Currently in the case where kmem_cache_alloc fails the null pointer
cf is dereferenced when assigning cf->is_capsnap = false. Fix this
by adding a null pointer check and return path.

Cc: stable@vger.kernel.org
Addresses-Coverity: ("Dereference null return")
Fixes: b2f9fa1f3b ("ceph: correctly handle releasing an embedded cap flush")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-03 10:55:51 +02:00
Linus Torvalds
a9c9a6f741 Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
 "This series consists of the usual driver updates (ufs, qla2xxx,
  target, smartpqi, lpfc, mpt3sas).

  The core change causing the most churn was replacing the command
  request field request with a macro, allowing us to offset map to it
  and remove the redundant field; the same was also done for the tag
  field.

  The most impactful change is the final removal of scsi_ioctl, which
  has been deprecated for over a decade"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits)
  scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1
  scsi: ufs: ufs-exynos: Fix static checker warning
  scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI
  scsi: lpfc: Use the proper SCSI midlayer interfaces for PI
  scsi: lpfc: Copyright updates for 14.0.0.1 patches
  scsi: lpfc: Update lpfc version to 14.0.0.1
  scsi: lpfc: Add bsg support for retrieving adapter cmf data
  scsi: lpfc: Add cmf_info sysfs entry
  scsi: lpfc: Add debugfs support for cm framework buffers
  scsi: lpfc: Add support for maintaining the cm statistics buffer
  scsi: lpfc: Add rx monitoring statistics
  scsi: lpfc: Add support for the CM framework
  scsi: lpfc: Add cmfsync WQE support
  scsi: lpfc: Add support for cm enablement buffer
  scsi: lpfc: Add cm statistics buffer support
  scsi: lpfc: Add EDC ELS support
  scsi: lpfc: Expand FPIN and RDF receive logging
  scsi: lpfc: Add MIB feature enablement support
  scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware
  scsi: fc: Add EDC ELS definition
  ...
2021-09-02 15:09:46 -07:00
Jeff Layton
9f3589993c ceph: drop the mdsc_get_session/put_session dout messages
These are very chatty, racy, and not terribly useful. Just remove them.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
3eaf5aa1cf ceph: lockdep annotations for try_nonblocking_invalidate
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Xiubo Li
a76d0a9c28 ceph: don't WARN if we're forcibly removing the session caps
For example in the case of a forced umount, we'll remove all the session
caps even if they are dirty. Move the warning to a wrapper function and
make most of the callers use it. Call the core function when removing
caps due to a forced umount.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Xiubo Li
42ad631b4d ceph: don't WARN if we're force umounting
Force umount will try to close the sessions by setting the session
state to _CLOSING. We don't want to WARN in this situation, since it's
expected.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Xiubo Li
a6d37ccdd2 ceph: remove the capsnaps when removing caps
capsnaps will take inode references via ihold when queueing to flush.
When force unmounting, the client will just close the sessions and
may never get a flush reply, causing a leak and inode ref leak.

Fix this by removing the capsnaps for an inode when removing the caps.

URL: https://tracker.ceph.com/issues/52295
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
b11ed50346 ceph: request Fw caps before updating the mtime in ceph_write_iter
The current code will update the mtime and then try to get caps to
handle the write. If we end up having to request caps from the MDS, then
the mtime in the cap grant will clobber the updated mtime and it'll be
lost.

This is most noticable when two clients are alternately writing to the
same file. Fw caps are continually being granted and revoked, and the
mtime ends up stuck because the updated mtimes are always being
overwritten with the old one.

Fix this by changing the order of operations in ceph_write_iter to get
the caps before updating the times. Also, make sure we check the pool
full conditions before even getting any caps or uninlining.

URL: https://tracker.ceph.com/issues/46574
Reported-by: Jozef Kováč <kovac@firma.zoznam.sk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Xiubo Li
d517b3983d ceph: reconnect to the export targets on new mdsmaps
In the case where the export MDS has crashed just after the EImportStart
journal is flushed, a standby MDS takes over for it and when replaying
the EImportStart journal the MDS will wait the client to reconnect. That
may never happen because the client may not have registered or opened
the sessions yet.

When receiving a new map, ensure we reconnect to valid export targets as
well if their sessions don't exist yet.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
692e171597 ceph: print more information when we can't find snaprealm
Print a bit more information when we can't find the realm during
ceph_add_cap. Show both the inode number and the old realm inode
number.

Suggested-by: Sage Weil <sage@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
0ba92e1c5f ceph: add ceph_change_snap_realm() helper
Consolidate some fiddly code for changing an inode's snap_realm
into a new helper function, and change the callers to use it.

While we're in here, nothing uses the i_snap_realm_counter field, so
remove that from the inode.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
c80dc3aee9 ceph: remove redundant initializations from mdsc and session
The ceph_mds_client and ceph_mds_session structures are kzalloc'ed so
there's no need to explicitly initialize either of their fields to 0.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
b4002173b7 ceph: cancel delayed work instead of flushing on mdsc teardown
The first thing metric_delayed_work does is check mdsc->stopping,
and then return immediately if it's set. That's good since we would
have already torn down the metric structures at this point, otherwise,
but there is no locking around mdsc->stopping.

It's possible that the ceph_metric_destroy call could race with the
delayed_work, in which case we could end up with the delayed_work
accessing destroyed percpu variables.

At this point in the mdsc teardown, the "stopping" flag has already been
set, so there's no benefit to flushing the work. Move the work
cancellation in ceph_metric_destroy ahead of the percpu variable
destruction, and eliminate the flush_delayed_work call in
ceph_mdsc_destroy.

Fixes: 18f473b384 ("ceph: periodically send perf metrics to MDSes")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
40e309de4d ceph: add a new vxattr to return auth mds for an inode
Add a new vxattr that shows what MDS is authoritative for an inode (if
we happen to have auth caps). If we don't have an auth cap for the inode
then just return -1.

URL: https://tracker.ceph.com/issues/1276
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Jeff Layton
49f8899e5e ceph: remove some defunct forward declarations
We missed these in the recent fscache rework.

Reported-by: Steve French <smfrench@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Xiubo Li
e1a4541ec0 ceph: flush the mdlog before waiting on unsafe reqs
For the client requests who will have unsafe and safe replies from
MDS daemons, in the MDS side the MDS daemons won't flush the mdlog
(journal log) immediatelly, because they think it's unnecessary.
That's true for most cases but not all, likes the fsync request.
The fsync will wait until all the unsafe replied requests to be
safely replied.

Normally if there have multiple threads or clients are running, the
whole mdlog in MDS daemons could be flushed in time if any request
will trigger the mdlog submit thread. So usually we won't experience
the normal operations will stuck for a long time. But in case there
has only one client with only thread is running, the stuck phenomenon
maybe obvious and the worst case it must wait at most 5 seconds to
wait the mdlog to be flushed by the MDS's tick thread periodically.

This patch will trigger to flush the mdlog in the relevant and auth
MDSes to which the in-flight requests are sent just before waiting
the unsafe requests to finish.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Xiubo Li
d095559ce4 ceph: flush mdlog before umounting
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Xiubo Li
59b312f362 ceph: make iterate_sessions a global symbol
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Xiubo Li
fba97e8025 ceph: make ceph_create_session_msg a global symbol
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Jeff Layton
ce3a8732ae ceph: fix comment about short copies in ceph_write_end
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Jeff Layton
2ad32cf09b ceph: fix memory leak on decode error in ceph_handle_caps
If we hit a decoding error late in the frame, then we might exit the
function without putting the pool_ns string. Ensure that we always put
that reference on the way out of the function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Linus Torvalds
c815f04ba9 Merge tag 'linux-kselftest-kunit-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit updates from Shuah Khan:
 "This KUnit update for Linux 5.15-rc1 adds new features and tests:

  Tool:

   - support for '--kernel_args' to allow setting module params

   - support for '--raw_output' option to show just the kunit output
     during make

  Tests:

   - new KUnit tests for checksums and timestamps

   - Print test statistics on failure

   - Integrates UBSAN into the KUnit testing framework. It fails KUnit
     tests whenever it reports undefined behavior"

* tag 'linux-kselftest-kunit-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: Print test statistics on failure
  kunit: tool: make --raw_output support only showing kunit output
  kunit: tool: add --kernel_args to allow setting module params
  kunit: ubsan integration
  fat: Add KUnit tests for checksums and timestamps
2021-09-02 12:32:12 -07:00
Linus Torvalds
eceae1e7ac Merge tag 'configfs-5.15' of git://git.infradead.org/users/hch/configfs
Pull configfs updates from Christoph Hellwig:

 - fix a race in configfs_lookup (Sishuai Gong)

 - minor cleanups (me)

* tag 'configfs-5.15' of git://git.infradead.org/users/hch/configfs:
  configfs: fix a race in configfs_lookup()
  configfs: fold configfs_attach_attr into configfs_lookup
  configfs: simplify the configfs_dirent_is_ready
  configfs: return -ENAMETOOLONG earlier in configfs_lookup
2021-09-02 10:27:17 -07:00
Linus Torvalds
265113f70f Merge tag 'dlm-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
 "This set includes a number of minor fixes and cleanups related to the
  networking changes in the last release.

  A patch to delay ack messages reduces network traffic significantly"

* tag 'dlm-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  fs: dlm: avoid comms shutdown delay in release_lockspace
  fs: dlm: fix return -EINTR on recovery stopped
  fs: dlm: implement delayed ack handling
  fs: dlm: move receive loop into receive handler
  fs: dlm: fix multiple empty writequeue alloc
  fs: dlm: generic connect func
  fs: dlm: auto load sctp module
  fs: dlm: introduce generic listen
  fs: dlm: move to static proto ops
  fs: dlm: introduce con_next_wq helper
  fs: dlm: cleanup and remove _send_rcom
  fs: dlm: clear CF_APP_LIMITED on close
  fs: dlm: fix typo in tlv prefix
  fs: dlm: use READ_ONCE for config var
  fs: dlm: use sk->sk_socket instead of con->sock
2021-09-02 10:19:45 -07:00
Jens Axboe
3146cba99a io-wq: make worker creation resilient against signals
If a task is queueing async work and also handling signals, then we can
run into the case where create_io_thread() is interrupted and returns
failure because of that. If this happens for creating the first worker
in a group, then that worker will never get created and we can hang the
ring.

If we do get a fork failure, retry from task_work. With signals we have
to be a bit careful as we cannot simply queue as task_work, as we'll
still have signals pending at that point. Punt over a normal workqueue
first and then create from task_work after that.

Lastly, ensure that we handle fatal worker creations. Worker creation
failures are normally not fatal, only if we fail to create one in an empty
worker group can we not make progress. Right now that is ignored, ensure
that we handle that and run cancel on the work item.

There are two paths that create new workers - one is the "existing worker
going to sleep", and the other is "no workers found for this work, create
one". The former is never fatal, as workers do exist in the group. Only
the latter needs to be carefully handled.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-02 11:12:19 -06:00
Jens Axboe
05c5f4ee4d io-wq: get rid of FIXED worker flag
It makes the logic easier to follow if we just get rid of the fixed worker
flag, and simply ensure that we never exit the last worker in the group.
This also means that no particular worker is special.

Just track the last timeout state, and if we have hit it and no work
is pending, check if there are other workers. If yes, then we can exit
this one safely.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-02 11:12:16 -06:00
Linus Torvalds
b0cfcdd9b9 d_path: make 'prepend()' fill up the buffer exactly on overflow
Instead of just marking the buffer as having overflowed, fill it up as
much as we can.  That will allow the overflow case to then return
whatever truncated result if it wants to.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-02 10:07:29 -07:00
Linus Torvalds
111c1aa8ca Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
 "In addition to some ext4 bug fixes and cleanups, this cycle we add the
  orphan_file feature, which eliminates bottlenecks when doing a large
  number of parallel truncates and file deletions, and move the discard
  operation out of the jbd2 commit thread when using the discard mount
  option, to better support devices with slow discard operations"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
  ext4: make the updating inode data procedure atomic
  ext4: remove an unnecessary if statement in __ext4_get_inode_loc()
  ext4: move inode eio simulation behind io completeion
  ext4: Improve scalability of ext4 orphan file handling
  ext4: Orphan file documentation
  ext4: Speedup ext4 orphan inode handling
  ext4: Move orphan inode handling into a separate file
  ext4: Support for checksumming from journal triggers
  ext4: fix race writing to an inline_data file while its xattrs are changing
  jbd2: add sparse annotations for add_transaction_credits()
  ext4: fix sparse warnings
  ext4: Make sure quota files are not grabbed accidentally
  ext4: fix e2fsprogs checksum failure for mounted filesystem
  ext4: if zeroout fails fall back to splitting the extent node
  ext4: reduce arguments of ext4_fc_add_dentry_tlv
  ext4: flush background discard kwork when retry allocation
  ext4: get discard out of jbd2 commit kthread contex
  ext4: remove the repeated comment of ext4_trim_all_free
  ext4: add new helper interface ext4_try_to_trim_range()
  ext4: remove the 'group' parameter of ext4_trim_extent
  ...
2021-09-02 09:37:09 -07:00
Linus Torvalds
815409a12c Merge tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs update from Miklos Szeredi:

 - Copy up immutable/append/sync/noatime attributes (Amir Goldstein)

 - Improve performance by enabling RCU lookup.

 - Misc fixes and improvements

The reason this touches so many files is that the ->get_acl() method now
gets a "bool rcu" argument.  The ->get_acl() API was updated based on
comments from Al and Linus:

Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/

* tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: enable RCU'd ->get_acl()
  vfs: add rcu argument to ->get_acl() callback
  ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup()
  ovl: use kvalloc in xattr copy-up
  ovl: update ctime when changing fileattr
  ovl: skip checking lower file's i_writecount on truncate
  ovl: relax lookup error on mismatch origin ftype
  ovl: do not set overlay.opaque for new directories
  ovl: add ovl_allow_offline_changes() helper
  ovl: disable decoding null uuid with redirect_dir
  ovl: consistent behavior for immutable/append-only inodes
  ovl: copy up sync/noatime fileattr flags
  ovl: pass ovl_fs to ovl_check_setxattr()
  fs: add generic helper for filling statx attribute flags
2021-09-02 09:21:27 -07:00
Linus Torvalds
412106c203 Merge tag 'erofs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
 "In this cycle, direct I/O and fsdax support for uncompressed files are
  now added in order to avoid double-caching for loop device and VM
  container use cases. All uncompressed cases are now turned into iomap
  infrastructure, which looks much simpler and cleaner.

  In addition, fiemap support is added for both (un)compressed files by
  using iomap infrastructure as well so end users can easily get file
  distribution. We've also added chunk-based uncompressed files support
  for data deduplication as the next step of VM container use cases.

  Summary:

   - support direct I/O for all uncompressed files

   - support fsdax for non-tailpacking regular files

   - use iomap infrastructure for all uncompressed cases

   - support fiemap for both (un)compressed files

   - introduce chunk-based files for chunk deduplication

   - some cleanups"

* tag 'erofs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix double free of 'copied'
  erofs: support reading chunk-based uncompressed files
  erofs: introduce chunk-based file on-disk format
  erofs: add fiemap support with iomap
  erofs: add support for the full decompressed length
  erofs: remove the mapping parameter from erofs_try_to_free_cached_page()
  erofs: directly use wrapper erofs_page_is_managed() when shrinking
  erofs: convert all uncompressed cases to iomap
  erofs: dax support for non-tailpacking regular file
  erofs: iomap support for non-tailpacking DIO
2021-09-02 09:12:05 -07:00
Linus Torvalds
89594c746b Merge tag 'fscache-next-20210829' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fscache updates from David Howells:
 "Preparatory work for the fscache rewrite that's being worked on and
  fix some bugs. These include:

   - Always select netfs stats when enabling fscache stats since they're
     displayed through the same procfile.

   - Add a cookie debug ID that can be used in tracepoints instead of a
     pointer and cache it in the netfs_cache_resources struct rather
     than in the netfs_read_request struct to make it more available.

   - Use file_inode() in cachefiles rather than dereferencing
     file->f_inode directly.

   - Provide a procfile to display fscache cookies.

   - Remove the fscache and cachefiles histogram procfiles.

   - Remove the fscache object list procfile.

   - Avoid using %p in fscache and cachefiles as the value is hashed and
     not comparable to the register dump in an oops trace.

   - Fix the cookie hash function to actually achieve useful dispersion.

   - Fix fscache_cookie_put() so that it doesn't dereference the cookie
     pointer in the tracepoint after the refcount has been decremented
     (we're only allowed to do that if we decremented it to zero).

   - Use refcount_t rather than atomic_t for the fscache_cookie
     refcount"

* tag 'fscache-next-20210829' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  fscache: Use refcount_t for the cookie refcount instead of atomic_t
  fscache: Fix fscache_cookie_put() to not deref after dec
  fscache: Fix cookie key hashing
  cachefiles: Change %p in format strings to something else
  fscache: Change %p in format strings to something else
  fscache: Remove the object list procfile
  fscache, cachefiles: Remove the histogram stuff
  fscache: Procfile to display cookies
  fscache: Add a cookie debug ID and use that in traces
  cachefiles: Use file_inode() rather than accessing ->f_inode
  netfs: Move cookie debug ID to struct netfs_cache_resources
  fscache: Select netfs stats if fscache stats are enabled
2021-09-02 09:06:28 -07:00
Kari Argillander
2e3a51b59e fs/ntfs3: Change how module init/info messages are displayed
Usually in file system init() messages are only displayed in info level.
Change level from notice to info, but keep CONFIG_NTFS3_64BIT_CLUSTER in
notice level. Also this need even more attention so let's put big
warning here so that nobody will not try accidentally use it.

There is also no good reason to display internal stuff like binary tree
search. This is always on option which can only disabled for debugging
purposes by developer. Also this message does not even check if
developer has disabled it or not so it is useless info.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2021-09-02 18:51:27 +03:00
Kari Argillander
989e795bfe fs/ntfs3: Remove GPL boilerplates from decompress lib files
Files already have SDPX identifier so no reason to keep boilerplates in
these files anymore.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Acked-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2021-09-02 18:51:26 +03:00
Kari Argillander
dd854e4b5b fs/ntfs3: Remove unnecessary condition checking from ntfs_file_read_iter
This check will be also performed in generic_file_read_iter() so we do
not want to check this two times in a row.

This was founded with Smatch
	fs/ntfs3/file.c:803 ntfs_file_read_iter()
	warn: unused return: count = iov_iter_count()

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2021-09-02 18:51:26 +03:00
Kari Argillander
d4e8e135a9 fs/ntfs3: Fix integer overflow in ni_fiemap with fiemap_prep()
Use fiemap_prep() to check valid flags. It also shrink request scope
(@len) to what the fs can actually handle.

This address following Smatch static checker warning:
	fs/ntfs3/frecord.c:1894 ni_fiemap()
	warn: potential integer overflow from user 'vbo + len'

Because fiemap_prep() shrinks @len this cannot happened anymore.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: lore.kernel.org/ntfs3/20210825080440.GA17407@kili/

Fixes: 4342306f0f ("fs/ntfs3: Add file operations and implementation")
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2021-09-02 18:51:00 +03:00
Theodore Ts'o
1fd95c05d8 ext4: add error checking to ext4_ext_replay_set_iblocks()
If the call to ext4_map_blocks() fails due to an corrupted file
system, ext4_ext_replay_set_iblocks() can get stuck in an infinite
loop.  This could be reproduced by running generic/526 with a file
system that has inline_data and fast_commit enabled.  The system will
repeatedly log to the console:

EXT4-fs warning (device dm-3): ext4_block_to_path:105: block 1074800922 > max in inode 131076

and the stack that it gets stuck in is:

   ext4_block_to_path+0xe3/0x130
   ext4_ind_map_blocks+0x93/0x690
   ext4_map_blocks+0x100/0x660
   skip_hole+0x47/0x70
   ext4_ext_replay_set_iblocks+0x223/0x440
   ext4_fc_replay_inode+0x29e/0x3b0
   ext4_fc_replay+0x278/0x550
   do_one_pass+0x646/0xc10
   jbd2_journal_recover+0x14a/0x270
   jbd2_journal_load+0xc4/0x150
   ext4_load_journal+0x1f3/0x490
   ext4_fill_super+0x22d4/0x2c00

With this patch, generic/526 still fails, but system is no longer
locking up in a tight loop.  It's likely the root casue is that
fast_commit replay is corrupting file systems with inline_data, and we
probably need to add better error handling in the fast commit replay
code path beyond what is done here, which essentially just breaks the
infinite loop without reporting the to the higher levels of the code.

Fixes: 8016E29F4362 ("ext4: fast commit recovery path")
Cc: stable@kernel.org
Cc: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-09-02 11:36:01 -04:00
Linus Torvalds
90c90cda05 Merge tag 'xfs-5.15-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
 "There's a lot in this cycle.

  Starting with bug fixes: To avoid livelocks between the logging code
  and the quota code, we've disabled the ability of quotaoff to turn off
  quota accounting. (Admins can still disable quota enforcement, but
  truly turning off accounting requires a remount.) We've tried to do
  this in a careful enough way that there shouldn't be any user visible
  effects aside from quotaoff no longer randomly hanging the system.

  We've also fixed some bugs in runtime log behavior that could trip up
  log recovery if (otherwise unrelated) transactions manage to start and
  commit concurrently; some bugs in the GETFSMAP ioctl where we would
  incorrectly restrict the range of records output if the two xfs
  devices are of different sizes; a bug that resulted in fallocate
  funshare failing unnecessarily; and broken behavior in the xfs inode
  cache when DONTCACHE is in play.

  As for new features: we now batch inode inactivations in percpu
  background threads, which sharply decreases frontend thread wait time
  when performing file deletions and should improve overall directory
  tree deletion times. This eliminates both the problem where closing an
  unlinked file (especially on a frozen fs) can stall for a long time,
  and should also ease complaints about direct reclaim bogging down on
  unlinked file cleanup.

  Starting with this release, we've enabled pipelining of the XFS log.
  On workloads with high rates of metadata updates to different shards
  of the filesystem, multiple threads can be used to format committed
  log updates into log checkpoints.

  Lastly, with this release, two new features have graduated to
  supported status: inode btree counters (for faster mounts), and
  support for dates beyond Y2038. Expect these to be enabled by default
  in a future release of xfsprogs.

  Summary:

   - Fix a potential log livelock on busy filesystems when there's so
     much work going on that we can't finish a quotaoff before filling
     up the log by removing the ability to disable quota accounting.

   - Introduce the ability to use per-CPU data structures in XFS so that
     we can do a better job of maintaining CPU locality for certain
     operations.

   - Defer inode inactivation work to per-CPU lists, which will help us
     batch that processing. Deletions of large sparse files will
     *appear* to run faster, but all that means is that we've moved the
     work to the backend.

   - Drop the EXPERIMENTAL warnings from the y2038+ support and the
     inode btree counters, since it's been nearly a year and no
     complaints have come in.

   - Remove more of our bespoke kmem* variants in favor of using the
     standard Linux calls.

   - Prepare for the addition of log incompat features in upcoming
     cycles by actually adding code to support this.

   - Small cleanups of the xattr code in preparation for landing support
     for full logging of extended attribute updates in a future cycle.

   - Replace the various log shutdown state and flag code all over xfs
     with a single atomic bit flag.

   - Fix a serious log recovery bug where log item replay can be skipped
     based on the start lsn of a transaction even though the transaction
     commit lsn is the key data point for that by enforcing start lsns
     to appear in the log in the same order as commit lsns.

   - Enable pipelining in the code that pushes log items to disk.

   - Drop ->writepage.

   - Fix some bugs in GETFSMAP where the last fsmap record reported for
     a device could extend beyond the end of the device, and a separate
     bug where query keys for one device could be applied to another.

   - Don't let GETFSMAP query functions edit their input parameters.

   - Small cleanups to the scrub code's handling of perag structures.

   - Small cleanups to the incore inode tree walk code.

   - Constify btree function parameters that aren't changed, so that
     there will never again be confusion about range query functions
     changing their input parameters.

   - Standardize the format and names of tracepoint data attributes.

   - Clean up all the mount state and feature flags to use wrapped
     bitset functions instead of inconsistently open-coded flag checks.

   - Fix some confusion between xfs_buf hash table key variable vs.
     block number.

   - Fix a mis-interaction with iomap where we reported shared delalloc
     cow fork extents to iomap, which would cause the iomap unshare
     operation to return IO errors unnecessarily.

   - Fix DONTCACHE behavior"

* tag 'xfs-5.15-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (103 commits)
  xfs: fix I_DONTCACHE
  xfs: only set IOMAP_F_SHARED when providing a srcmap to a write
  xfs: fix perag structure refcounting error when scrub fails
  xfs: rename buffer cache index variable b_bn
  xfs: convert bp->b_bn references to xfs_buf_daddr()
  xfs: introduce xfs_buf_daddr()
  xfs: kill xfs_sb_version_has_v3inode()
  xfs: introduce xfs_sb_is_v5 helper
  xfs: remove unused xfs_sb_version_has wrappers
  xfs: convert xfs_sb_version_has checks to use mount features
  xfs: convert scrub to use mount-based feature checks
  xfs: open code sb verifier feature checks
  xfs: convert xfs_fs_geometry to use mount feature checks
  xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown
  xfs: convert remaining mount flags to state flags
  xfs: convert mount flags to features
  xfs: consolidate mount option features in m_features
  xfs: replace xfs_sb_version checks with feature flag checks
  xfs: reflect sb features in xfs_mount
  xfs: rework attr2 feature and mount options
  ...
2021-09-02 08:26:03 -07:00
Linus Torvalds
bcfeebbff3 Merge branch 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull exit cleanups from Eric Biederman:
 "In preparation of doing something about PTRACE_EVENT_EXIT I have
  started cleaning up various pieces of code related to do_exit. Most of
  that code I did not manage to get tested and reviewed before the merge
  window opened but a handful of very useful cleanups are ready to be
  merged.

  The first change is simply the removal of the bdflush system call. The
  code has now been disabled long enough that even the oldest userspace
  working userspace setups anyone can find to test are fine with the
  bdflush system call being removed.

  Changing m68k fsp040_die to use force_sigsegv(SIGSEGV) instead of
  calling do_exit directly is interesting only in that it is nearly the
  most difficult of the incorrect uses of do_exit to remove.

  The change to the seccomp code to simply send a signal instead of
  calling do_coredump directly is a very nice little cleanup made
  possible by realizing the existing signal sending helpers were missing
  a little bit of functionality that is easy to provide"

* 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal/seccomp: Dump core when there is only one live thread
  signal/seccomp: Refactor seccomp signal and coredump generation
  signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die
  exit/bdflush: Remove the deprecated bdflush system call
2021-09-01 14:52:05 -07:00
Linus Torvalds
48983701a1 Merge branch 'siginfo-si_trapno-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo si_trapno updates from Eric Biederman:
 "The full set of si_trapno changes was not appropriate as a fix for the
  newly added SIGTRAP TRAP_PERF, and so I postponed the rest of the
  related cleanups.

  This is the rest of the cleanups for si_trapno that reduces it from
  being a really weird arch special case that is expect to be always
  present (but isn't) on the architectures that support it to being yet
  another field in the _sigfault union of struct siginfo.

  The changes have been reviewed and marinated in linux-next. With the
  removal of this awkward special case new code (like SIGTRAP TRAP_PERF)
  that works across architectures should be easier to write and
  maintain"

* 'siginfo-si_trapno-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal: Rename SIL_PERF_EVENT SIL_FAULT_PERF_EVENT for consistency
  signal: Verify the alignment and size of siginfo_t
  signal: Remove the generic __ARCH_SI_TRAPNO support
  signal/alpha: si_trapno is only used with SIGFPE and SIGTRAP TRAP_UNK
  signal/sparc: si_trapno is only used with SIGILL ILL_ILLTRP
  arm64: Add compile-time asserts for siginfo_t offsets
  arm: Add compile-time asserts for siginfo_t offsets
  sparc64: Add compile-time asserts for siginfo_t offsets
2021-09-01 14:42:36 -07:00
Jens Axboe
15e20db2e0 io-wq: only exit on fatal signals
If the application uses io_uring and also relies heavily on signals
for communication, that can cause io-wq workers to spuriously exit
just because the parent has a signal pending. Just ignore signals
unless they are fatal.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-01 12:35:32 -06:00
Jens Axboe
f95dc207b9 io-wq: split bounded and unbounded work into separate lists
We've got a few issues that all boil down to the fact that we have one
list of pending work items, yet two different types of workers to
serve them. This causes some oddities around workers switching type and
even hashed work vs regular work on the same bounded list.

Just separate them out cleanly, similarly to how we already do
accounting of what is running. That provides a clean separation and
removes some corner cases that can cause stalls when handling IO
that is punted to io-wq.

Fixes: ecc53c48c1 ("io-wq: check max_worker limits if a worker transitions bound state")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-01 12:35:30 -06:00
Alexander Aring
ecd9567314 fs: dlm: avoid comms shutdown delay in release_lockspace
When dlm_release_lockspace does active shutdown on connections to
other nodes, the active shutdown will wait for any exisitng passive
shutdowns to be resolved.  But, the sequence of operations during
dlm_release_lockspace can prevent the normal resolution of passive
shutdowns (processed normally by way of lockspace recovery.)
This disruption of passive shutdown handling can cause the active
shutdown to wait for a full timeout period, delaying the completion
of dlm_release_lockspace.

To fix this, make dlm_release_lockspace resolve existing passive
shutdowns (by calling dlm_clear_members earlier), before it does
active shutdowns.  The active shutdowns will not find any passive
shutdowns to wait for, and will not be delayed.

Reported-by: Chris Mackowski <cmackows@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-09-01 11:29:14 -05:00
Linus Torvalds
c6c3c5704b Merge tag 'driver-core-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
 "Here is the big set of driver core patches for 5.15-rc1.

  These do change a number of different things across different
  subsystems, and because of that, there were 2 stable tags created that
  might have already come into your tree from different pulls that did
  the following

   - changed the bus remove callback to return void

   - sysfs iomem_get_mapping rework

  Other than those two things, there's only a few small things in here:

   - kernfs performance improvements for huge numbers of sysfs users at
     once

   - tiny api cleanups

   - other minor changes

  All of these have been in linux-next for a while with no reported
  problems, other than the before-mentioned merge issue"

* tag 'driver-core-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (33 commits)
  MAINTAINERS: Add dri-devel for component.[hc]
  driver core: platform: Remove platform_device_add_properties()
  ARM: tegra: paz00: Handle device properties with software node API
  bitmap: extend comment to bitmap_print_bitmask/list_to_buf
  drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI
  topology: use bin_attribute to break the size limitation of cpumap ABI
  lib: test_bitmap: add bitmap_print_bitmask/list_to_buf test cases
  cpumask: introduce cpumap_print_list/bitmask_to_buf to support large bitmask and list
  sysfs: Rename struct bin_attribute member to f_mapping
  sysfs: Invoke iomem_get_mapping() from the sysfs open callback
  debugfs: Return error during {full/open}_proxy_open() on rmmod
  zorro: Drop useless (and hardly used) .driver member in struct zorro_dev
  zorro: Simplify remove callback
  sh: superhyway: Simplify check in remove callback
  nubus: Simplify check in remove callback
  nubus: Make struct nubus_driver::remove return void
  kernfs: dont call d_splice_alias() under kernfs node lock
  kernfs: use i_lock to protect concurrent inode updates
  kernfs: switch kernfs to use an rwsem
  kernfs: use VFS negative dentry caching
  ...
2021-09-01 08:44:42 -07:00
Jaegeuk Kim
9605f75cf3 f2fs: should put a page beyond EOF when preparing a write
The prepare_compress_overwrite() gets/locks a page to prepare a read, and calls
f2fs_read_multi_pages() which checks EOF first. If there's any page beyond EOF,
we unlock the page and set cc->rpages[i] = NULL, which we can't put the page
anymore. This makes page leak, so let's fix by putting that page.

Fixes: a949dc5f2c ("f2fs: compress: fix race condition of overwrite vs truncate")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-31 14:39:39 -07:00
Jaegeuk Kim
827f02842e f2fs: deallocate compressed pages when error happens
In f2fs_write_multi_pages(), f2fs_compress_pages() allocates pages for
compression work in cc->cpages[]. Then, f2fs_write_compressed_pages() initiates
bio submission. But, if there's any error before submitting the IOs like early
f2fs_cp_error(), previously it didn't free cpages by f2fs_compress_free_page().
Let's fix memory leak by putting that just before deallocating cc->cpages.

Fixes: 4c8ff7095b ("f2fs: support data compression")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-31 14:39:24 -07:00
Jens Axboe
0242f6426e io-wq: fix queue stalling race
We need to set the stalled bit early, before we drop the lock for adding
us to the stall hash queue. If not, then we can race with new work being
queued between adding us to the stall hash and io_worker_handle_work()
marking us stalled.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-31 13:53:00 -06:00
Linus Torvalds
927bc120a2 Merge tag 'fs.close_range.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull close_range() cleanup from Christian Brauner:
 "This is a cleanup for close_range() which was sent as part of a bugfix
  we did some time ago in commit 9b5b872215 ("file: fix close_range()
  for unshare+cloexec").

  We used to share more code between some helpers for close_range()
  which made retrieving the maximum number of open fds before calling
  into the helpers sensible. But with the introduction of
  CLOSE_RANGE_CLOEXEC and the need to retrieve the number of maximum fds
  once more for CLOSE_RANGE_CLOEXEC that stopped making sense. So the
  code was in a dumb in-limbo state.

  Fix this by simplifying the code a bit.

  The original idea was to only fix the bug itself and make backporting
  easy. And since the cleanup wasn't very pressing I left it in
  linux-next for a very long time. I didn't pull the patches from the
  list again back then which is why they don't have lore-links. So I'm
  listing them below explicitly"

Commit 03ba0fe4d0 ("file: simplify logic in __close_range()")
Link: https://lore.kernel.org/linux-fsdevel/20210402123548.108372-3-brauner@kernel.org

Commit f49fd6d3c0 ("file: let pick_file() tell caller it's done")
Link: https://lore.kernel.org/linux-fsdevel/20210402123548.108372-4-brauner@kernel.org

* tag 'fs.close_range.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  file: simplify logic in __close_range()
  file: let pick_file() tell caller it's done
2021-08-31 12:00:07 -07:00