Commit Graph

91022 Commits

Author SHA1 Message Date
Christoph Hellwig
f50805713a
xfs: drop fop_flags for directories
Directories have non of the capabilities, so drop the flags.  Note that
the current state is harmless as no one actually checks for the flags
either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240423124608.537794-3-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-24 14:46:51 +02:00
Christoph Hellwig
19e048641b
xfs: fix overly long line in the file_operations
Re-wrap the newly added fop_flags fields to not go over 80 characters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240423124608.537794-2-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-24 14:46:39 +02:00
Thomas Weißschuh
a35dd3a786 sysctl: drop now unnecessary out-of-bounds check
Remove the now unneeded check for ctl_table_size; it is safe
to do so as sysctl_set_perm_empty_ctl_header() does not access the
ctl_table member anymore.

This also makes the element of sysctl_mount_point unnecessary, so drop
it at the same time.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-04-24 09:43:54 +02:00
Thomas Weißschuh
4a7b29f650 sysctl: move sysctl type to ctl_table_header
Move the SYSCTL_TABLE_TYPE_{DEFAULT,PERMANENTLY_EMPTY} enums from
ctl_table to ctl_table_header.
Removing the mutable member is necessary to constify static instances
of struct ctl_table.

Move the initialization of the sysctl_mount_point type into
init_header() where all the other header fields are also initialized.

As a side-effect the memory usage of the sysctl core is reduced.
Each ctl_table_header instance can manage multiple ctl_table instances
and is only allocated when the table is actually registered.
This saves 8 bytes of memory per ctl_table on 64bit, 4 due to the enum
field itself and 4 due to padding.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-04-24 09:43:54 +02:00
Thomas Weißschuh
eb32d3adef sysctl: drop sysctl_is_perm_empty_ctl_table
It is used only twice and those callers are simpler with
sysctl_is_perm_empty_ctl_header().
So use this sibling function.

This is part of an effort to constify definition of struct ctl_table.
For this effort the mutable member 'type' is moved from
struct ctl_table to struct ctl_table_header.
Unifying the macros sysctl_is_perm_empty_ctl_* makes this easier.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-04-24 09:43:54 +02:00
Thomas Weißschuh
520713a93d sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)
Remove the 'table' argument from set_ownership as it is never used. This
change is a step towards putting "struct ctl_table" into .rodata and
eventually having sysctl core only use "const struct ctl_table".

The patch was created with the following coccinelle script:

  @@
  identifier func, head, table, uid, gid;
  @@

  void func(
    struct ctl_table_header *head,
  - struct ctl_table *table,
    kuid_t *uid, kgid_t *gid)
  { ... }

No additional occurrences of 'set_ownership' were found after doing a
tree-wide search.

Reviewed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-04-24 09:43:54 +02:00
Gustavo A. R. Silva
9a1f1d04f6 smb: client: Fix struct_group() usage in __packed structs
Use struct_group_attr() in __packed structs, instead of struct_group().

Below you can see the pahole output before/after changes:

pahole -C smb2_file_network_open_info fs/smb/client/smb2ops.o
struct smb2_file_network_open_info {
	union {
		struct {
			__le64     CreationTime;         /*     0     8 */
			__le64     LastAccessTime;       /*     8     8 */
			__le64     LastWriteTime;        /*    16     8 */
			__le64     ChangeTime;           /*    24     8 */
			__le64     AllocationSize;       /*    32     8 */
			__le64     EndOfFile;            /*    40     8 */
			__le32     Attributes;           /*    48     4 */
		};                                       /*     0    56 */
		struct {
			__le64     CreationTime;         /*     0     8 */
			__le64     LastAccessTime;       /*     8     8 */
			__le64     LastWriteTime;        /*    16     8 */
			__le64     ChangeTime;           /*    24     8 */
			__le64     AllocationSize;       /*    32     8 */
			__le64     EndOfFile;            /*    40     8 */
			__le32     Attributes;           /*    48     4 */
		} network_open_info;                     /*     0    56 */
	};                                               /*     0    56 */
	__le32                     Reserved;             /*    56     4 */

	/* size: 60, cachelines: 1, members: 2 */
	/* last cacheline: 60 bytes */
} __attribute__((__packed__));

pahole -C smb2_file_network_open_info fs/smb/client/smb2ops.o
struct smb2_file_network_open_info {
	union {
		struct {
			__le64     CreationTime;         /*     0     8 */
			__le64     LastAccessTime;       /*     8     8 */
			__le64     LastWriteTime;        /*    16     8 */
			__le64     ChangeTime;           /*    24     8 */
			__le64     AllocationSize;       /*    32     8 */
			__le64     EndOfFile;            /*    40     8 */
			__le32     Attributes;           /*    48     4 */
		} __attribute__((__packed__));           /*     0    52 */
		struct {
			__le64     CreationTime;         /*     0     8 */
			__le64     LastAccessTime;       /*     8     8 */
			__le64     LastWriteTime;        /*    16     8 */
			__le64     ChangeTime;           /*    24     8 */
			__le64     AllocationSize;       /*    32     8 */
			__le64     EndOfFile;            /*    40     8 */
			__le32     Attributes;           /*    48     4 */
		} __attribute__((__packed__)) network_open_info;       /*     0    52 */
	};                                               /*     0    52 */
	__le32                     Reserved;             /*    52     4 */

	/* size: 56, cachelines: 1, members: 2 */
	/* last cacheline: 56 bytes */
};

pahole -C smb_com_open_rsp fs/smb/client/cifssmb.o
struct smb_com_open_rsp {
	...

	union {
		struct {
			__le64     CreationTime;         /*    48     8 */
			__le64     LastAccessTime;       /*    56     8 */
			/* --- cacheline 1 boundary (64 bytes) --- */
			__le64     LastWriteTime;        /*    64     8 */
			__le64     ChangeTime;           /*    72     8 */
			__le32     FileAttributes;       /*    80     4 */
		};                                       /*    48    40 */
		struct {
			__le64     CreationTime;         /*    48     8 */
			__le64     LastAccessTime;       /*    56     8 */
			/* --- cacheline 1 boundary (64 bytes) --- */
			__le64     LastWriteTime;        /*    64     8 */
			__le64     ChangeTime;           /*    72     8 */
			__le32     FileAttributes;       /*    80     4 */
		} common_attributes;                     /*    48    40 */
	};                                               /*    48    40 */

	...

	/* size: 111, cachelines: 2, members: 14 */
	/* last cacheline: 47 bytes */
} __attribute__((__packed__));

pahole -C smb_com_open_rsp fs/smb/client/cifssmb.o
struct smb_com_open_rsp {
	...

	union {
		struct {
			__le64     CreationTime;         /*    48     8 */
			__le64     LastAccessTime;       /*    56     8 */
			/* --- cacheline 1 boundary (64 bytes) --- */
			__le64     LastWriteTime;        /*    64     8 */
			__le64     ChangeTime;           /*    72     8 */
			__le32     FileAttributes;       /*    80     4 */
		} __attribute__((__packed__));           /*    48    36 */
		struct {
			__le64     CreationTime;         /*    48     8 */
			__le64     LastAccessTime;       /*    56     8 */
			/* --- cacheline 1 boundary (64 bytes) --- */
			__le64     LastWriteTime;        /*    64     8 */
			__le64     ChangeTime;           /*    72     8 */
			__le32     FileAttributes;       /*    80     4 */
		} __attribute__((__packed__)) common_attributes;       /*    48    36 */
	};                                               /*    48    36 */

	...

	/* size: 107, cachelines: 2, members: 14 */
	/* last cacheline: 43 bytes */
} __attribute__((__packed__));

pahole -C FILE_ALL_INFO fs/smb/client/cifssmb.o
typedef struct {
	union {
		struct {
			__le64     CreationTime;         /*     0     8 */
			__le64     LastAccessTime;       /*     8     8 */
			__le64     LastWriteTime;        /*    16     8 */
			__le64     ChangeTime;           /*    24     8 */
			__le32     Attributes;           /*    32     4 */
		};                                       /*     0    40 */
		struct {
			__le64     CreationTime;         /*     0     8 */
			__le64     LastAccessTime;       /*     8     8 */
			__le64     LastWriteTime;        /*    16     8 */
			__le64     ChangeTime;           /*    24     8 */
			__le32     Attributes;           /*    32     4 */
		} common_attributes;                     /*     0    40 */
	};                                               /*     0    40 */

	...

	/* size: 113, cachelines: 2, members: 17 */
	/* last cacheline: 49 bytes */
} __attribute__((__packed__)) FILE_ALL_INFO;

pahole -C FILE_ALL_INFO fs/smb/client/cifssmb.o
typedef struct {
	union {
		struct {
			__le64     CreationTime;         /*     0     8 */
			__le64     LastAccessTime;       /*     8     8 */
			__le64     LastWriteTime;        /*    16     8 */
			__le64     ChangeTime;           /*    24     8 */
			__le32     Attributes;           /*    32     4 */
		} __attribute__((__packed__));           /*     0    36 */
		struct {
			__le64     CreationTime;         /*     0     8 */
			__le64     LastAccessTime;       /*     8     8 */
			__le64     LastWriteTime;        /*    16     8 */
			__le64     ChangeTime;           /*    24     8 */
			__le32     Attributes;           /*    32     4 */
		} __attribute__((__packed__)) common_attributes;       /*     0    36 */
	};                                               /*     0    36 */

	...

	/* size: 109, cachelines: 2, members: 17 */
	/* last cacheline: 45 bytes */
} __attribute__((__packed__)) FILE_ALL_INFO;

Fixes: 0015eb6e12 ("smb: client, common: fix fortify warnings")
Cc: stable@vger.kernel.org
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-24 01:06:13 -05:00
Chuck Lever
8ddb7142c8 Revert "NFSD: Convert the callback workqueue to use delayed_work"
This commit was a pre-requisite for commit c1ccfcf1a9 ("NFSD:
Reschedule CB operations when backchannel rpc_clnt is shut down"),
which has already been reverted.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-23 20:12:41 -04:00
Chuck Lever
9c8ecb9308 Revert "NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down"
The reverted commit attempted to enable NFSD to retransmit pending
callback operations if an NFS client disconnects, but
unintentionally introduces a hazardous behavior regression if the
client becomes permanently unreachable while callback operations are
still pending.

A disconnect can occur due to network partition or if the NFS server
needs to force the NFS client to retransmit (for example, if a GSS
window under-run occurs).

Reverting the commit will make NFSD behave the same as it did in
v6.8 and before. Pending callback operations are permanently lost if
the client connection is terminated before the client receives them.

For some callback operations, this loss is not harmful.

However, for CB_RECALL, the loss means a delegation might be revoked
unnecessarily. For CB_OFFLOAD, pending COPY operations will never
complete unless the NFS client subsequently sends an OFFLOAD_STATUS
operation, which the Linux NFS client does not currently implement.

These issues still need to be addressed somehow.

Reported-by: Dai Ngo <dai.ngo@oracle.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218735
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-23 20:12:34 -04:00
Alexander Aring
7b72ab2c6a dlm: return -ENOMEM if ls_recover_buf fails
This patch fixes to return -ENOMEM in case of an allocation failure that
was forgotten to change in commit 6c648035cb ("dlm: switch to use
rhashtable for rsbs").

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202404200536.jGi6052v-lkp@intel.com/
Fixes: 6c648035cb ("dlm: switch to use rhashtable for rsbs")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-23 16:08:55 -05:00
Linus Torvalds
9d1ddab261 smb3 client fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmYm8oEACgkQiiy9cAdy
 T1HAVAv/T9NXYhTYQK8EK7DvzcGDV4dOcFI6GVsrinp1KHGyuxoPABKGctQXfou0
 4DJ9ik0CYFWmVuz8CODxmHkJq8fclnkJtVPH4DjcOm8CqhcaMJNVgEWP6eMGevGo
 sQPEtFAYRPQrIm5X8u2uvARE490YiUD85Se+6LrYZdzt/BOUQQHtKodSTre1ZCAV
 F6GLEVVuncP9iqno9lNu1EI8ltcOW6i1i15s4HOmULNwtUKdwsYTWVqW6JDOy5gM
 9YqXJhPobVcZnY/m8QVWfE/lEPOwJO5lbRF4Ktykz4PQkxZqD6t+Noesj73GKEgC
 7jnt3L79s1zA51gHdn96Z1qWlaIruX4ugYhfzQAW7PfYmSUr3I8G09ofzwFc77aH
 osYoU6mZWWL+4/RIKK3DGYRe2ET68KmlNheG2OTOeCNfwjSoI5rWHSLMGqlegTc0
 40gz62OHuwncMyKFLwGaf6ztzlOLHIdVU4uhRv1/9ptZuMl/2LoiN7G8ZddbtYIS
 JxL/kIEE
 =PaAG
 -----END PGP SIGNATURE-----

Merge tag '6.9-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - fscache fix

 - fix for case where we could use uninitialized lease

 - add tracepoint for debugging refcounting of tcon

 - fix mount option regression (e.g. forceuid vs. noforceuid when uid=
   specified) caused by conversion to the new mount API

* tag '6.9-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: reinstate original behavior again for forceuid/forcegid
  smb: client: fix rename(2) regression against samba
  cifs: Add tracing for the cifs_tcon struct refcounting
  cifs: Fix reacquisition of volume cookie on still-live connection
2024-04-23 09:37:32 -07:00
David Howells
619606a7b8
netfs: Fix writethrough-mode error handling
Fix the error return in netfs_perform_write() acting in writethrough-mode
to return any cached error in the case that netfs_end_writethrough()
returns 0.

This can affect the use of O_SYNC/O_DSYNC/RWF_SYNC/RWF_DSYNC in 9p and afs.

Fixes: 41d8e7673a ("netfs: Implement a write-through caching option")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/6736.1713343639@warthog.procyon.org.uk
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-23 09:39:07 +02:00
Christian Brauner
9b872cc50d
ntfs3: add legacy ntfs file operations
To ensure that ioctl()s can't be used to circumvent write restrictions.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-23 09:39:07 +02:00
Christian Brauner
d55f90e9b2
ntfs3: enforce read-only when used as legacy ntfs driver
Ensure that ntfs3 is mounted read-only when it is used to provide the
legacy ntfs driver.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-23 09:39:07 +02:00
Linus Torvalds
71b1543c83 five ksmbd server fixes, most also for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmYmfyAACgkQiiy9cAdy
 T1H8CQv/RlvQNbfSc9V5/vUPqfEWhx+5EyFoMFPZmo2HmtxT4BI8sEICUQcC6h+Y
 L2li6ihmsF1DnxDxsFwKAhQ9nTpYG4LoBbljk/j/N+sKhUFE3ZLWxZyoEAbWWfeh
 UJt8ZuGTgbc8nyPZ1s2d5oZy37PNlGd7CilLtP5HQyEW3l8EHg0qQOp98cibh0mm
 +kS2vilb4nIlapblTq2uFDcBTolXRSjTlY8eWC9DNvWdjTJLdTEjAurqeuN71FA0
 +XXnA9EYE9MDEkCqnQdZEwGvR8ZKy9sR4cqCUGZO/tNz/KfOGfjJkrEf+OEj37Af
 I5ahxUKrYwBEC/6gaFx8i1ZiivdDOyIaM7b+rpazpiV1sCCbxlM18TlDSyODFnUC
 nqEABR3slx/oZ4xYcFuxjzN0L9ZS7+0NQ1VQovFLlNqW7f9qFdUq0JKqfNDvFOm4
 tgi5/lCnlMco0ts6h0HIw5LrwDi7HJWbYf1Fmv/1jC2Ed5Ut0GCMNpSC9WQxDo+t
 G4m8ul1S
 =ZizJ
 -----END PGP SIGNATURE-----

Merge tag '6.9-rc5-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:
 "Five ksmbd server fixes, most also for stable:

   - rename fix

   - two fixes for potential out of bounds

   - fix for connections from MacOS (padding in close response)

   - fix for when to enable persistent handles"

* tag '6.9-rc5-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: add continuous availability share parameter
  ksmbd: common: use struct_group_attr instead of struct_group for network_open_info
  ksmbd: clear RENAME_NOREPLACE before calling vfs_rename
  ksmbd: validate request buffer size in smb2_allocate_rsp_buf()
  ksmbd: fix slab-out-of-bounds in smb2_allocate_rsp_buf
2024-04-22 16:28:31 -07:00
Linus Torvalds
a2c63a3f3d bcachefs fixes for 6.9-rc6
- fix a few more deadlocks in recovery
 - fix u32/u64 issues in mi_btree_bitmap
 - btree key cache shrinker now actually frees, with more instrumentation
   coming so we can verify that it's working correctly more easily in the
   future
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmYmveYACgkQE6szbY3K
 bnY+kA//dtdKfPliuQlfYhUvIZSvgEy+KxjDaDmeJFVMKFYHiip/JJt7V/YU7jWW
 DmWGtPqo1hoGZnic7h9fstbBCRgUdIYGBInqAWPzL3wmFYe5FPE02KN5fuZ+A+Mp
 sn0QFML4oA0uxD7TRXfCeNx3NRwXonztletVskXCuLa0T8iTOTdOuAH0MCGow3OC
 mlfRZyK05f6Q0UOIzvntfBl8Tkr+yk5g0GFz54U7s+qs/zJsYiYfruXuq6AxLTy2
 xVMDj3Hc6W0vggVpv68HInluubl/b7rVSy+w59GG0D3iQ9/fBdqitFLdw49ReXzi
 J//ctZLb3n+IM4lA7t5ev0lY7bvI2FwFNkrL4qW41E4un5eQ3ghbyOmMoz87svyg
 4JW/CPGP7uKNVmfRuHn1qhgJ9/vIXkObJVl9GKZF2BylaZwjMM5YrL5MZwrKFQYy
 9BMgemgvHFK+wRi74q/OUu3PyH045AoJdIKI66ypFexhGi5YeNqtRHLUKdfSrJR+
 eEkAUaHcgLLfxyk+fIRvcSK+Q9j3BibvsSkU3vLSnl2B+xdvfJqb+I/yvBt/SZQW
 P09JceDQABRBMu9beVbVqMED+PniSZVfG2eU2jBcZ+jhbGQgmiWfMNJVKS4/0uwz
 PRS33P2mViVZ6PJokWyecbgGtVxKrK2ruPdcu6/W05Qi0Vv58+k=
 =lCNd
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-04-22' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Nothing too crazy in this one, and it looks like (fingers crossed) the
  recovery and repair issues are settling down - although there's going
  to be a long tail there, as we've still yet to really ramp up on error
  injection or syzbot.

   - fix a few more deadlocks in recovery

   - fix u32/u64 issues in mi_btree_bitmap

   - btree key cache shrinker now actually frees, with more
     instrumentation coming so we can verify that it's working
     correctly more easily in the future"

* tag 'bcachefs-2024-04-22' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: If we run merges at a lower watermark, they must be nonblocking
  bcachefs: Fix inode early destruction path
  bcachefs: Fix deadlock in journal write path
  bcachefs: Tweak btree key cache shrinker so it actually frees
  bcachefs: bkey_cached.btree_trans_barrier_seq needs to be a ulong
  bcachefs: Fix missing call to bch2_fs_allocator_background_exit()
  bcachefs: Check for journal entries overruning end of sb clean section
  bcachefs: Fix bio alloc in check_extent_checksum()
  bcachefs: fix leak in bch2_gc_write_reflink_key
  bcachefs: KEY_TYPE_error is allowed for reflink
  bcachefs: Fix bch2_dev_btree_bitmap_marked_sectors() shift
  bcachefs: make sure to release last journal pin in replay
  bcachefs: node scan: ignore multiple nodes with same seq if interior
  bcachefs: Fix format specifier in validate_bset_keys()
  bcachefs: Fix null ptr deref in twf from BCH_IOCTL_FSCK_OFFLINE
2024-04-22 13:53:50 -07:00
Takayuki Nagata
77d8aa79ec cifs: reinstate original behavior again for forceuid/forcegid
forceuid/forcegid should be enabled by default when uid=/gid= options are
specified, but commit 24e0a1eff9 ("cifs: switch to new mount api")
changed the behavior. Due to the change, a mounted share does not show
intentional uid/gid for files and directories even though uid=/gid=
options are specified since forceuid/forcegid are not enabled.

This patch reinstates original behavior that overrides uid/gid with
specified uid/gid by the options.

Fixes: 24e0a1eff9 ("cifs: switch to new mount api")
Signed-off-by: Takayuki Nagata <tnagata@redhat.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-22 10:50:31 -05:00
Eric Van Hensbergen
d05dcfdf5e
fs/9p: mitigate inode collisions
Detect and mitigate inode collsions that now occur since we
fixed 9p generating duplicate inode structures.  Underlying
cause of these appears to be a race condition between reuse
of inode numbers in underlying file system and cleanup of
inode numbers in the client.  Enabling caching
makes this much more likely to happen as it increases cleanup
latency due to writebacks.

Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-04-22 15:34:27 +00:00
Amir Goldstein
aef8acd79f fuse: verify zero padding in fuse_backing_map
To allow us extending the interface in the future.

Fixes: 44350256ab ("fuse: implement ioctls to manage backing files")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-04-22 17:13:43 +02:00
Kent Overstreet
e858beeddf bcachefs: If we run merges at a lower watermark, they must be nonblocking
Fix another deadlock related to the merge path; previously, we switched
to always running merges at a lower watermark (because they are
noncritical); but when we run at a lower watermark we also need to run
nonblocking or we've introduced a new deadlock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-and-tested-by: s@m-h.ug
2024-04-22 01:26:51 -04:00
Linus Torvalds
4e90ba757b Kernfs bugfix and documentation update for 6.9-rc5
Here are 2 changes for 6.9-rc5 that deal with "driver core" stuff, that
 do the following:
   - sysfs reference leak fix
   - embargoed-hardware-issues.rst update for Power
 
 Both of these have been in linux-next for over a week with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZiT4Dw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymFwQCfR1OysT/aO16NUYBSaGd7Tx4/3dIAn3YDU7O7
 BvGCYc/Nv7S7WdmA5KKf
 =RuJo
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull kernfs bugfix and documentation update from Greg KH:
 "Here are two changes for 6.9-rc5 that deal with "driver core" stuff,
  that do the following:

   - sysfs reference leak fix

   - embargoed-hardware-issues.rst update for Power

  Both of these have been in linux-next for over a week with no reported
  issues"

* tag 'driver-core-6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  Documentation: embargoed-hardware-issues.rst: Add myself for Power
  fs: sysfs: Fix reference leak in sysfs_break_active_protection()
2024-04-21 10:30:21 -07:00
Kent Overstreet
0e42f38119 bcachefs: Fix inode early destruction path
discard_new_inode() is the wrong interface to use when we need to free
an inode that was never inserted into the inode hash table; we can
bypass the whole iput() -> evict() path and replace it with
__destroy_inode(); kmem_cache_free() - this fixes a WARN_ON() about
I_NEW.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-20 23:00:59 -04:00
Kent Overstreet
85ab365f7c bcachefs: Fix deadlock in journal write path
bch2_journal_write() was incorrectly waiting on earlier journal writes
synchronously; this usually worked because most of the time we'd be
running in the context of a thread that did a journal_buf_put(), but
sometimes we'd be running out of the same workqueue that completes those
prior journal writes.

Additionally, this makes sure to punt to a workqueue before submitting
preflushes - we really don't want to be calling submit_bio() in the main
transaction commit path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-20 23:00:59 -04:00
Kent Overstreet
adfe9357c3 bcachefs: Tweak btree key cache shrinker so it actually frees
Freeing key cache items is a multi stage process; we need to wait for an
SRCU grace period to elapse, and we handle this ourselves - partially to
avoid callback overhead, but primarily so that when allocating we can
first allocate from the freed items waiting for an SRCU grace period.

Previously, the shrinker was counting the items on the 'waiting for SRCU
grace period' lists as items being scanned, but this meant that too many
items waiting for an SRCU grace period could prevent it from doing any
work at all.

After this, we're seeing that items skipped due to the accessed bit are
the main cause of the shrinker not making any progress, and we actually
want the key cache shrinker to run quite aggressively because reclaimed
items will still generally be found (more compactly) in the btree node
cache - so we also tweak the shrinker to not count those against
nr_to_scan.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-20 17:06:20 -04:00
Kent Overstreet
6e4d9bd110 bcachefs: bkey_cached.btree_trans_barrier_seq needs to be a ulong
this stores the SRCU sequence number, which we use to check if an SRCU
barrier has elapsed; this is a partial fix for the key cache shrinker
not actually freeing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-20 15:15:51 -04:00
Kent Overstreet
ec438ac59d bcachefs: Fix missing call to bch2_fs_allocator_background_exit()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-20 00:31:59 -04:00
Kent Overstreet
fcdbc1d7a4 bcachefs: Check for journal entries overruning end of sb clean section
Fix a missing bounds check in superblock validation.

Note that we don't yet have repair code for this case - repair code for
individual items is generally low priority, since the whole superblock
is checksummed, validated prior to write, and we have backups.

Reported-by: lei lu <llfamsec@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-20 00:16:53 -04:00
Namjae Jeon
e9d8c2f95a ksmbd: add continuous availability share parameter
If capabilities of the share is not SMB2_SHARE_CAP_CONTINUOUS_AVAILABILITY,
ksmbd should not grant a persistent handle to the client.
This patch add continuous availability share parameter to control it.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19 20:48:47 -05:00
Namjae Jeon
0268a7cc7f ksmbd: common: use struct_group_attr instead of struct_group for network_open_info
4byte padding cause the connection issue with the applications of MacOS.
smb2_close response size increases by 4 bytes by padding, And the smb
client of MacOS check it and stop the connection. This patch use
struct_group_attr instead of struct_group for network_open_info to use
 __packed to avoid padding.

Fixes: 0015eb6e12 ("smb: client, common: fix fortify warnings")
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19 20:48:47 -05:00
Marios Makassikis
4973b04d3e ksmbd: clear RENAME_NOREPLACE before calling vfs_rename
File overwrite case is explicitly handled, so it is not necessary to
pass RENAME_NOREPLACE to vfs_rename.

Clearing the flag fixes rename operations when the share is a ntfs-3g
mount. The latter uses an older version of fuse with no support for
flags in the ->rename op.

Cc: stable@vger.kernel.org
Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19 20:48:47 -05:00
Namjae Jeon
17cf0c2794 ksmbd: validate request buffer size in smb2_allocate_rsp_buf()
The response buffer should be allocated in smb2_allocate_rsp_buf
before validating request. But the fields in payload as well as smb2 header
is used in smb2_allocate_rsp_buf(). This patch add simple buffer size
validation to avoid potencial out-of-bounds in request buffer.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19 20:48:47 -05:00
Namjae Jeon
c119f4ede3 ksmbd: fix slab-out-of-bounds in smb2_allocate_rsp_buf
If ->ProtocolId is SMB2_TRANSFORM_PROTO_NUM, smb2 request size
validation could be skipped. if request size is smaller than
sizeof(struct smb2_query_info_req), slab-out-of-bounds read can happen in
smb2_allocate_rsp_buf(). This patch allocate response buffer after
decrypting transform request. smb3_decrypt_req() will validate transform
request size and avoid slab-out-of-bound in smb2_allocate_rsp_buf().

Reported-by: Norbert Szetei <norbert@doyensec.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19 20:48:47 -05:00
Paulo Alcantara
18d86965e3 smb: client: fix rename(2) regression against samba
After commit 2c7d399e55 ("smb: client: reuse file lease key in
compound operations") the client started reusing lease keys for
rename, unlink and set path size operations to prevent it from
breaking its own leases and thus causing unnecessary lease breaks to
same connection.

The implementation relies on positive dentries and
cifsInodeInfo::lease_granted to decide whether reusing lease keys for
the compound requests.  cifsInodeInfo::lease_granted was introduced by
commit 0ab95c2510 ("Defer close only when lease is enabled.") to
indicate whether lease caching is granted for a specific file, but
that can only happen until file is open, so
cifsInodeInfo::lease_granted was left uninitialised in ->alloc_inode
and then client started sending random lease keys for files that
hadn't any leases.

This fixes the following test case against samba:

mount.cifs //srv/share /mnt/1 -o ...,nosharesock
mount.cifs //srv/share /mnt/2 -o ...,nosharesock
touch /mnt/1/foo; tail -f /mnt/1/foo & pid=$!
mv /mnt/2/foo /mnt/2/bar # fails with -EIO
kill $pid

Fixes: 0ab95c2510 ("Defer close only when lease is enabled.")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19 16:02:45 -05:00
David Howells
afc23febd5 cifs: Add tracing for the cifs_tcon struct refcounting
Add tracing for the refcounting/lifecycle of the cifs_tcon struct, marking
different events with different labels and giving each tcon its own debug
ID so that the tracelines corresponding to individual tcons can be
distinguished.  This can be enabled with:

	echo 1 >/sys/kernel/debug/tracing/events/cifs/smb3_tcon_ref/enable

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19 16:02:09 -05:00
David Howells
dad80c6bff cifs: Fix reacquisition of volume cookie on still-live connection
During mount, cifs_mount_get_tcon() gets a tcon resource connection record
and then attaches an fscache volume cookie to it.  However, it does this
irrespective of whether or not the tcon returned from cifs_get_tcon() is a
new record or one that's already in use.  This leads to a warning about a
volume cookie collision and a leaked volume cookie because tcon->fscache
gets reset.

Fix this be adding a mutex and a "we've already tried this" flag and only
doing it once for the lifetime of the tcon.

[!] Note: Looking at cifs_mount_get_tcon(), a more general solution may
actually be required.  Reacquiring the volume cookie isn't the only thing
that function does: it also partially reinitialises the tcon record without
any locking - which may cause live filesystem ops already using the tcon
through a previous mount to malfunction.

This can be reproduced simply by something like:

    mount //example.com/test /xfstest.test -o user=shares,pass=xxx,fsc
    mount //example.com/test /mnt -o user=shares,pass=xxx,fsc

Fixes: 70431bfd82 ("cifs: Support fscache indexing rewrite")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19 15:37:47 -05:00
Linus Torvalds
46b28503cd fs/9p: fixes regressions in 6.9
This series contains a reversion of one of the original 6.9
 patches which seems to have been the cause of most of the
 instability.  It also incorporates several fixes to legacy
 support and cache fixes.
 
 There are few additional changes to improve stability,
 but I want another week of testing before sending them
 upstream.
 
 Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmYiiQ0ACgkQiP/V+0pf
 /5hrCw//aJwdNAimpwPrc5UfE4Q37igQeXoT29VJbkOBO78rZ2cNgd3EFpgC2UES
 RFJejQ/IQlEkpqbHiMHIyCii2MmWGT0xzePLf3nUZW/qmoUvhvXlPG5OZb0FomXY
 gxCRFuUgegNcK3t3LtFAVn7v6NpXtOfLAgJb3MDIFP8WsCuN863pQcJCwn4aSuKc
 C1ct2tLaaIeZSAy68xytqDwRXslMGaKUp7ygBzpyaIIEqy2l9H8NRKQ8Cmg+vyKF
 2+zu3fNYIGIS3KflUtcTQDZ9IVtp/YxN7QXchZ56nnD5PFy9L9GgvBecZ0i8zzoZ
 XFmzyp7HLwyBA8oNmmEJWMz93iwx61mePxOzPu2n1VfqWRTgFp/kd3KrFKWLfHvw
 NoPGbneAhtwifKCNkxAmX6aCvnTZ18j9nds8WbRcuLRbTF0hHfkI36+vgoRWebaA
 su673A0fnFFe64EEnOLjlnAa0V8CL26V2rX2Mi2Kjaw6emc1Yz5HDnGYjckKIlvS
 fZjlfP1dtqzBecXvBLIuMQKfygpRJD83sEni+rGtAN1FKVP8eKz+/ZcyAG5xqcrZ
 dnDXBegjhieqyz4q9vykxTLmYKEKd4fqbhhjZQ3PStyXgc6iFVKvD41akSdxR6ob
 3oujNYblpkJVhHCcO+H4dWa7tznB7hqd9xv2Jerx4cKTdd9uIik=
 =M4CO
 -----END PGP SIGNATURE-----

Merge tag '9p-fixes-for-6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull fs/9p fixes from Eric Van Hensbergen:
 "This contains a reversion of one of the original 6.9 patches which
  seems to have been the cause of most of the instability. It also
  incorporates several fixes to legacy support and cache fixes.

  There are few additional changes to improve stability, but I want
  another week of testing before sending them upstream"

* tag '9p-fixes-for-6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: drop inodes immediately on non-.L too
  fs/9p: Revert "fs/9p: fix dups even in uncached mode"
  fs/9p: remove erroneous nlink init from legacy stat2inode
  9p: explicitly deny setlease attempts
  fs/9p: fix the cache always being enabled on files with qid flags
  fs/9p: translate O_TRUNC into OTRUNC
  fs/9p: only translate RWX permissions for plain 9P2000
2024-04-19 13:36:28 -07:00
Linus Torvalds
daa757767d fuse fixes for 6.9-rc5
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCZiJcTAAKCRDh3BK/laaZ
 PK1QAP9u/S7GYKDj0k58xOVAof2x/q0puHWXoObRma+bPmeoeQEA2+K+vlnTJHub
 kLRURaTCzGyFfL+CB/JQ4Kv4tDF5qQc=
 =Eoob
 -----END PGP SIGNATURE-----

Merge tag 'fuse-fixes-6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse fixes from Miklos Szeredi:

 - Fix two bugs in the new passthrough mode

 - Fix a statx bug introduced in v6.6

 - Fix code documentation

* tag 'fuse-fixes-6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  cuse: add kernel-doc comments to cuse_process_init_reply()
  fuse: fix leaked ENOSYS error on first statx call
  fuse: fix parallel dio write on file open in passthrough mode
  fuse: fix wrong ff->iomode state changes from parallel dio write
2024-04-19 13:16:10 -07:00
Linus Torvalds
54c23548e0 15 hotfixes. 9 are cc:stable and the remainder address post-6.8 issues
or aren't considered suitable for backporting.
 
 There are a significant number of fixups for this cycle's page_owner
 changes (series "page_owner: print stacks and their outstanding
 allocations").  Apart from that, singleton changes all over, mainly in MM.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZiGTewAKCRDdBJ7gKXxA
 jt1QAP9QxiU/+gUMVjkHyKaMBHSBMD/CWBFjDfRjx+BPqYx55gD+JWxUXwlyVkMo
 Z8fqtCGEgatev1VbwpCwByhvnH9bKgw=
 =YBZ9
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-04-18-14-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "15 hotfixes. 9 are cc:stable and the remainder address post-6.8 issues
  or aren't considered suitable for backporting.

  There are a significant number of fixups for this cycle's page_owner
  changes (series "page_owner: print stacks and their outstanding
  allocations"). Apart from that, singleton changes all over, mainly in
  MM"

* tag 'mm-hotfixes-stable-2024-04-18-14-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  nilfs2: fix OOB in nilfs_set_de_type
  MAINTAINERS: update Naoya Horiguchi's email address
  fork: defer linking file vma until vma is fully initialized
  mm/shmem: inline shmem_is_huge() for disabled transparent hugepages
  mm,page_owner: defer enablement of static branch
  Squashfs: check the inode number is not the invalid value of zero
  mm,swapops: update check in is_pfn_swap_entry for hwpoison entries
  mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled
  mm/userfaultfd: allow hugetlb change protection upon poison entry
  mm,page_owner: fix printing of stack records
  mm,page_owner: fix accounting of pages when migrating
  mm,page_owner: fix refcount imbalance
  mm,page_owner: update metadata for tail pages
  userfaultfd: change src_folio after ensuring it's unpinned in UFFDIO_MOVE
  mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly
2024-04-19 09:13:35 -07:00
Jakub Kicinski
41e3ddb291 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/trace/events/rpcgss.h
  386f4a7379 ("trace: events: cleanup deprecated strncpy uses")
  a4833e3aba ("SUNRPC: Fix rpcgss_context trace event acceptor field")

Adjacent changes:

drivers/net/ethernet/intel/ice/ice_tc_lib.c
  2cca35f5dd ("ice: Fix checking for unsupported keys on non-tunnel device")
  784feaa65d ("ice: Add support for PFCP hardware offload in switchdev")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-18 13:12:24 -07:00
Qu Wenruo
fe1c6c7acc btrfs: fix wrong block_start calculation for btrfs_drop_extent_map_range()
[BUG]
During my extent_map cleanup/refactor, with extra sanity checks,
extent-map-tests::test_case_7() would not pass the checks.

The problem is, after btrfs_drop_extent_map_range(), the resulted
extent_map has a @block_start way too large.
Meanwhile my btrfs_file_extent_item based members are returning a
correct @disk_bytenr/@offset combination.

The extent map layout looks like this:

     0        16K    32K       48K
     | PINNED |      | Regular |

The regular em at [32K, 48K) also has 32K @block_start.

Then drop range [0, 36K), which should shrink the regular one to be
[36K, 48K).
However the @block_start is incorrect, we expect 32K + 4K, but got 52K.

[CAUSE]
Inside btrfs_drop_extent_map_range() function, if we hit an extent_map
that covers the target range but is still beyond it, we need to split
that extent map into half:

	|<-- drop range -->|
		 |<----- existing extent_map --->|

And if the extent map is not compressed, we need to forward
extent_map::block_start by the difference between the end of drop range
and the extent map start.

However in that particular case, the difference is calculated using
(start + len - em->start).

The problem is @start can be modified if the drop range covers any
pinned extent.

This leads to wrong calculation, and would be caught by my later
extent_map sanity checks, which checks the em::block_start against
btrfs_file_extent_item::disk_bytenr + btrfs_file_extent_item::offset.

This is a regression caused by commit c962098ca4 ("btrfs: fix
incorrect splitting in btrfs_drop_extent_map_range"), which removed the
@len update for pinned extents.

[FIX]
Fix it by avoiding using @start completely, and use @end - em->start
instead, which @end is exclusive bytenr number.

And update the test case to verify the @block_start to prevent such
problem from happening.

Thankfully this is not going to lead to any data corruption, as IO path
does not utilize btrfs_drop_extent_map_range() with @skip_pinned set.

So this fix is only here for the sake of consistency/correctness.

CC: stable@vger.kernel.org # 6.5+
Fixes: c962098ca4 ("btrfs: fix incorrect splitting in btrfs_drop_extent_map_range")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-18 18:18:50 +02:00
Johannes Thumshirn
2f7ef5bb4a btrfs: fix information leak in btrfs_ioctl_logical_to_ino()
Syzbot reported the following information leak for in
btrfs_ioctl_logical_to_ino():

  BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline]
  BUG: KMSAN: kernel-infoleak in _copy_to_user+0xbc/0x110 lib/usercopy.c:40
   instrument_copy_to_user include/linux/instrumented.h:114 [inline]
   _copy_to_user+0xbc/0x110 lib/usercopy.c:40
   copy_to_user include/linux/uaccess.h:191 [inline]
   btrfs_ioctl_logical_to_ino+0x440/0x750 fs/btrfs/ioctl.c:3499
   btrfs_ioctl+0x714/0x1260
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:904 [inline]
   __se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
   __x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
   x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  Uninit was created at:
   __kmalloc_large_node+0x231/0x370 mm/slub.c:3921
   __do_kmalloc_node mm/slub.c:3954 [inline]
   __kmalloc_node+0xb07/0x1060 mm/slub.c:3973
   kmalloc_node include/linux/slab.h:648 [inline]
   kvmalloc_node+0xc0/0x2d0 mm/util.c:634
   kvmalloc include/linux/slab.h:766 [inline]
   init_data_container+0x49/0x1e0 fs/btrfs/backref.c:2779
   btrfs_ioctl_logical_to_ino+0x17c/0x750 fs/btrfs/ioctl.c:3480
   btrfs_ioctl+0x714/0x1260
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:904 [inline]
   __se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
   __x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
   x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  Bytes 40-65535 of 65536 are uninitialized
  Memory access of size 65536 starts at ffff888045a40000

This happens, because we're copying a 'struct btrfs_data_container' back
to user-space. This btrfs_data_container is allocated in
'init_data_container()' via kvmalloc(), which does not zero-fill the
memory.

Fix this by using kvzalloc() which zeroes out the memory on allocation.

CC: stable@vger.kernel.org # 4.14+
Reported-by:  <syzbot+510a1abbb8116eeb341d@syzkaller.appspotmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <Johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-18 18:18:13 +02:00
Linus Torvalds
8cd26fd90c for-6.9-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYgXDMACgkQxWXV+ddt
 WDsPpg//RzpLGyfFVFx+AdqIPScBvDSr6RIQAug++4OmDbIRMxzOpxKOAWThhivf
 78KIms2fj9R/zLJEdUGCLQTcy8a1eWBnoeSzXoeTta2pip5cKrc9v3hJId53l0F6
 BfltbVjpAKt6XHqeI0V2myrL/KHx5bApz5oNn/oEQCwiA2HBkasrYTRLEA7xMem2
 hRUIXrTuIdwiyWugi84xjp9D0BxEdbTBfH6SR6RG4ESy+73gdEt4BAeDI6DzWN+D
 eKUv/CthhrP7xuO8Aq9XGkwznP7lIeIwBCiV5XURLR0HztFm64vXgbPQHhwqvI43
 5uhA7wifc/VE8nOysubfET6MwVEeyOptW6+25ih/9Da9VLxRK1y/Hm94JW8t6Sxi
 VPgT5gz4YuE5/QaojETDLYgkkjKj7Lpe/Bs225J3QBCHu3fs/tp9kHKbUNJrcAeM
 b56tiRMccLVpeoslbK4ahvQqCH4/LKBMdAqfWK5/p24JkYT/ubVP3CdLS2MOeRpV
 UqDpQExuWsVJZKH8znSXXrHf2ZMYHmlA/1gRqdEmcvPF8A2vCc9aMMZHTP7v57EC
 /80NJv9HQuxcUFQCl0h4WBlB+gGQtAszz+0q1X9aedauC6Hd/7LeICLCPRczJC3g
 rD3J+EXiTg2MxqZWyXJXQ1Q9cQWNkQjG6o/rEhl5r5c3OGWgssk=
 =ZKAP
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fixup in zoned mode for out-of-order writes of metadata that are no
   longer necessary, this used to be tracked in a separate list but now
   the old locaion needs to be zeroed out, also add assertions

 - fix bulk page allocation retry, this may stall after first failure
   for compression read/write

* tag 'for-6.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: do not wait for short bulk allocation
  btrfs: zoned: add ASSERT and WARN for EXTENT_BUFFER_ZONED_ZEROOUT handling
  btrfs: zoned: do not flag ZEROOUT on non-dirty extent buffer
2024-04-17 18:25:40 -07:00
Sweet Tea Dorminy
131a821a24 btrfs: fallback if compressed IO fails for ENOSPC
In commit b4ccace878 ("btrfs: refactor submit_compressed_extents()"), if
an async extent compressed but failed to find enough space, we changed
from falling back to an uncompressed write to just failing the write
altogether. The principle was that if there's not enough space to write
the compressed version of the data, there can't possibly be enough space
to write the larger, uncompressed version of the data.

However, this isn't necessarily true: due to fragmentation, there could
be enough discontiguous free blocks to write the uncompressed version,
but not enough contiguous free blocks to write the smaller but
unsplittable compressed version.

This has occurred to an internal workload which relied on write()'s
return value indicating there was space. While rare, it has happened a
few times.

Thus, in order to prevent early ENOSPC, re-add a fallback to
uncompressed writing.

Fixes: b4ccace878 ("btrfs: refactor submit_compressed_extents()")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Co-developed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-18 01:46:52 +02:00
Naohiro Aota
7192833c4e btrfs: scrub: run relocation repair when/only needed
When btrfs scrub finds an error, it reads mirrors to find correct data. If
all the errors are fixed, sctx->error_bitmap is cleared for the stripe
range. However, in the zoned mode, it runs relocation to repair scrub
errors when the bitmap is *not* empty, which is a flipped condition.

Also, it runs the relocation even if the scrub is read-only. This was
missed by a fix in commit 1f2030ff6e ("btrfs: scrub: respect the
read-only flag during repair").

The repair is only necessary when there is a repaired sector and should be
done on read-write scrub. So, tweak the condition for both regular and
zoned case.

Fixes: 54765392a1 ("btrfs: scrub: introduce helper to queue a stripe for scrub")
Fixes: 1f2030ff6e ("btrfs: scrub: respect the read-only flag during repair")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-18 01:46:47 +02:00
David Sterba
e5a78fdec0 btrfs: remove colon from messages with state
The message format in syslog is usually made of two parts:

  prefix ":" message

Various tools parse the prefix up to the first ":". When there's
an additional status of a btrfs filesystem like

  [5.199782] BTRFS info (device nvme1n1p1: state M): use zstd compression, level 9

where 'M' is for remount, there's one more ":" that does not conform to
the format. Remove it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-18 01:46:35 +02:00
Kent Overstreet
0389c09b2f bcachefs: Fix bio alloc in check_extent_checksum()
if the buffer is virtually mapped it won't be a single bvec

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-17 17:29:58 -04:00
Kent Overstreet
719aec84b1 bcachefs: fix leak in bch2_gc_write_reflink_key
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-17 17:29:58 -04:00
Kent Overstreet
605109ff5e bcachefs: KEY_TYPE_error is allowed for reflink
KEY_TYPE_error is left behind when we have to delete all pointers in an
extent in fsck; it allows errors to be correctly returned by reads
later.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-17 17:29:58 -04:00
Kent Overstreet
fa845c7349 bcachefs: Fix bch2_dev_btree_bitmap_marked_sectors() shift
Fixes: 27c15ed297 bcachefs: bch_member.btree_allocated_bitmap
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-17 17:29:53 -04:00
Alexander Aring
7b012732d0 dlm: fix sleep in atomic context
This patch changes the orphans mutex to a spinlock since commit
c288745f1d ("dlm: avoid blocking receive at the end of recovery") is
using a rwlock_t to lock the DLM message receive path and do_purge() can
be called while this lock is held that forbids to sleep.

We need to use spin_lock_bh() because also a user context that calls
dlm_user_purge() can call do_purge() and since commit 92d59adfaf
("dlm: do message processing in softirq context") the DLM message
receive path is done under softirq context.

Fixes: c288745f1d ("dlm: avoid blocking receive at the end of recovery")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/gfs2/9ad928eb-2ece-4ad9-a79c-d2bce228e4bc@moroto.mountain/
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-17 14:20:04 -05:00
Christian Brauner
193feb69af
Merge patch series 'Fix shmem_rename2 directory offset calculation' of https://lore.kernel.org/r/20240415152057.4605-1-cel@kernel.org
Pull shmem_rename2() offset fixes from Chuck Lever:

The existing code in shmem_rename2() allocates a fresh directory
offset value when renaming over an existing destination entry. User
space does not expect this behavior. In particular, applications
that rename while walking a directory can loop indefinitely because
they never reach the end of the directory.

* 'Fix shmem_rename2 directory offset calculation' of https://lore.kernel.org/r/20240415152057.4605-1-cel@kernel.org: (3 commits)
  shmem: Fix shmem_rename2()
  libfs: Add simple_offset_rename() API
  libfs: Fix simple_offset_rename_exchange()

 fs/libfs.c         | 55 +++++++++++++++++++++++++++++++++++++++++-----
 include/linux/fs.h |  2 ++
 mm/shmem.c         |  3 +--
 3 files changed, 52 insertions(+), 8 deletions(-)

Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17 13:49:56 +02:00
Chuck Lever
ad191eb6d6
shmem: Fix shmem_rename2()
When renaming onto an existing directory entry, user space expects
the replacement entry to have the same directory offset as the
original one.

Link: https://gitlab.alpinelinux.org/alpine/aports/-/issues/15966
Fixes: a2e459555c ("shmem: stable directory offsets")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20240415152057.4605-4-cel@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17 13:49:44 +02:00
Chuck Lever
5a1a25be99
libfs: Add simple_offset_rename() API
I'm about to fix a tmpfs rename bug that requires the use of
internal simple_offset helpers that are not available in mm/shmem.c

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20240415152057.4605-3-cel@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17 13:49:43 +02:00
Chuck Lever
23cdd0eed3
libfs: Fix simple_offset_rename_exchange()
User space expects the replacement (old) directory entry to have
the same directory offset after the rename.

Suggested-by: Christian Brauner <brauner@kernel.org>
Fixes: a2e459555c ("shmem: stable directory offsets")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20240415152057.4605-2-cel@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17 13:49:43 +02:00
Ilya Denisyev
c6854e5a26
jffs2: prevent xattr node from overflowing the eraseblock
Add a check to make sure that the requested xattr node size is no larger
than the eraseblock minus the cleanmarker.

Unlike the usual inode nodes, the xattr nodes aren't split into parts
and spread across multiple eraseblocks, which means that a xattr node
must not occupy more than one eraseblock. If the requested xattr value is
too large, the xattr node can spill onto the next eraseblock, overwriting
the nodes and causing errors such as:

jffs2: argh. node added in wrong place at 0x0000b050(2)
jffs2: nextblock 0x0000a000, expected at 0000b00c
jffs2: error: (823) do_verify_xattr_datum: node CRC failed at 0x01e050,
read=0xfc892c93, calc=0x000000
jffs2: notice: (823) jffs2_get_inode_nodes: Node header CRC failed
at 0x01e00c. {848f,2fc4,0fef511f,59a3d171}
jffs2: Node at 0x0000000c with length 0x00001044 would run over the
end of the erase block
jffs2: Perhaps the file system was created with the wrong erase size?
jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found
at 0x00000010: 0x1044 instead

This breaks the filesystem and can lead to KASAN crashes such as:

BUG: KASAN: slab-out-of-bounds in jffs2_sum_add_kvec+0x125e/0x15d0
Read of size 4 at addr ffff88802c31e914 by task repro/830
CPU: 0 PID: 830 Comm: repro Not tainted 6.9.0-rc3+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Arch Linux 1.16.3-1-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xc6/0x120
 print_report+0xc4/0x620
 ? __virt_addr_valid+0x308/0x5b0
 kasan_report+0xc1/0xf0
 ? jffs2_sum_add_kvec+0x125e/0x15d0
 ? jffs2_sum_add_kvec+0x125e/0x15d0
 jffs2_sum_add_kvec+0x125e/0x15d0
 jffs2_flash_direct_writev+0xa8/0xd0
 jffs2_flash_writev+0x9c9/0xef0
 ? __x64_sys_setxattr+0xc4/0x160
 ? do_syscall_64+0x69/0x140
 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
 [...]

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Fixes: aa98d7cf59 ("[JFFS2][XATTR] XATTR support on JFFS2 (version. 5)")
Signed-off-by: Ilya Denisyev <dev@elkcl.ru>
Link: https://lore.kernel.org/r/20240412155357.237803-1-dev@elkcl.ru
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17 13:21:34 +02:00
Kent Overstreet
79055f50a6 bcachefs: make sure to release last journal pin in replay
This fixes a deadlock when journal replay has many keys to insert that
were from fsck, not the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-16 19:14:01 -04:00
Kent Overstreet
fabb4d4985 bcachefs: node scan: ignore multiple nodes with same seq if interior
Interior nodes are not really needed, when we have to scan - but if this
pops up for leaf nodes we'll need a real heuristic.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-16 19:14:00 -04:00
Nathan Chancellor
9fd5a48a1e bcachefs: Fix format specifier in validate_bset_keys()
When building for 32-bit platforms, for which size_t is 'unsigned int',
there is a warning from a format string in validate_bset_keys():

  fs/bcachefs/btree_io.c: In function 'validate_bset_keys':
  fs/bcachefs/btree_io.c:891:34: error: format '%lu' expects argument of type 'long unsigned int', but argument 12 has type 'unsigned int' [-Werror=format=]
    891 |                                  "bad k->u64s %u (min %u max %lu)", k->u64s,
        |                                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  fs/bcachefs/btree_io.c:603:32: note: in definition of macro 'btree_err'
    603 |                                msg, ##__VA_ARGS__);                     \
        |                                ^~~
  fs/bcachefs/btree_io.c:887:21: note: in expansion of macro 'btree_err_on'
    887 |                 if (btree_err_on(!bkeyp_u64s_valid(&b->format, k),
        |                     ^~~~~~~~~~~~
  fs/bcachefs/btree_io.c:891:64: note: format string is defined here
    891 |                                  "bad k->u64s %u (min %u max %lu)", k->u64s,
        |                                                              ~~^
        |                                                                |
        |                                                                long unsigned int
        |                                                              %u
  cc1: all warnings being treated as errors

BKEY_U64s is size_t so the entire expression is promoted to size_t. Use
the '%zu' specifier so that there is no warning regardless of the width
of size_t.

Fixes: 031ad9e7db ("bcachefs: Check for packed bkeys that are too big")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404130747.wH6Dd23p-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202404131536.HdAMBOVc-lkp@intel.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-16 19:11:49 -04:00
Kent Overstreet
02bed83d59 bcachefs: Fix null ptr deref in twf from BCH_IOCTL_FSCK_OFFLINE
We need to initialize the stdio redirects before they're used.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-16 19:11:49 -04:00
Jeongjun Park
c4a7dc9523 nilfs2: fix OOB in nilfs_set_de_type
The size of the nilfs_type_by_mode array in the fs/nilfs2/dir.c file is
defined as "S_IFMT >> S_SHIFT", but the nilfs_set_de_type() function,
which uses this array, specifies the index to read from the array in the
same way as "(mode & S_IFMT) >> S_SHIFT".

static void nilfs_set_de_type(struct nilfs_dir_entry *de, struct inode
 *inode)
{
	umode_t mode = inode->i_mode;

	de->file_type = nilfs_type_by_mode[(mode & S_IFMT)>>S_SHIFT]; // oob
}

However, when the index is determined this way, an out-of-bounds (OOB)
error occurs by referring to an index that is 1 larger than the array size
when the condition "mode & S_IFMT == S_IFMT" is satisfied.  Therefore, a
patch to resize the nilfs_type_by_mode array should be applied to prevent
OOB errors.

Link: https://lkml.kernel.org/r/20240415182048.7144-1-konishi.ryusuke@gmail.com
Reported-by: syzbot+2e22057de05b9f3b30d8@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2e22057de05b9f3b30d8
Fixes: 2ba466d74e ("nilfs2: directory entry operations")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-16 15:39:52 -07:00
Phillip Lougher
9253c54e01 Squashfs: check the inode number is not the invalid value of zero
Syskiller has produced an out of bounds access in fill_meta_index().

That out of bounds access is ultimately caused because the inode
has an inode number with the invalid value of zero, which was not checked.

The reason this causes the out of bounds access is due to following
sequence of events:

1. Fill_meta_index() is called to allocate (via empty_meta_index())
   and fill a metadata index.  It however suffers a data read error
   and aborts, invalidating the newly returned empty metadata index.
   It does this by setting the inode number of the index to zero,
   which means unused (zero is not a valid inode number).

2. When fill_meta_index() is subsequently called again on another
   read operation, locate_meta_index() returns the previous index
   because it matches the inode number of 0.  Because this index
   has been returned it is expected to have been filled, and because
   it hasn't been, an out of bounds access is performed.

This patch adds a sanity check which checks that the inode number
is not zero when the inode is created and returns -EINVAL if it is.

[phillip@squashfs.org.uk: whitespace fix]
  Link: https://lkml.kernel.org/r/20240409204723.446925-1-phillip@squashfs.org.uk
Link: https://lkml.kernel.org/r/20240408220206.435788-1-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: "Ubisectech Sirius" <bugreport@ubisectech.com>
Closes: https://lore.kernel.org/lkml/87f5c007-b8a5-41ae-8b57-431e924c5915.bugreport@ubisectech.com/
Cc: Christian Brauner <brauner@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-16 15:39:50 -07:00
Alexander Aring
15fd7e5517 dlm: use rwlock for lkbidr
Convert the lock for lkbidr to an rwlock.  Most idr lookups will use
the read lock.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16 14:45:57 -05:00
Alexander Aring
e91313591b dlm: use rwlock for rsb hash table
The conversion to rhashtable introduced a hash table lock per lockspace,
in place of per bucket locks.  To make this more scalable, switch to
using a rwlock for hash table access.  The common case fast path uses
it as a read lock.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16 14:45:31 -05:00
Alexander Aring
b1f2381c1a dlm: drop dlm_scand kthread and use timers
Currently the scand kthread acts like a garbage collection for expired
rsbs on toss list, to clean them up after a certain timeout. It triggers
every couple of seconds and iterates over the toss list while holding
ls_rsbtbl_lock for the whole hash bucket iteration.

To reduce the amount of time holding ls_rsbtbl_lock, we now handle the
disposal of expired rsbs using a per-lockspace timer that expires for the
earliest tossed rsb on the lockspace toss queue. This toss queue is
ordered according to the rsb res_toss_time with the earliest tossed rsb
as the first entry. The toss timer will only trylock() necessary locks,
since it is low priority garbage collection, and will rearm the timer
if trylock() fails. If the timer function does not find any expired
rsb's, it rearms the timer with the next earliest expired rsb.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16 14:40:27 -05:00
Alexander Aring
6644925abf dlm: do not use ref counts for rsb in the toss state
In the past we had problems when an rsb had a reference counter greater
than one while in the toss state.  An rsb in the toss state is not
actively used for locking, and should not have any other references
apart from the single ref keeping it on the rsb hash.  Shift to freeing
rsb's directly rather than using kref_put to free them, since the ref
counting is not meant to be used in this state.  Add warnings if ref
counting is seen while an rsb is in the toss state.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16 14:34:39 -05:00
Alexander Aring
6c648035cb dlm: switch to use rhashtable for rsbs
Replace our own hash table with the more advanced rhashtable
for keeping rsb structs.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16 14:34:39 -05:00
Alexander Aring
93a693d19d dlm: add rsb lists for iteration
To prepare for using rhashtable, add two rsb lists for iterating
through rsb's in two uncommon cases where this is necesssary:
- when dumping rsb state from debugfs, now using seq_list.
- when looking at all rsb's during recovery.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16 14:33:25 -05:00
Alexander Aring
2d90354027 dlm: merge toss and keep hash table lists into one list
There are several places where lock processing can perform two hash table
lookups, first in the "keep" list, and if not found, in the "toss" list.
This patch introduces a new rsb state flag "RSB_TOSS" to represent the
difference between the state of being on keep vs toss list, so that the
two lists can be combined.  This avoids cases of two lookups.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16 13:49:13 -05:00
Alexander Aring
dcdaad05ca dlm: change to single hashtable lock
Prepare to replace our own hash table with rhashtable by replacing
the per-bucket locks in our own hash table with a single lock.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16 13:46:41 -05:00
Alexander Aring
700b04808f dlm: increment ls_count for dlm_scand
Increment the ls_count value while dlm_scand is processing a
lockspace so that release_lockspace()/remove_lockspace() will
wait for dlm_scand to finish.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16 13:42:45 -05:00
Christian Brauner
74871791ff
ntfs3: serve as alias for the legacy ntfs driver
Johan Hovold reported that removing the legacy ntfs driver broke boot
for him since his fstab uses the legacy ntfs driver to access firmware
from the original Windows partition.

Use ntfs3 as an alias for legacy ntfs if CONFIG_NTFS_FS is selected.
This is similar to how ext3 is treated.

Link: https://lore.kernel.org/r/Zf2zPf5TO5oYt3I3@hovoldconsulting.com
Link: https://lore.kernel.org/r/20240325-hinkriegen-zuziehen-d7e2c490427a@brauner
Fixes: 7ffa8f3d30 ("fs: Remove NTFS classic")
Tested-by: Johan Hovold <johan+linaro@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johan Hovold <johan@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-16 10:45:26 +02:00
Linus Torvalds
96fca68c4f nfsd-6.9 fixes:
- Fix a potential tracepoint crash
 - Fix NFSv4 GETATTR on big-endian platforms
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmYdlTMACgkQM2qzM29m
 f5esDBAAnXOgnizrGTMkpmqWL11UmpIjWDyTxQ7dWrk7dqQGXT3qAAya3dijaJiM
 a1eLdFiaaKFxtkFrR9QPCtqfpR/gNxkkHf05SK/LQ1SL2OMbAMa1/UIaf0teWM78
 CafmMT1YLMyiEDFpB0rAnoJ5VvTU2BVowjfzAW/0PkmwLlO5+XMMhPx/qd1061Ll
 gwl2pqwZPankZRWsUBZtDE5bCTuKQDePrG7e7J7FKVPR+1EqAcudsDMh1tmSTvar
 0NTeLH0LTJ2imZi21b+j9+VKtwXTtmuY2GxhADNb8goUuQI2+lqNakDk4AflQvuy
 Kg3Z0dnNkTWGKPIbV/020vhN/6Fev5RVF9SdPF5WcEfeaWDV5rjEY1s4svphUuS+
 Nh8VCPeQEAamAcShA584G8onWdXGP9sYgBiWXZvh8R38Akq6AC6LPEkbqT6dR5mU
 ftMDGb3BBvkOs7ahjaiUUaPqoRXxeS+Qh06Sa3JrZhbMFdccZRq/AgodtC7ZYGZZ
 4u7yG+y8MIytHbIljE2aCo8U8jV8f4nl6VV3xda3H9zZG0RRfpZfFetHiAWqRjoq
 BEB75eLFDjf1qAXENWzzdeS0wLRRr5PHIkBfDeFq71zyJO37RH15sfVnavinj2KY
 7a0ASn2xlqzDHY7MTZ2ULRCLYsS7XwN88KBF7tNghfQBKJYs59A=
 =wAk4
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix a potential tracepoint crash

 - Fix NFSv4 GETATTR on big-endian platforms

* tag 'nfsd-6.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: fix endianness issue in nfsd4_encode_fattr4
  SUNRPC: Fix rpcgss_context trace event acceptor field
2024-04-15 14:09:47 -07:00
Linus Torvalds
cef27048e5 bcachefs fixes for 6.9-rc5
various recovery fixes:
 
 - fixes for the btree_insert_entry being resized on path allocation
   btree_path array recently became dynamically resizable, and
   btree_insert_entry along with it; this was being observed during
   journal replay, when write buffer btree updates don't use the write
   buffer and instead use the normal btree update path
 - multiple fixes for deadlock in recovery when we need to do lots of
   btree node merges; excessive merges were clocking up the whole
   pipeline
 - write buffer path now correctly does btree node merges when needed
 - fix failure to go RW when superblock indicates recovery passes needed
   (i.e. to complete an unfinished upgrade)
 
 various unsafety fixes - test case contributed by a user who had two
 drives out of a six drive array write out a whole bunch of garbage after
 power failure
 
 new (tiny) on disk format feature: since it appears the btree node scan
 tool will be a more regular thing (crappy hardware, user error) - this
 adds a 64 bit per-device bitmap of regions that have ever had btree
 nodes.
 
 a path->should_be_locked fix, from a larger patch series tightening up
 invariants and assertions around btree transaction and path locking
 state; this particular fix prevents us from keeping around btree_paths
 that are no longer needed.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmYdaRIACgkQE6szbY3K
 bnbqcA/9ETT0Jekf/V4klQmoWj9GX5nQstUz+ENABNNPL+5hld62EojiRvOW2qwU
 zVs7O0M59B8/+4v4KJoW+RqnLFjAF4z/Gf+/Uw9WarsHAKIxxFFFARxG93JpGqOn
 nGa8RSw0BaYQIdbMR0Bdacc2f0N+JkJQx956/+JV7EG5MAJqXgz00AvIuLqMZ+2t
 0m9av3n0tVmstyvvGqk8pouvQjK0XUvIDYN3oiUDl7WXOAIKXDlp6yviiGnTbusq
 DssmIt5fdeVBq/DAk5PMNEKM9NUP+weIZW1UWPWINaicarqyV+pn2fhvLrBxVl7q
 zBSN3v28viaABKC8A15b2bqj3IT2WIBDoBCEi406akMao9eiVsE6is13rFkPQwQI
 Obhc7NNDyOPPTvX25M3tKXpr8rSGoD2qHIMMKMIBe1ZWscj6lMbmUBErwzTOAW4+
 pNTvzWT2XwcS7tE8Fx50ZxcehTQl6ir0hQvjJL5JV2po8XMbdGxcImBe6xPmAa3n
 /XIzyglL8IvW494wjCsHxtTeOt+f8nW7BXJCrWB71UQeXIXq4b9FADOwWtlGTnxJ
 6XNprfi8TSp+RsSRxav6DBw2ou5viGjAjP2ddrO6Lw37XUYV0igS+BeDNEPA4dwI
 ZlbCzNE7qSXK2rjmGjyu7GCJ3+NOxJDQ8GdxkTDtpPrBF2kCOkQ=
 =NAId
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-04-15' of https://evilpiepirate.org/git/bcachefs

Pull yet more bcachefs fixes from Kent Overstreet:
 "This gets recovery working again for the affected user I've been
  working with, and I'm still waiting to hear back on other bug reports
  but should fix it for everyone else who's been having issues with
  recovery.

   - Various recovery fixes:

       - fixes for the btree_insert_entry being resized on path
         allocation btree_path array recently became dynamically
         resizable, and btree_insert_entry along with it; this was being
         observed during journal replay, when write buffer btree updates
         don't use the write buffer and instead use the normal btree
         update path

       - multiple fixes for deadlock in recovery when we need to do lots
         of btree node merges; excessive merges were clocking up the
         whole pipeline

       - write buffer path now correctly does btree node merges when
         needed

       - fix failure to go RW when superblock indicates recovery passes
         needed (i.e. to complete an unfinished upgrade)

   - Various unsafety fixes - test case contributed by a user who had
     two drives out of a six drive array write out a whole bunch of
     garbage after power failure

   - New (tiny) on disk format feature: since it appears the btree node
     scan tool will be a more regular thing (crappy hardware, user
     error) - this adds a 64 bit per-device bitmap of regions that have
     ever had btree nodes.

   - A path->should_be_locked fix, from a larger patch series tightening
     up invariants and assertions around btree transaction and path
     locking state.

     This particular fix prevents us from keeping around btree_paths
     that are no longer needed"

* tag 'bcachefs-2024-04-15' of https://evilpiepirate.org/git/bcachefs: (24 commits)
  bcachefs: set_btree_iter_dontneed also clears should_be_locked
  bcachefs: fix error path of __bch2_read_super()
  bcachefs: Check for backpointer bucket_offset >= bucket size
  bcachefs: bch_member.btree_allocated_bitmap
  bcachefs: sysfs internal/trigger_journal_flush
  bcachefs: Fix bch2_btree_node_fill() for !path
  bcachefs: add safety checks in bch2_btree_node_fill()
  bcachefs: Interior known are required to have known key types
  bcachefs: add missing bounds check in __bch2_bkey_val_invalid()
  bcachefs: Fix btree node merging on write buffer btrees
  bcachefs: Disable merges from interior update path
  bcachefs: Run merges at BCH_WATERMARK_btree
  bcachefs: Fix missing write refs in fs fio paths
  bcachefs: Fix deadlock in journal replay
  bcachefs: Go rw if running any explicit recovery passes
  bcachefs: Standardize helpers for printing enum strs with bounds checks
  bcachefs: don't queue btree nodes for rewrites during scan
  bcachefs: fix race in bch2_btree_node_evict()
  bcachefs: fix unsafety in bch2_stripe_to_text()
  bcachefs: fix unsafety in bch2_extent_ptr_to_text()
  ...
2024-04-15 11:01:11 -07:00
Kent Overstreet
ad29cf999a bcachefs: set_btree_iter_dontneed also clears should_be_locked
This is part of a larger series cleaning up the semantics of
should_be_locked and adding assertions around it; if we don't need an
iterator/path anymore, it clearly doesn't need to be locked.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-15 13:31:15 -04:00
Chao Yu
3078e059a5 bcachefs: fix error path of __bch2_read_super()
In __bch2_read_super(), if kstrdup() fails, it needs to release memory
in sb->holder, fix to call bch2_free_super() in the error path.

Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-15 13:31:15 -04:00
Christoph Hellwig
0fac04e4e0
iomap: convert iomap_writepages to writeack_iter
This removes one indirect function call per folio, and adds type safety
by not casting through a void pointer.

Based on a patch by Matthew Wilcox.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240412061614.1511629-1-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-15 14:25:55 +02:00
Yang Li
09492cb451 cuse: add kernel-doc comments to cuse_process_init_reply()
This commit adds kernel-doc style comments with complete parameter
descriptions for the function  cuse_process_init_reply.

Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-04-15 11:02:10 +02:00
Danny Lin
eb4b691b91 fuse: fix leaked ENOSYS error on first statx call
FUSE attempts to detect server support for statx by trying it once and
setting no_statx=1 if it fails with ENOSYS, but consider the following
scenario:

- Userspace (e.g. sh) calls stat() on a file
  * succeeds
- Userspace (e.g. lsd) calls statx(BTIME) on the same file
  - request_mask = STATX_BASIC_STATS | STATX_BTIME
  - first pass: sync=true due to differing cache_mask
  - statx fails and returns ENOSYS
  - set no_statx and retry
  - retry sets mask = STATX_BASIC_STATS
  - now mask == cache_mask; sync=false (time_before: still valid)
  - so we take the "else if (stat)" path
  - "err" is still ENOSYS from the failed statx call

Fix this by zeroing "err" before retrying the failed call.

Fixes: d3045530bd ("fuse: implement statx")
Cc: stable@vger.kernel.org # v6.6
Signed-off-by: Danny Lin <danny@orbstack.dev>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-04-15 10:12:44 +02:00
Amir Goldstein
7cc9112628 fuse: fix parallel dio write on file open in passthrough mode
Parallel dio write takes a negative refcount of fi->iocachectr and so does
open of file in passthrough mode.

The refcount of passthrough mode is associated with attach/detach of a
fuse_backing object to fuse inode.

For parallel dio write, the backing file is irrelevant, so the call to
fuse_inode_uncached_io_start() passes a NULL fuse_backing object.

Passing a NULL fuse_backing will result in false -EBUSY error if the file
is already open in passthrough mode.

Allow taking negative fi->iocachectr refcount with NULL fuse_backing,
because it does not conflict with an already attached fuse_backing object.

Fixes: 4a90451bbc ("fuse: implement open in passthrough mode")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-04-15 10:12:44 +02:00
Amir Goldstein
4864a6dd83 fuse: fix wrong ff->iomode state changes from parallel dio write
There is a confusion with fuse_file_uncached_io_{start,end} interface.
These helpers do two things when called from passthrough open()/release():
1. Take/drop negative refcount of fi->iocachectr (inode uncached io mode)
2. State change ff->iomode IOM_NONE <-> IOM_UNCACHED (file uncached open)

The calls from parallel dio write path need to take a reference on
fi->iocachectr, but they should not be changing ff->iomode state, because
in this case, the fi->iocachectr reference does not stick around until file
release().

Factor out helpers fuse_inode_uncached_io_{start,end}, to be used from
parallel dio write path and rename fuse_file_*cached_io_{start,end} helpers
to fuse_file_*cached_io_{open,release} to clarify the difference.

Fixes: 205c1d8026 ("fuse: allow parallel dio writes with FUSE_DIRECT_IO_ALLOW_MMAP")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-04-15 10:12:03 +02:00
Kent Overstreet
f0a73d4fde bcachefs: Check for backpointer bucket_offset >= bucket size
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-14 20:02:11 -04:00
Kent Overstreet
27c15ed297 bcachefs: bch_member.btree_allocated_bitmap
This adds a small (64 bit) per-device bitmap that tracks ranges that
have btree nodes, for accelerating btree node scan if it is ever needed.

- New helpers, bch2_dev_btree_bitmap_marked() and
  bch2_dev_bitmap_mark(), for checking and updating the bitmap

- Interior btree update path updates the bitmaps when required

- The check_allocations pass has a new fsck_err check,
  btree_bitmap_not_marked

- New on disk format version, mi_btree_mitmap, which indicates the new
  bitmap is present

- Upgrade table lists the required recovery pass and expected fsck error

- Btree node scan uses the bitmap to skip ranges if we're on the new
  version

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-14 20:02:11 -04:00
Kent Overstreet
bdae2a7e60 bcachefs: sysfs internal/trigger_journal_flush
Add a sysfs knob for immediately flushing the entire journal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-14 20:02:11 -04:00
Kent Overstreet
e879389f57 bcachefs: Fix bch2_btree_node_fill() for !path
We shouldn't be doing the unlock/relock dance when we're not using a
path - this fixes an assertion pop when called from btree node scan.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-14 20:02:11 -04:00
Kent Overstreet
8cf2036e7b bcachefs: add safety checks in bch2_btree_node_fill()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-14 18:01:12 -04:00
Kent Overstreet
d789e9a7d5 bcachefs: Interior known are required to have known key types
For forwards compatibilyt, we allow bkeys of unknown type in leaf nodes;
we can simply ignore metadata we don't understand. Pointers to btree
nodes must always be of known types, howwever.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-14 18:01:12 -04:00
Kent Overstreet
bceb86be9e bcachefs: add missing bounds check in __bch2_bkey_val_invalid()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-14 18:01:12 -04:00
Linus Torvalds
72374d71c3 Get rid of lockdep false positives around sysfs/overlayfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZhu2kwAKCRBZ7Krx/gZQ
 62gzAP9eeADy6rQkzgWJ8d8sKzGfmd0nup9WlCOxZSR0XojTXwEAnue47dn7PlMx
 wQY0joZ0V5FO8PNTEbWc2P/dSQrANgc=
 =MshW
 -----END PGP SIGNATURE-----

Merge tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull sysfs fix from Al Viro:
 "Get rid of lockdep false positives around sysfs/overlayfs

  syzbot has uncovered a class of lockdep false positives for setups
  with sysfs being one of the backing layers in overlayfs. The root
  cause is that of->mutex allocated when opening a sysfs file read-only
  (which overlayfs might do) is confused with of->mutex of a file opened
  writable (held in write to sysfs file, which overlayfs won't do).

  Assigning them separate lockdep classes fixes that bunch and it's
  obviously safe"

* tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kernfs: annotate different lockdep class for of->mutex of writable files
2024-04-14 11:41:51 -07:00
Amir Goldstein
16b52bbee4 kernfs: annotate different lockdep class for of->mutex of writable files
The writable file /sys/power/resume may call vfs lookup helpers for
arbitrary paths and readonly files can be read by overlayfs from vfs
helpers when sysfs is a lower layer of overalyfs.

To avoid a lockdep warning of circular dependency between overlayfs
inode lock and kernfs of->mutex, use a different lockdep class for
writable and readonly kernfs files.

Reported-by: syzbot+9a5b0ced8b1bfb238b56@syzkaller.appspotmail.com
Fixes: 0fedefd4c4 ("kernfs: sysfs: support custom llseek method for sysfs entries")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-04-14 06:55:46 -04:00
Kent Overstreet
86dbf8c566 bcachefs: Fix btree node merging on write buffer btrees
The btree write buffer flush fastpath that avoids the main transaction
commit path had the unfortunate side effect of not doing btree node
merging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:49:25 -04:00
Kent Overstreet
3f10048973 bcachefs: Disable merges from interior update path
There's been a bug in the btree write buffer where it wasn't triggering
btree node merges - and leaving behind a bunch of nearly empty btree
nodes.

Then during journal replay, when updates to the backpointers btree
aren't using the btree write buffer (because we require synchronization
with journal replay), we end up doing those merges all at once.

Then if it's the interior update path running them, we deadlock because
those run with the highest watermark.

There's no real need for the interior update path to be doing btree node
merges; other code paths can handle that at lower watermarks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:49:25 -04:00
Kent Overstreet
9054ef2ea9 bcachefs: Run merges at BCH_WATERMARK_btree
This fixes a deadlock where the interior update path during journal
replay ends up doing a ton of merges on the backpointers btree, and
deadlocking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:49:25 -04:00
Kent Overstreet
9e203c43dc bcachefs: Fix missing write refs in fs fio paths
bch2_journal_flush_seq requires us to have a write ref

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:48:17 -04:00
Kent Overstreet
82cf18f23e bcachefs: Fix deadlock in journal replay
btree_key_can_insert_cached() should be checking the watermark -
BCH_TRANS_COMMIT_journal_replay really means nonblocking mode when
watermark < reclaim, it was being used incorrectly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:48:17 -04:00
Kent Overstreet
4518e80adf bcachefs: Go rw if running any explicit recovery passes
This fixes a bug where we fail to start when upgrading/downgrading
because we forgot we needed to go rw.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:48:17 -04:00
Kent Overstreet
9abb6dd7ce bcachefs: Standardize helpers for printing enum strs with bounds checks
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:48:17 -04:00
Kent Overstreet
ba8ed36e72 bcachefs: don't queue btree nodes for rewrites during scan
many nodes found during scan will be old nodes, overwritten by newer
nodes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:48:17 -04:00
Kent Overstreet
7b4c4ccf84 bcachefs: fix race in bch2_btree_node_evict()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:48:16 -04:00
Kent Overstreet
2aeed876d7 bcachefs: fix unsafety in bch2_stripe_to_text()
.to_text() functions need to work on key values that didn't pass .valid

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:48:16 -04:00
Kent Overstreet
dc32c118ec bcachefs: fix unsafety in bch2_extent_ptr_to_text()
Need to check if we have a valid bucket before checking if ptr is stale

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:48:16 -04:00
Kent Overstreet
87cb0239c8 bcachefs: btree node scan: handle encrypted nodes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:48:16 -04:00
Kent Overstreet
031ad9e7db bcachefs: Check for packed bkeys that are too big
add missing validation; fixes assertion pop in bkey unpack

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:48:16 -04:00
Kent Overstreet
58caa786f1 bcachefs: Fix UAFs of btree_insert_entry array
The btree paths array is now dynamically resizable - and as well the
btree_insert_entries array, as it needs to be the same size.

The merge path (and interior update path) allocates new btree paths,
thus can trigger a resize; thus we need to not retain direct pointers
after invoking merge; similarly when running btree node triggers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:48:16 -04:00
Linus Torvalds
76b0e9c429 zonefs fixes for 6.9-rc4
- Suppress a coccicheck warning using str_plural().
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCZhnikwAKCRDdoc3SxdoY
 ditrAP9s/r36Dl4c8BZxpoUyCmf44ww6oTMxGzjwi/4gA8Ry2gEA766WGVsDyJf2
 G0qjl7+bVneaOG6Xayzce69OjvZGZws=
 =U6kP
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs fix from Damien Le Moal:

 - Suppress a coccicheck warning using str_plural()

* tag 'zonefs-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: Use str_plural() to fix Coccinelle warning
2024-04-13 10:25:32 -07:00
Linus Torvalds
fa4022cb73 Four cifs.ko changesets, most also for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmYaS/8ACgkQiiy9cAdy
 T1G3OAwAjRyhi2XHb0QtA0YpZN0FLFghoY/+hgAYaS20WokdqbdyPfZeievRn6Jq
 zgoRYmPF9ihEAZ/UnNIu5ficHoFFsNTxPodQKOTMKxtYg7RCWv7ZF87+/OafuG/k
 FbYue7LQ1CvloEBMsNg088uFQsCp4u8k/9IaHHf+wEDmti42JTqoLjg5YsgbcotI
 Qt8VtxTVL+FzPax/qQkXtBvHggxTfs7VWvwRkv21NsotQzBoPGxHxv3bUV9G54Ez
 56w6Qg+R4ZSwIGgrRGAIjTZ0E6lPuIBanmePJlFwFB0em7Ajl3K2BBJUXpTsxgEd
 aDgJKFZ+ZmDpp5GWlYRccmy/jm6uOiLdCxPye/EQk5KK/oUNpFUUQ95OJ1clH6Su
 pR8LLk1yPQH+e8eyx0Exq7H8l0lsy4pHzerVyUgdgd9E1IZ3+MIN261vTu2TTfGq
 CtML0WNcZFnQdYbMnkADAZ7zYgr32U0UiWr6WS+V0rugyguDrPIEiIneFl3rSZsq
 MQaOHyS5
 =AHB9
 -----END PGP SIGNATURE-----

Merge tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - fix for oops in cifs_get_fattr of deleted files

 - fix for the remote open counter going negative in some directory
   lease cases

 - fix for mkfifo to instantiate dentry to avoid possible crash

 - important fix to allow handling key rotation for mount and remount
   (ie cases that are becoming more common when password that was used
   for the mount will expire soon but will be replaced by new password)

* tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: fix broken reconnect when password changing on the server by allowing password rotation
  smb: client: instantiate when creating SFU files
  smb3: fix Open files on server counter going negative
  smb: client: fix NULL ptr deref in cifs_mark_open_handles_for_deleted_file()
2024-04-13 10:10:18 -07:00
Linus Torvalds
42bd2af595
vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
"The definition of insanity is doing the same thing over and over
    again and expecting different results”

We've tried to do this before, most recently with commit bb2314b479
("fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink") about a
decade ago.

But the effort goes back even further than that, eg this thread back
from 1998 that is so old that we don't even have it archived in lore:

    https://lkml.org/lkml/1998/3/10/108

which also points out some of the reasons why it's dangerous.

Or, how about then in 2003:

    https://lkml.org/lkml/2003/4/6/112

where we went through some of the same arguments, just wirh different
people involved.

In particular, having access to a file descriptor does not necessarily
mean that you have access to the path that was used for lookup, and
there may be very good reasons why you absolutely must not have access
to a path to said file.

For example, if we were passed a file descriptor from the outside into
some limited environment (think chroot, but also user namespaces etc) a
'flink()' system call could now make that file visible inside a context
where it's not supposed to be visible.

In the process the user may also be able to re-open it with permissions
that the original file descriptor did not have (eg a read-only file
descriptor may be associated with an underlying file that is writable).

Another variation on this is if somebody else (typically root) opens a
file in a directory that is not accessible to others, and passes the
file descriptor on as a read-only file.  Again, the access to the file
descriptor does not imply that you should have access to a path to the
file in the filesystem.

So while we have tried this several times in the past, it never works.

The last time we did this, that commit bb2314b479 quickly got reverted
again in commit f0cc6ffb8c (Revert "fs: Allow unprivileged linkat(...,
AT_EMPTY_PATH) aka flink"), with a note saying "We may re-do this once
the whole discussion about the interface is done".

Well, the discussion is long done, and didn't come to any resolution.
There's no question that 'flink()' would be a useful operation, but it's
a dangerous one.

However, it does turn out that since 2008 (commit d76b0d9b2d: "CRED:
Use creds in file structs") we have had a fairly straightforward way to
check whether the file descriptor was opened by the same credentials as
the credentials of the flink().

That allows the most common patterns that people want to use, which tend
to be to either open the source carefully (ie using the openat2()
RESOLVE_xyz flags, and/or checking ownership with fstat() before
linking), or to use O_TMPFILE and fill in the file contents before it's
exposed to the world with linkat().

But it also means that if the file descriptor was opened by somebody
else, or we've gone through a credentials change since, the operation no
longer works (unless we have CAP_DAC_READ_SEARCH capabilities in the
opener's user namespace, as before).

Note that the credential equality check is done by using pointer
equality, which means that it's not enough that you have effectively the
same user - they have to be literally identical, since our credentials
are using copy-on-write semantics.

So you can't change your credentials to something else and try to change
it back to the same ones between the open() and the linkat().  This is
not meant to be some kind of generic permission check, this is literally
meant as a "the open and link calls are 'atomic' wrt user credentials"
check.

It also means that you can't just move things between namespaces,
because the credentials aren't just a list of uid's and gid's: they
includes the pointer to the user_ns that the capabilities are relative
to.

So let's try this one more time and see if maybe this approach ends up
being workable after all.

Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240411001012.12513-1-torvalds@linux-foundation.org
[brauner: relax capability check to opener of the file]
Link: https://lore.kernel.org/all/20231113-undenkbar-gediegen-efde5f1c34bc@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-13 11:33:58 +02:00
Tim Schumacher
cda30c6542 efi: Clear up misconceptions about a maximum variable name size
The UEFI specification does not make any mention of a maximum variable
name size, so the headers and implementation shouldn't claim that one
exists either.

Comments referring to this limit have been removed or rewritten, as this
is an implementation detail local to the Linux kernel.

Where appropriate, the magic value of 1024 has been replaced with
EFI_VAR_NAME_LEN, as this is used for the efi_variable struct
definition. This in itself does not change any behavior, but should
serve as points of interest when making future changes in the same area.

A related build-time check has been added to ensure that the special
512 byte sized buffer will not overflow with a potentially decreased
EFI_VAR_NAME_LEN.

Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2024-04-13 10:33:02 +02:00
Tim Schumacher
89ea21d70d efivarfs: Remove unused internal struct members
The structure was moved to the efivarfs internals in commit 2d82e6227e
("efi: vars: Move efivar caching layer into efivarfs") after previously
being used as the data ABI for efivars until its removal in commit
0f5b2c69a4 ("efi: vars: Remove deprecated 'efivars' sysfs interface").

As efivarfs only uses the structure for the variable name caching layer,
the data-related members were never in use. Remove them to avoid
implying that efivarfs is bound by the same restrictions that efivars
once had. While at it, remove the packed attribute, since we no longer
have to guarantee a stable layout.

Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2024-04-13 10:33:02 +02:00
Linus Torvalds
90d3eaaf4f Two CephFS fixes marked for stable and a MAINTAINERS update.
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmYZXkYTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziyLNCACkJv5GjDeHC22P+UB6re3gO39wdFsL
 BvHmhyebz5iidjS1l6pC/RcLOmB+OwekOiDfFvkwu0E+3d5dZTBInUV4BWzbJVLR
 Hkjd+oCEThV+f58n2Ol/r6lBGIFCgJpgtW02DQurM8wHTLm/vqxpj2GsvJCRXNr6
 lfqWVvZBteEmqxHUGNWldOQh2jNPGyvOpDRCQTzdjEfUAijoc8arLyuwbxl9tFdA
 qVtCZk+Jv0Lz9uQUSsI7aM+SEXUgGW3pZXJL9h9PPjC9d811GCwQnUcL4hNE2B3X
 Nf5Qzo9jV2GKJh8+kHui0dmpt0dMQfFZ2YaeJDZl8c8dE2uqBL+zMO79
 =mvie
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.9-rc4' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two CephFS fixes marked for stable and a MAINTAINERS update"

* tag 'ceph-for-6.9-rc4' of https://github.com/ceph/ceph-client:
  MAINTAINERS: remove myself as a Reviewer for Ceph
  ceph: switch to use cap_delay_lock for the unlink delay list
  ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE
2024-04-12 10:15:46 -07:00
Linus Torvalds
5939d45155 Tracing fixes for 6.9:
- Fix the buffer_percent accounting as it is dependent on three variables:
   1) pages_read - number of subbuffers read
   2) pages_lost - number of subbuffers lost due to overwrite
   3) pages_touched - number of pages that a writer entered
   These three counters only increment, and to know how many active pages
   there are on the buffer at any given time, the pages_read and
   pages_lost are subtracted from pages_touched. But the pages touched
   was incremented whenever any writer went to the next subbuffer even
   if it wasn't the only one, so it was incremented more than it should
   be causing the counter for how many subbuffers currently have content
   incorrect, which caused the buffer_percent that holds waiters until
   the ring buffer is filled to a given percentage to wake up early.
 
 - Fix warning of unused functions when PERF_EVENTS is not configured in
 
 - Replace bad tab with space in Kconfig for FTRACE_RECORD_RECURSION_SIZE
 
 - Fix to some kerneldoc function comments in eventfs code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZhk+khQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qvs0AP98c226UFU6Dha4wvgSulC/wKVvHN3X
 jeclMTdn8RGs2gD/b9OULKNv1//6fP16ZRun7ntRQkotVhlNhf9Ee0smiwU=
 =UYrk
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix the buffer_percent accounting as it is dependent on three
   variables:

     1) pages_read - number of subbuffers read
     2) pages_lost - number of subbuffers lost due to overwrite
     3) pages_touched - number of pages that a writer entered

   These three counters only increment, and to know how many active
   pages there are on the buffer at any given time, the pages_read and
   pages_lost are subtracted from pages_touched.

   But the pages touched was incremented whenever any writer went to the
   next subbuffer even if it wasn't the only one, so it was incremented
   more than it should be causing the counter for how many subbuffers
   currently have content incorrect, which caused the buffer_percent
   that holds waiters until the ring buffer is filled to a given
   percentage to wake up early.

 - Fix warning of unused functions when PERF_EVENTS is not configured in

 - Replace bad tab with space in Kconfig for FTRACE_RECORD_RECURSION_SIZE

 - Fix to some kerneldoc function comments in eventfs code.

* tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Only update pages_touched when a new page is touched
  tracing: hide unused ftrace_event_id_fops
  tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entry
  eventfs: Fix kernel-doc comments to functions
2024-04-12 09:02:24 -07:00
Kent Overstreet
2b3e79fea6 bcachefs: Don't use bch2_btree_node_lock_write_nofail() in btree split path
It turns out - btree splits happen with the rest of the transaction
still locked, to avoid unnecessary restarts, which means using nofail
doesn't work here - we can deadlock.

Fortunately, we now have the ability to return errors here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-11 23:45:12 -04:00
Joakim Sindholt
7fd524b9bd
fs/9p: drop inodes immediately on non-.L too
Signed-off-by: Joakim Sindholt <opensource@zhasha.com>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-04-11 23:40:55 +00:00
Eric Van Hensbergen
824f06ff81
fs/9p: Revert "fs/9p: fix dups even in uncached mode"
This reverts commit be57855f50.

It caused a regression involving duplicate inode numbers in
some tester trees.  The bad behavior seems to be dependent on inode
reuse policy in underlying file system, so it did not trigger in my
test setup.

Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-04-11 23:36:33 +00:00
Yang Li
a8fa658eeb eventfs: Fix kernel-doc comments to functions
This commit fix kernel-doc style comments with complete parameter
descriptions for the lookup_file(),lookup_dir_entry() and
lookup_file_dentry().

Link: https://lore.kernel.org/linux-trace-kernel/20240322062604.28862-1-yang.lee@linux.alibaba.com

Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-04-11 17:42:09 -04:00
Jakub Kicinski
94426ed213 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/unix/garbage.c
  47d8ac011f ("af_unix: Fix garbage collector racing against connect()")
  4090fa373f ("af_unix: Replace garbage collection algorithm.")

Adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  faa12ca245 ("bnxt_en: Reset PTP tx_avail after possible firmware reset")
  b3d0083caf ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()")

drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
  7ac10c7d72 ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()")
  194fad5b27 ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions")

drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
  958f56e483 ("net/mlx5e: Un-expose functions in en.h")
  49e6c93870 ("net/mlx5e: RSS, Block XOR hash with over 128 channels")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 14:23:47 -07:00
Steve French
35f834265e smb3: fix broken reconnect when password changing on the server by allowing password rotation
There are various use cases that are becoming more common in which password
changes are scheduled on a server(s) periodically but the clients connected
to this server need to stay connected (even in the face of brief network
reconnects) due to mounts which can not be easily unmounted and mounted at
will, and servers that do password rotation do not always have the ability
to tell the clients exactly when to the new password will be effective,
so add support for an alt password ("password2=") on mount (and also
remount) so that we can anticipate the upcoming change to the server
without risking breaking existing mounts.

An alternative would have been to use the kernel keyring for this but the
processes doing the reconnect do not have access to the keyring but do
have access to the ses structure.

Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-11 16:03:48 -05:00
Paulo Alcantara
c6ff459037 smb: client: instantiate when creating SFU files
In cifs_sfu_make_node(), on success, instantiate rather than leave it
with dentry unhashed negative to support callers that expect mknod(2)
to always instantiate.

This fixes the following test case:

  mount.cifs //srv/share /mnt -o ...,sfu
  mkfifo /mnt/fifo
  ./xfstests/ltp/growfiles -b -W test -e 1 -u -i 0 -L 30 /mnt/fifo
  ...
  BUG: unable to handle page fault for address: 000000034cec4e58
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 1 PREEMPT SMP PTI
  CPU: 0 PID: 138098 Comm: growfiles Kdump: loaded Not tainted
  5.14.0-436.3987_1240945149.el9.x86_64 #1
  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
  RIP: 0010:_raw_callee_save__kvm_vcpu_is_preempted+0x0/0x20
  Code: e8 15 d9 61 00 e9 63 ff ff ff 41 bd ea ff ff ff e9 58 ff ff ff e8
  d0 71 c0 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <48> 8b 04
  fd 60 2b c1 99 80 b8 90 50 03 00 00 0f 95 c0 c3 cc cc cc
  RSP: 0018:ffffb6a143cf7cf8 EFLAGS: 00010206
  RAX: ffff8a9bc30fb038 RBX: ffff8a9bc666a200 RCX: ffff8a9cc0260000
  RDX: 00000000736f622e RSI: ffff8a9bc30fb038 RDI: 000000007665645f
  RBP: ffffb6a143cf7d70 R08: 0000000000001000 R09: 0000000000000001
  R10: 0000000000000001 R11: 0000000000000000 R12: ffff8a9bc666a200
  R13: 0000559a302a12b0 R14: 0000000000001000 R15: 0000000000000000
  FS: 00007fbed1dbb740(0000) GS:ffff8a9cf0000000(0000)
  knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000034cec4e58 CR3: 0000000128ec6006 CR4: 0000000000770ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  PKRU: 55555554
  Call Trace:
   <TASK>
   ? show_trace_log_lvl+0x1c4/0x2df
   ? show_trace_log_lvl+0x1c4/0x2df
   ? __mutex_lock.constprop.0+0x5f7/0x6a0
   ? __die_body.cold+0x8/0xd
   ? page_fault_oops+0x134/0x170
   ? exc_page_fault+0x62/0x150
   ? asm_exc_page_fault+0x22/0x30
   ? _pfx_raw_callee_save__kvm_vcpu_is_preempted+0x10/0x10
   __mutex_lock.constprop.0+0x5f7/0x6a0
   ? __mod_memcg_lruvec_state+0x84/0xd0
   pipe_write+0x47/0x650
   ? do_anonymous_page+0x258/0x410
   ? inode_security+0x22/0x60
   ? selinux_file_permission+0x108/0x150
   vfs_write+0x2cb/0x410
   ksys_write+0x5f/0xe0
   do_syscall_64+0x5c/0xf0
   ? syscall_exit_to_user_mode+0x22/0x40
   ? do_syscall_64+0x6b/0xf0
   ? sched_clock_cpu+0x9/0xc0
   ? exc_page_fault+0x62/0x150
   entry_SYSCALL_64_after_hwframe+0x6e/0x76

Cc: stable@vger.kernel.org
Fixes: 72bc63f5e2 ("smb3: fix creating FIFOs when mounting with "sfu" mount option")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-11 16:03:40 -05:00
Steve French
28e0947651 smb3: fix Open files on server counter going negative
We were decrementing the count of open files on server twice
for the case where we were closing cached directories.

Fixes: 8e843bf38f ("cifs: return a single-use cfid if we did not get a lease")
Cc: stable@vger.kernel.org
Acked-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-11 16:02:02 -05:00
Xiubo Li
17f8dc2db5 ceph: switch to use cap_delay_lock for the unlink delay list
The same list item will be used in both cap_delay_list and
cap_unlink_delay_list, so it's buggy to use two different locks
to protect them.

Cc: stable@vger.kernel.org
Fixes: dbc347ef7f ("ceph: add ceph_cap_unlink_work to fire check_caps() immediately")
Link: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AODC76VXRAMXKLFDCTK4TKFDDPWUSCN5
Reported-by: Marc Ruhmann <ruhmann@luis.uni-hannover.de>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Marc Ruhmann <ruhmann@luis.uni-hannover.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-04-11 22:56:28 +02:00
Linus Torvalds
e1dc191dbf bcachefs fixes for v6.9-rc4
Notable user impacting bugs
 
 - On multi device filesystems, recovery was looping in
   btree_trans_too_many_iters(). This checks if a transaction has touched
   too many btree paths (because of iteration over many keys), and isuses
   a restart to drop unneeded paths. But it's now possible for some paths
   to exceed the previous limit without iteration in the interior btree
   update path, since the transaction commit will do alloc updates for
   every old and new btree node, and during journal replay we don't use
   the btree write buffer for locking reasons and thus those updates use
   btree paths when they wouldn't normally.
 
 - Fix a corner case in rebalance when moving extents on a durability=0
   device. This wouldn't be hit when a device was formatted with
   durability=0 since in that case we'll only use it as a write through
   cache (only cached extents will live on it), but durability can now be
   changed on an existing device.
 
 - bch2_get_acl() could rarely forget to handle a transaction restart;
   this manifested as the occasional missing acl that came back after
   dropping caches.
 
 - Fix a major performance regression on high iops multithreaded write
   workloads (only since 6.9-rc1); a previous fix for a deadlock in the
   interior btree update path to check the journal watermark introduced a
   dependency on the state of btree write buffer flushing that we didn't
   want.
 
 - Assorted other repair paths and recovery fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmYXTd8ACgkQE6szbY3K
 bna4MA//Y/CSB2JupxJPFUAb69+WmNDMnJJV3FlD/Hwo19kOR7aRbKQMUxsH51nb
 dfv5o/58n39QIBqMcMtTipnoND6jrwv7l5NimFKQmj/YehxosdsOf2BgbD/M4Ozz
 84IUmHXSs5+zezjF8IAw/bjR/p13XNJGSYTfl1RWbGUMERLpfZcLn70FvoCR/qpQ
 Bcp+z70K6bBhrPZqFYd2mEC+Cfo42aCD1lqWUQ/e0FHiNEZnCNH2lNca4phBzGt4
 f9sBxBcwmDfizQxqpyZ4izRzbS9ZwZ3ega336L2DrPpwgjMgTRLKLjdXqJDjGvDW
 ngvnaUw7SgICs+q48g3f67cGhw/4lSPdVu/9a/ldqDEP9PynQiUS1G8aDofLdboW
 xCZM6toX86p0DMNY2kP9vc5kxr377cSXAL7VeKbE+ZV5vGyEz1qFDLRAYVixHDfr
 Q7KgYvoJq8CgjfK7rIZbO2tqKhz2TP4+2rJa6tDLwKXs5ice++w/aF5d/nBZ7iCe
 +dyJ+aJiosjHEVG+QscACFrjBJdNNspJqBWDP396XeOsdl+iCeIRP0VHOGytLLRf
 gisE0Lhj4pz6bv7OhAAkdxvegVQ8HfqN1E+f/WM7Kogqos0NS2skEhy9cTuDCHUm
 qtPTUq5XNibiE6J+NOK86pu6o+6sqpWIpfTPuTbMid5sevLsomE=
 =BQC0
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs

Pull more bcachefs fixes from Kent Overstreet:
 "Notable user impacting bugs

   - On multi device filesystems, recovery was looping in
     btree_trans_too_many_iters(). This checks if a transaction has
     touched too many btree paths (because of iteration over many keys),
     and isuses a restart to drop unneeded paths.

     But it's now possible for some paths to exceed the previous limit
     without iteration in the interior btree update path, since the
     transaction commit will do alloc updates for every old and new
     btree node, and during journal replay we don't use the btree write
     buffer for locking reasons and thus those updates use btree paths
     when they wouldn't normally.

   - Fix a corner case in rebalance when moving extents on a
     durability=0 device. This wouldn't be hit when a device was
     formatted with durability=0 since in that case we'll only use it as
     a write through cache (only cached extents will live on it), but
     durability can now be changed on an existing device.

   - bch2_get_acl() could rarely forget to handle a transaction restart;
     this manifested as the occasional missing acl that came back after
     dropping caches.

   - Fix a major performance regression on high iops multithreaded write
     workloads (only since 6.9-rc1); a previous fix for a deadlock in
     the interior btree update path to check the journal watermark
     introduced a dependency on the state of btree write buffer flushing
     that we didn't want.

   - Assorted other repair paths and recovery fixes"

* tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs: (25 commits)
  bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()
  bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()
  bcachefs: Fix a race in btree_update_nodes_written()
  bcachefs: btree_node_scan: Respect member.data_allowed
  bcachefs: Don't scan for btree nodes when we can reconstruct
  bcachefs: Fix check_topology() when using node scan
  bcachefs: fix eytzinger0_find_gt()
  bcachefs: fix bch2_get_acl() transaction restart handling
  bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list
  bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()
  bcachefs: Rename struct field swap to prevent macro naming collision
  MAINTAINERS: Add entry for bcachefs documentation
  Documentation: filesystems: Add bcachefs toctree
  bcachefs: JOURNAL_SPACE_LOW
  bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE
  bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems
  bcachefs: fix rand_delete unit test
  bcachefs: fix ! vs ~ typo in __clear_bit_le64()
  bcachefs: Fix rebalance from durability=0 device
  bcachefs: Print shutdown journal sequence number
  ...
2024-04-11 11:24:55 -07:00
NeilBrown
b372e96bd0 ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE
The page has been marked clean before writepage is called.  If we don't
redirty it before postponing the write, it might never get written.

Cc: stable@vger.kernel.org
Fixes: 503d4fa6ee ("ceph: remove reliance on bdi congestion")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-04-11 19:17:02 +02:00
Marco Elver
c82389947d tracing: Add sched_prepare_exec tracepoint
Add "sched_prepare_exec" tracepoint, which is run right after the point
of no return but before the current task assumes its new exec identity.

Unlike the tracepoint "sched_process_exec", the "sched_prepare_exec"
tracepoint runs before flushing the old exec, i.e. while the task still
has the original state (such as original MM), but when the new exec
either succeeds or crashes (but never returns to the original exec).

Being able to trace this event can be helpful in a number of use cases:

  * allowing tracing eBPF programs access to the original MM on exec,
    before current->mm is replaced;
  * counting exec in the original task (via perf event);
  * profiling flush time ("sched_prepare_exec" to "sched_process_exec").

Example of tracing output:

 $ cat /sys/kernel/debug/tracing/trace_pipe
    <...>-379  [003] .....  179.626921: sched_prepare_exec: interp=/usr/bin/sshd filename=/usr/bin/sshd pid=379 comm=sshd
    <...>-381  [002] .....  180.048580: sched_prepare_exec: interp=/bin/bash filename=/bin/bash pid=381 comm=sshd
    <...>-385  [001] .....  180.068277: sched_prepare_exec: interp=/usr/bin/tty filename=/usr/bin/tty pid=385 comm=bash
    <...>-389  [006] .....  192.020147: sched_prepare_exec: interp=/usr/bin/dmesg filename=/usr/bin/dmesg pid=389 comm=bash

Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240411102158.1272267-1-elver@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-04-11 09:02:21 -07:00
Vasily Gorbik
f488138b52 NFSD: fix endianness issue in nfsd4_encode_fattr4
The nfs4 mount fails with EIO on 64-bit big endian architectures since
v6.7. The issue arises from employing a union in the nfsd4_encode_fattr4()
function to overlay a 32-bit array with a 64-bit values based bitmap,
which does not function as intended. Address the endianness issue by
utilizing bitmap_from_arr32() to copy 32-bit attribute masks into a
bitmap in an endianness-agnostic manner.

Cc: stable@vger.kernel.org
Fixes: fce7913b13 ("NFSD: Use a bitmask loop to encode FATTR4 results")
Link: https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/2060217
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-11 09:21:06 -04:00
Alan Stern
a90bca2228 fs: sysfs: Fix reference leak in sysfs_break_active_protection()
The sysfs_break_active_protection() routine has an obvious reference
leak in its error path.  If the call to kernfs_find_and_get() fails then
kn will be NULL, so the companion sysfs_unbreak_active_protection()
routine won't get called (and would only cause an access violation by
trying to dereference kn->parent if it was called).  As a result, the
reference to kobj acquired at the start of the function will never be
released.

Fix the leak by adding an explicit kobject_put() call when kn is NULL.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Fixes: 2afc9166f7 ("scsi: sysfs: Introduce sysfs_{un,}break_active_protection()")
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/8a4d3f0f-c5e3-4b70-a188-0ca433f9e6f9@rowland.harvard.edu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-11 15:16:48 +02:00
Colin Ian King
fd0a133ef6
fs/direct-io: remove redundant assignment to variable retval
The variable retval is being assigned a value that is not being read,
it is being re-assigned later on in the function. The assignment
is redundant and can be removed.

Cleans up clang scan build warning:
fs/direct-io.c:1220:2: warning: Value stored to 'retval' is never
read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20240410162221.292485-1-colin.i.king@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-11 10:21:24 +02:00
Linus Torvalds
03a55b6391 Bootconfig fixes for v6.9-rc3:
- fs/proc: Fix to not show original kernel cmdline more than twice on
   /proc/bootconfig.
 - fs/proc: Fix to show the original cmdline only if the bootconfig
   modifies it.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmYWrVMbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bwxMH/0c/y0wimIlEgqhl27j1
 +SCIA9SBF3oZ7P9Jajs/tZZf8jNwOsuXFBQbwqgAkxdKbolaLkhbLOBqKg8LVodH
 fzpDh+w0BxYVnPLTQe1dpOrygEcfWEM7RdwskwDYyNDDuTZecYkbmHF3mlYnyj5h
 /aPSBcfqHqjM8ltUxjus2B2kMcS/Khun2HVyhQRpVeiRMZPtLpdh9RyNhAwwlfGV
 9WLH5JTg45EHN5V9GL5RzX86HIQ82ZfWtqTKdrU7u7ahph3NN6enkD1+MI2vMqjG
 t8F/GsFeFdgXfVE2B4HZPWdA07PUeh416iDeajZBY8QlyK3rt/vATkKcGyyYdVJK
 EGI=
 =N4qD
 -----END PGP SIGNATURE-----

Merge tag 'bootconfig-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull bootconfig fixes from Masami Hiramatsu:

 - show the original cmdline only once, and only if it was modeified by
   bootconfig

* tag 'bootconfig-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fs/proc: Skip bootloader comment if no embedded kernel parameters
  fs/proc: remove redundant comments from /proc/bootconfig
2024-04-10 19:42:45 -07:00
Kent Overstreet
1189bdda6c bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()
We weren't respecting trans->journal_replay_not_finished - we shouldn't
be searching the journal keys unless we have a ref on them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-10 22:28:36 -04:00
Kent Overstreet
517236cb3e bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()
dropping read locks in bch2_btree_node_lock_write_nofail() dates from
before we had the cycle detector; we can now tell the cycle detector
directly when taking a lock may not fail because we can't handle
transaction restarts.

This is needed for adding should_be_locked asserts.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-10 22:28:36 -04:00
Kent Overstreet
beccf29114 bcachefs: Fix a race in btree_update_nodes_written()
One btree update might have terminated in a node update, and then while
it is in flight another btree update might free that original node.

This race has to be handled in btree_update_nodes_written() - we were
missing a READ_ONCE().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-10 22:28:36 -04:00
Paulo Alcantara
ec4535b2a1 smb: client: fix NULL ptr deref in cifs_mark_open_handles_for_deleted_file()
cifs_get_fattr() may be called with a NULL inode, so check for a
non-NULL inode before calling
cifs_mark_open_handles_for_deleted_file().

This fixes the following oops:

  mount.cifs //srv/share /mnt -o ...,vers=3.1.1
  cd /mnt
  touch foo; tail -f foo &
  rm foo
  cat foo

  BUG: kernel NULL pointer dereference, address: 00000000000005c0
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 2 PID: 696 Comm: cat Not tainted 6.9.0-rc2 #1
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
  1.16.3-1.fc39 04/01/2014
  RIP: 0010:__lock_acquire+0x5d/0x1c70
  Code: 00 00 44 8b a4 24 a0 00 00 00 45 85 f6 0f 84 bb 06 00 00 8b 2d
  48 e2 95 01 45 89 c3 41 89 d2 45 89 c8 85 ed 0 0 <48> 81 3f 40 7a 76
  83 44 0f 44 d8 83 fe 01 0f 86 1b 03 00 00 31 d2
  RSP: 0018:ffffc90000b37490 EFLAGS: 00010002
  RAX: 0000000000000000 RBX: ffff888110021ec0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000005c0
  RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000200
  FS: 00007f2a1fa08740(0000) GS:ffff888157a00000(0000)
  knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0:
  0000000080050033
  CR2: 00000000000005c0 CR3: 000000011ac7c000 CR4: 0000000000750ef0
  PKRU: 55555554
  Call Trace:
   <TASK>
   ? __die+0x23/0x70
   ? page_fault_oops+0x180/0x490
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? exc_page_fault+0x70/0x230
   ? asm_exc_page_fault+0x26/0x30
   ? __lock_acquire+0x5d/0x1c70
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   lock_acquire+0xc0/0x2d0
   ? cifs_mark_open_handles_for_deleted_file+0x3a/0x100 [cifs]
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? kmem_cache_alloc+0x2d9/0x370
   _raw_spin_lock+0x34/0x80
   ? cifs_mark_open_handles_for_deleted_file+0x3a/0x100 [cifs]
   cifs_mark_open_handles_for_deleted_file+0x3a/0x100 [cifs]
   cifs_get_fattr+0x24c/0x940 [cifs]
   ? srso_alias_return_thunk+0x5/0xfbef5
   cifs_get_inode_info+0x96/0x120 [cifs]
   cifs_lookup+0x16e/0x800 [cifs]
   cifs_atomic_open+0xc7/0x5d0 [cifs]
   ? lookup_open.isra.0+0x3ce/0x5f0
   ? __pfx_cifs_atomic_open+0x10/0x10 [cifs]
   lookup_open.isra.0+0x3ce/0x5f0
   path_openat+0x42b/0xc30
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   do_filp_open+0xc4/0x170
   do_sys_openat2+0xab/0xe0
   __x64_sys_openat+0x57/0xa0
   do_syscall_64+0xc1/0x1e0
   entry_SYSCALL_64_after_hwframe+0x72/0x7a

Fixes: ffceb7640c ("smb: client: do not defer close open handles to deleted files")
Reviewed-by: Meetakshi Setiya <msetiya@microsoft.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-10 18:53:43 -05:00
Jens Axboe
fbe38120eb signalfd: convert to ->read_iter()
Rather than use the older style ->read() hook, use ->read_iter() so that
signalfd can support both O_NONBLOCK and IOCB_NOWAIT for non-blocking
read attempts.

Split the fd setup into two parts, so that signalfd can mark the file
mode with FMODE_NOWAIT before installing it into the process table.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-10 16:23:04 -06:00
Jens Axboe
40f45fe8eb userfaultfd: convert to ->read_iter()
Rather than use the older style ->read() hook, use ->read_iter() so that
userfaultfd can support both O_NONBLOCK and IOCB_NOWAIT for non-blocking
read attempts.

Split the fd setup into two parts, so that userfaultfd can mark the file
mode with FMODE_NOWAIT before installing it into the process table. With
that, we can also defer grabbing the mm until we know the rest will
succeed, as the fd isn't visible before then.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-10 16:23:04 -06:00
Jens Axboe
d949799041 timerfd: convert to ->read_iter()
Switch timerfd to using fops->read_iter(), so it can support not just
O_NONBLOCK but IOCB_NOWAIT as well. With the latter, users like io_uring
interact with timerfds a lot better, as they can be driven purely
by the poll trigger.

Manually get and install the required fd, so that FMODE_NOWAIT can be
set before the file is installed into the file table.

No functional changes intended in this patch, it's purely a straight
conversion to using the read iterator method.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-10 16:23:02 -06:00
Eric Van Hensbergen
6e45a30fe5
fs/9p: remove erroneous nlink init from legacy stat2inode
In 9p2000 legacy mode, stat2inode initializes nlink to 1,
which is redundant with what alloc_inode should have already set.
9p2000.u overrides this with extensions if present in the stat
structure, and 9p2000.L incorporates nlink into its stat structure.

At the very least this probably messes with directory nlink
accounting in legacy mode.

Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-04-09 23:53:00 +00:00
Kent Overstreet
9b31152fd7 bcachefs: btree_node_scan: Respect member.data_allowed
If a device wasn't used for btree nodes, no need to scan for them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-09 18:54:46 -04:00
Thorsten Blum
60b703c71f zonefs: Use str_plural() to fix Coccinelle warning
Fixes the following Coccinelle/coccicheck warning reported by
string_choices.cocci:

	opportunity for str_plural(zgroup->g_nr_zones)

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2024-04-10 07:23:47 +09:00
Qu Wenruo
1db7959aac btrfs: do not wait for short bulk allocation
[BUG]
There is a recent report that when memory pressure is high (including
cached pages), btrfs can spend most of its time on memory allocation in
btrfs_alloc_page_array() for compressed read/write.

[CAUSE]
For btrfs_alloc_page_array() we always go alloc_pages_bulk_array(), and
even if the bulk allocation failed (fell back to single page
allocation) we still retry but with extra memalloc_retry_wait().

If the bulk alloc only returned one page a time, we would spend a lot of
time on the retry wait.

The behavior was introduced in commit 395cb57e85 ("btrfs: wait between
incomplete batch memory allocations").

[FIX]
Although the commit mentioned that other filesystems do the wait, it's
not the case at least nowadays.

All the mainlined filesystems only call memalloc_retry_wait() if they
failed to allocate any page (not only for bulk allocation).
If there is any progress, they won't call memalloc_retry_wait() at all.

For example, xfs_buf_alloc_pages() would only call memalloc_retry_wait()
if there is no allocation progress at all, and the call is not for
metadata readahead.

So I don't believe we should call memalloc_retry_wait() unconditionally
for short allocation.

Call memalloc_retry_wait() if it fails to allocate any page for tree
block allocation (which goes with __GFP_NOFAIL and may not need the
special handling anyway), and reduce the latency for
btrfs_alloc_page_array().

Reported-by: Julian Taylor <julian.taylor@1und1.de>
Tested-by: Julian Taylor <julian.taylor@1und1.de>
Link: https://lore.kernel.org/all/8966c095-cbe7-4d22-9784-a647d1bf27c3@1und1.de/
Fixes: 395cb57e85 ("btrfs: wait between incomplete batch memory allocations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-09 23:20:32 +02:00
Naohiro Aota
073bda7a54 btrfs: zoned: add ASSERT and WARN for EXTENT_BUFFER_ZONED_ZEROOUT handling
Add an ASSERT to catch a faulty delayed reference item resulting from
prematurely cleared extent buffer.

Also, add a WARN to detect if we try to dirty a ZEROOUT buffer again, which
is suspicious as its update will be lost.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-09 23:20:29 +02:00
Naohiro Aota
6887938618 btrfs: zoned: do not flag ZEROOUT on non-dirty extent buffer
Btrfs clears the content of an extent buffer marked as
EXTENT_BUFFER_ZONED_ZEROOUT before the bio submission. This mechanism is
introduced to prevent a write hole of an extent buffer, which is once
allocated, marked dirty, but turns out unnecessary and cleaned up within
one transaction operation.

Currently, btrfs_clear_buffer_dirty() marks the extent buffer as
EXTENT_BUFFER_ZONED_ZEROOUT, and skips the entry function. If this call
happens while the buffer is under IO (with the WRITEBACK flag set,
without the DIRTY flag), we can add the ZEROOUT flag and clear the
buffer's content just before a bio submission. As a result:

1) it can lead to adding faulty delayed reference item which leads to a
   FS corrupted (EUCLEAN) error, and

2) it writes out cleared tree node on disk

The former issue is previously discussed in [1]. The corruption happens
when it runs a delayed reference update. So, on-disk data is safe.

[1] https://lore.kernel.org/linux-btrfs/3f4f2a0ff1a6c818050434288925bdcf3cd719e5.1709124777.git.naohiro.aota@wdc.com/

The latter one can reach on-disk data. But, as that node is already
processed by btrfs_clear_buffer_dirty(), that will be invalidated in the
next transaction commit anyway. So, the chance of hitting the corruption
is relatively small.

Anyway, we should skip flagging ZEROOUT on a non-DIRTY extent buffer, to
keep the content under IO intact.

Fixes: aa6313e6ff ("btrfs: zoned: don't clear dirty flag of extent buffer")
CC: stable@vger.kernel.org # 6.8
Link: https://lore.kernel.org/linux-btrfs/oadvdekkturysgfgi4qzuemd57zudeasynswurjxw3ocdfsef6@sjyufeugh63f/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-09 23:20:28 +02:00
Stefan Berger
1f65e57dc5 fs: Rename SB_I_EVM_UNSUPPORTED to SB_I_EVM_HMAC_UNSUPPORTED
Now that EVM supports RSA signatures for previously completely
unsupported filesystems rename the flag SB_I_EVM_UNSUPPORTED to
SB_I_EVM_HMAC_UNSUPPORTED to reflect that only HMAC is not supported.

Suggested-by: Amir Goldstein <amir73il@gmail.com>
Suggested-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2024-04-09 17:14:58 -04:00
Stefan Berger
3253804773 security: allow finer granularity in permitting copy-up of security xattrs
Copying up xattrs is solely based on the security xattr name. For finer
granularity add a dentry parameter to the security_inode_copy_up_xattr
hook definition, allowing decisions to be based on the xattr content as
well.

Co-developed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com> (LSM,SELinux)
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2024-04-09 17:14:57 -04:00
Alexander Aring
92d59adfaf dlm: do message processing in softirq context
Move dlm message processing from an ordered workqueue context to an
ordered softirq context.  Handling dlm messages in softirq will allow
requests to be cleared more quickly and efficiently, and should avoid
longer queues of incomplete requests.  Later patches are expected to
run completion/blocking callbacks directly from this message processing
context, further reducing context switches required to complete a request.
In the longer term, concurrent message processing could be implemented.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 11:47:51 -05:00
Alexander Aring
578acf9a87 dlm: use spin_lock_bh for message processing
Use spin_lock_bh for all spinlocks involved in message processing,
in preparation for softirq message processing.  DLM lock requests
from user space involve dlm processing in user context, in addition
to the standard kernel context, necessitating bh variants.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 11:45:23 -05:00
Alexander Aring
308533b4b1 dlm: remove schedule in receive path
Remove an explicit schedule() call in the message processing path,
in preparation for softirq message processing.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 11:44:49 -05:00
Alexander Aring
d52c9b8fef dlm: convert ls_recv_active from rw_semaphore to rwlock
Convert ls_recv_active rw_semaphore to an rwlock to avoid
sleeping, in preparation for softirq message processing.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 11:44:49 -05:00
Alexander Aring
c288745f1d dlm: avoid blocking receive at the end of recovery
The end of the recovery process transitioned to normal message
processing by temporarily blocking the receiving context,
processing saved messages, then unblocking the receiving
context.  To avoid blocking the receiving context, the old
wait_queue and mutex are replaced by a new rwlock and the new
RECV_MSG_BLOCKED flag.  Received messages are added to the
list of saved messages, protected by the rwlock, until the
flag is cleared, which happens when all saved messages have
been processed.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 11:44:49 -05:00
Alexander Aring
cc396e2355 dlm: convert res_lock to spinlock
Convert the rsb struct res_lock from a mutex to a spinlock
in preparation for processing messages in softirq context.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 11:44:49 -05:00
Alexander Aring
097691dbad dlm: convert ls_waiters_mutex to spinlock
Convert the waiters mutex to a spinlock in prepration for
processing messages in softirq context.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 11:44:49 -05:00
Alexander Aring
6b52ea7916 dlm: drop mutex use in waiters recovery
The waiters_mutex no longer needs to be used in the waiters recovery
functions dlm_recover_waiters_pre() and dlm_recover_waiters_pre().
During recovery, ordinary locking operations are paused, and the
recovery thread is the only context accessing the waiters list,
so the lock is not needed.

Access to the waiters list from debugfs functions is avoided by
taking the top level recovery lock in the debugfs dump function.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 11:44:49 -05:00
Alexander Aring
3ae6776056 dlm: add new struct to save position in dlm_copy_master_names
Add a new struct to save the current position in the rsb masters_list
while sending the rsb names to other nodes. The rsb names are sent in
multiple chunks, and for each new chunk, the new "dlm_dir_dump" struct
saves the last position in the masters_list. The new struct is also
used to save more information to sanity check the recovery process.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 11:44:49 -05:00
Alexander Aring
3a747f4a2e dlm: move rsb root_list to ls_recover() stack
Move the rsb root_list from the lockspace to a stack variable since
it is now only used by the ls_recover() function.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 11:44:49 -05:00
Alexander Aring
aff46e0f24 dlm: use a new list for recovery of master rsb names
Add a new "masters_list" for master rsb structs, with a new
rwlock. The new list is created and used during the recovery
process to send the master rsb names to new nodes. With this
change, the current "root_list" can be used without locking.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 11:44:49 -05:00
Alexander Aring
29e345f3c6 dlm: move root_list functionality to recover.c
Move dlm_create_root_list() and dlm_release_root_list() to
recover.c and declare them static because they are only used
there.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 11:44:49 -05:00
Alexander Aring
455597a55f dlm: switch to GFP_ATOMIC in dlm allocations
Replace GFP_NOFS with GFP_ATOMIC.  Also stop using idr_preload which
uses a non-bh spin_lock.  This is further preparation for softirq
message processing.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 11:43:49 -05:00
Andreas Gruenbacher
acf1f42faf gfs2: Fix "Make glock lru list scanning safer"
Commit 228804a35c tried to add a refcount check to
gfs2_scan_glock_lru() to make sure that glocks that are still referenced
cannot be freed.  It failed to account for the bias state_change() adds
to the refcount for held glocks, so held glocks are no longer removed
from the glock cache, which can lead to out-of-memory problems.  Fix
that.  (The inodes those glocks are associated with do get shrunk and do
get pushed out of memory.)

In addition, use the same eligibility check in gfs2_scan_glock_lru() and
gfs2_dispose_glock_lru().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:58 +02:00
Andreas Gruenbacher
c9a0a4b028 Revert "gfs2: fix glock shrinker ref issues"
This reverts commit 62862485a4.

Commit 62862485a4 tried to fix issues introduced by commit
228804a35c ("gfs2: Make glock lru list scanning safer"), but like that
commit, it failed to account for the bias state_change() adds to the
glock reference count for locked glocks.  Revert commit 62862485a4 so
that we can fix commit 228804a35c properly.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:58 +02:00
Andreas Gruenbacher
5d92311119 gfs2: Fix "ignore unlock failures after withdraw"
Commit 3e11e53041 tries to suppress dlm_lock() lock conversion errors
that occur when the lockspace has already been released.

It does that by setting and checking the SDF_SKIP_DLM_UNLOCK flag.  This
conflicts with the intended meaning of the SDF_SKIP_DLM_UNLOCK flag, so
check whether the lockspace is still allocated instead.

(Given the current DLM API, checking for this kind of error after the
fact seems easier that than to make sure that the lockspace is still
allocated before calling dlm_lock().  Changing the DLM API so that users
maintain the lockspace references themselves would be an option.)

Fixes: 3e11e53041 ("GFS2: ignore unlock failures after withdraw")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:58 +02:00
Andreas Gruenbacher
262ee3a07e gfs2: Get rid of unnecessary test_and_set_bit
The GLF_LOCK flag is protected by the gl->gl_lockref.lock spin lock
which is held when entering run_queue(), so we can use test_bit() and
set_bit() here.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:58 +02:00
Andreas Gruenbacher
927cfc90d2 gfs2: Don't set GLF_LOCK in gfs2_dispose_glock_lru
In gfs2_dispose_glock_lru(), we want to skip glocks which are in the
process of transitioning state (as indicated by the set GLF_LOCK flag),
but we we don't need to set that flag for requesting a state transition.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:57 +02:00
Andreas Gruenbacher
ee2be7d7c7 gfs2: Replace gfs2_glock_queue_put with gfs2_glock_put_async
Function gfs2_glock_queue_put() puts a glock reference by enqueuing
glock work instead of putting the reference directly.  This ensures that
the operation won't sleep, but it is costly and really only necessary
when putting the final glock reference.  Replace it with a new
gfs2_glock_put_async() function that only queues glock work when putting
the last glock reference.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:57 +02:00
Andreas Gruenbacher
f80d882edc gfs2: Get rid of gfs2_glock_queue_put in signal_our_withdraw
In function signal_our_withdraw(), we are calling gfs2_glock_queue_put()
in a context in which we are actually allowed to sleep, so replace that
with a simple call to gfs2_glock_put().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:57 +02:00
Andreas Gruenbacher
35264909e9 gfs2: Fix NULL pointer dereference in gfs2_log_flush
In gfs2_jindex_free(), set sdp->sd_jdesc to NULL under the log flush
lock to provide exclusion against gfs2_log_flush().

In gfs2_log_flush(), check if sdp->sd_jdesc is non-NULL before
dereferencing it.  Otherwise, we could run into a NULL pointer
dereference when outstanding glock work races with an unmount
(glock_work_func -> run_queue -> do_xmote -> inode_go_sync ->
gfs2_log_flush).

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:57 +02:00
Andreas Gruenbacher
b01189333e gfs2: Don't forget to complete delayed withdraw
Commit fffe9bee14 ("gfs2: Delay withdraw from atomic context")
switched from gfs2_withdraw() to gfs2_withdraw_delayed() in
gfs2_ail_error(), but failed to then check if a delayed withdraw had
occurred.  Fix that by adding the missing check in __gfs2_ail_flush(),
where the spin locks are already dropped and a withdraw is possible.

Fixes: fffe9bee14 ("gfs2: Delay withdraw from atomic context")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:57 +02:00
Andreas Gruenbacher
3592bfaf74 gfs2: Use [NO_]CREATE consistently for gfs2_glock_get
When calling gfs2_glock_get(), consistently pass the CREATE or NO_CREATE
flag for the create argument.  This is the only place that passes 0
instead.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:57 +02:00
Andreas Gruenbacher
52c2d389fe gfs2: Follow-up to flag rename in sysfs status file
Follow up to commit e7beb8b6de ("gfs2: Rename SDF_DEACTIVATING to
SDF_KILL") and change the description of the SDF_KILL flag from
"Deactivating" to "Killing".

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:57 +02:00
Andreas Gruenbacher
795405c4b9 gfs2: Remove unnecessary gfs2_meta_check_ii argument
The type argument of gfs2_meta_check_ii() is always set to "magic
number", so remove that argument and hardcode the string in
gfs2_meta_check_ii().  Change the string to "bad magic number" to
emphasize that the problem is the incorrect magic number.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:57 +02:00
Andreas Gruenbacher
b204b1b61e gfs2: Get rid of newlines in log messages
Get rid of attempts to create multi-line syslog entries; this only makes
the messages harder to read.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:57 +02:00
Andrew Price
10398ef57a gfs2: Improve gfs2_consist_inode() usage
gfs2_consist_inode() logs an error message with the source file and line
number. When we jump before calling it, the line number becomes less
useful as it no longer relates to the source of the error. To aid
troubleshooting, replace the gotos with the gfs2_consist_inode() calls
so that the error messages are more informative.

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-04-09 18:35:57 +02:00
Alexander Aring
98808644b9 dlm: remove allocation parameter in msg allocation
Remove the context parameter for message allocations and
always use GFP_ATOMIC. This prepares for softirq message
processing.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09 09:58:14 -05:00
Masami Hiramatsu
c722cea208 fs/proc: Skip bootloader comment if no embedded kernel parameters
If the "bootconfig" kernel command-line argument was specified or if
the kernel was built with CONFIG_BOOT_CONFIG_FORCE, but if there are
no embedded kernel parameter, omit the "# Parameters from bootloader:"
comment from the /proc/bootconfig file.  This will cause automation
to fall back to the /proc/cmdline file, which will be identical to the
comment in this no-embedded-kernel-parameters case.

Link: https://lore.kernel.org/all/20240409044358.1156477-2-paulmck@kernel.org/

Fixes: 8b8ce6c75430 ("fs/proc: remove redundant comments from /proc/bootconfig")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-04-09 23:36:18 +09:00
Zhenhua Huang
fbbdc255fb fs/proc: remove redundant comments from /proc/bootconfig
commit 717c7c894d ("fs/proc: Add boot loader arguments as comment to
/proc/bootconfig") adds bootloader argument comments into /proc/bootconfig.

/proc/bootconfig shows boot_command_line[] multiple times following
every xbc key value pair, that's duplicated and not necessary.
Remove redundant ones.

Output before and after the fix is like:
key1 = value1
*bootloader argument comments*
key2 = value2
*bootloader argument comments*
key3 = value3
*bootloader argument comments*
...

key1 = value1
key2 = value2
key3 = value3
*bootloader argument comments*
...

Link: https://lore.kernel.org/all/20240409044358.1156477-1-paulmck@kernel.org/

Fixes: 717c7c894d ("fs/proc: Add boot loader arguments as comment to /proc/bootconfig")
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: <linux-trace-kernel@vger.kernel.org>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-04-09 23:31:54 +09:00
linke li
8bfb40be31
fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
Currently, the __d_clear_type_and_inode() writes the value flags to
dentry->d_flags, then immediately re-reads it in order to use it in a if
statement. This re-read is useless because no other update to
dentry->d_flags can occur at this point.

This commit therefore re-use flags in the if statement instead of
re-reading dentry->d_flags.

Signed-off-by: linke li <lilinke99@qq.com>
Link: https://lore.kernel.org/r/tencent_5E187BD0A61BA28605E85405F15228254D0A@qq.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-09 12:29:03 +02:00
Günther Noack
abe6acfa7d
fs: Return ENOTTY directly if FS_IOC_GETUUID or FS_IOC_GETFSSYSFSPATH fail
These IOCTL commands should be implemented by setting attributes on the
superblock, rather than in the IOCTL hooks in struct file_operations.

By returning -ENOTTY instead of -ENOIOCTLCMD, we instruct the fs/ioctl.c
logic to return -ENOTTY immediately, rather than attempting to call
f_op->unlocked_ioctl() or f_op->compat_ioctl() as a fallback.

Why this is safe:

Before this change, fs/ioctl.c would unsuccessfully attempt calling the
IOCTL hooks, and then return -ENOTTY.  By returning -ENOTTY directly, we
return the same error code immediately, but save ourselves the fallback
attempt.

Motivation:

This simplifies the logic for these IOCTL commands and lets us reason about
the side effects of these IOCTLs more easily.  It will be possible to
permit these IOCTLs under LSM IOCTL policies, without having to worry about
them getting dispatched to problematic device drivers (which sometimes do
work before looking at the IOCTL command number).

Link: https://lore.kernel.org/all/cnwpkeovzbumhprco7q2c2y6zxzmxfpwpwe3tyy6c3gg2szgqd@vfzjaw5v5imr/
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Günther Noack <gnoack@google.com>
Link: https://lore.kernel.org/r/20240405214040.101396-2-gnoack@google.com
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-09 12:03:49 +02:00
Matthew Wilcox (Oracle)
886b94d25a
fs: Add FOP_HUGE_PAGES
Instead of checking for specific file_operations, add a bit to
file_operations which denotes a file that only contain hugetlb pages.
This lets us make hugetlbfs_file_operations static, and removes
is_file_shm_hugepages() completely.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240407201122.3783877-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-09 10:53:44 +02:00
Justin Stitt
629171657a
orangefs: cleanup uses of strncpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

There is some care taken to ensure these destination buffers are
NUL-terminated by bounding the strncpy()'s by ORANGEFS_NAME_MAX - 1 or
ORANGEFS_MAX_SERVER_ADDR_LEN - 1. Instead, we can use the new 2-argument
version of strscpy() to guarantee NUL-termination on the destination
buffers while simplifying the code.

Based on usage with printf-likes, we can see these buffers are expected
to be NUL-terminated:
|	gossip_debug(GOSSIP_NAME_DEBUG,
|			"%s: doing lookup on %s under %pU,%d\n",
|			__func__,
|			new_op->upcall.req.lookup.d_name,
|			&new_op->upcall.req.lookup.parent_refn.khandle,
|			new_op->upcall.req.lookup.parent_refn.fs_id);
...
|	gossip_debug(GOSSIP_SUPER_DEBUG,
|			"Attempting ORANGEFS Remount via host %s\n",
|			new_op->upcall.req.fs_mount.orangefs_config_server);

NUL-padding isn't required for any of these destination buffers as
they've all been zero-allocated with op_alloc() or kzalloc().

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240322-strncpy-fs-orangefs-dcache-c-v1-1-15d12debbf38@google.com
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-09 10:52:13 +02:00
Arnd Bergmann
c473bcdd80
nilfs2: fix out-of-range warning
clang-14 points out that v_size is always smaller than a 64KB
page size if that is configured by the CPU architecture:

fs/nilfs2/ioctl.c:63:19: error: result of comparison of constant 65536 with expression of type '__u16' (aka 'unsigned short') is always false [-Werror,-Wtautological-constant-out-of-range-compare]
        if (argv->v_size > PAGE_SIZE)
            ~~~~~~~~~~~~ ^ ~~~~~~~~~

This is ok, so just shut up that warning with a cast.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240328143051.1069575-7-arnd@kernel.org
Fixes: 3358b4aaa8 ("nilfs2: fix problems of memory allocation in ioctl")
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-09 10:52:12 +02:00
Kent Overstreet
5ab4beb759 bcachefs: Don't scan for btree nodes when we can reconstruct
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-09 00:53:14 -04:00
Kent Overstreet
359571c327 bcachefs: Fix check_topology() when using node scan
shoot down journal keys _before_ populating journal keys with pointers
to scanned nodes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-09 00:04:57 -04:00
Kent Overstreet
9c432404b9 bcachefs: fix eytzinger0_find_gt()
- fix return types: promoting from unsigned to ssize_t does not do what
  we want here, and was pointless since the rest of the eytzinger code
  is u32
- nr, not size

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-08 22:56:37 -04:00
Linus Torvalds
20cb38a7af for-6.9-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYQIdIACgkQxWXV+ddt
 WDvjmw/+KahIHfFt17cM5uZpiETcL9v44uT0Y69r0bMpw8Vy/cmE+rmGfyERr8YN
 v68U/hpWHD2mYhxL01EHut2X/MRA4zmAcWUKVu1vk0d/9Vp/01wPJfKyvX6q388/
 dFtPtzqXxj0uIwO5lRIk+dJuvShtfCps2rx/zcBUoaQYljIDNfhrWscfV4nIzqlR
 BF7GX3b22rlw8q1dXAXWW+zTk3tey8Jxj+jmShyoPxcGMDK4jmNyaFu1WSIFfSdc
 ns5Kii7/4tIBqpqPCr/FMGXQjdEZGw9ZTiAO4nUjtyoCTO3l/jMVYoo7llJR9dtv
 Fgtej0MLlAapX2mJ65xOBO6OvCIM8VwrY+DfIDeWxtDONmrGxBUIMTJIjSq3oGEi
 Mh0CbnpISGj9zQlR4raOavtgxmbdXnhdvLcp2Uv+VcJnEyCtHMmVLx9yNMKqjHje
 oJHtuJiEeqlB66xZEYx3qA8SIdaJGhB/HluU9Vyg67AJTJUcCzuxZlqaC+oSOxfj
 GYgY66BHD+ZKRKUFw7EylohnhvsMcmFhMSeBLzMuSaqEig4dmv4cFenad06up6c+
 c0obH8oKsaA05gS3sMshmkNtBm8ms1OP2rWebjQWmmXhCOWLPqcGs5AxYeqvRdzx
 eqFNKhRw+JH1mFmhEtY/Y+4OX6eTlluSxoKxZYWfAX1xvlr94U4=
 =XtPw
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Several fixes to qgroups that have been recently identified by test
  generic/475:

   - fix prealloc reserve leak in subvolume operations

   - various other fixes in reservation setup, conversion or cleanup"

* tag 'for-6.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: always clear PERTRANS metadata during commit
  btrfs: make btrfs_clear_delalloc_extent() free delalloc reserve
  btrfs: qgroup: convert PREALLOC to PERTRANS after record_root_in_trans
  btrfs: record delayed inode root in transaction
  btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations
  btrfs: qgroup: correctly model root qgroup rsv in convert
2024-04-08 13:11:11 -07:00
Kent Overstreet
b897b148ee bcachefs: fix bch2_get_acl() transaction restart handling
bch2_acl_from_disk() uses allocate_dropping_locks, and can thus return
a transaction restart - this wasn't handled.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-07 17:15:53 -04:00
Hongbo Li
09e913f582 bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list
When allocating bkey_cached from bc->freed_pcpu list, it missed
decreasing the count of nr_freed_pcpu which would cause the mismatch
between the value of nr_freed_pcpu and the list items. This problem
also exists in moving new bkey_cached to bc->freed_pcpu list.
If these happened, the bug info may appear in
bch2_fs_btree_key_cache_exit by the follow code:

   BUG_ON(list_count_nodes(&bc->freed_pcpu) != bc->nr_freed_pcpu);
   BUG_ON(list_count_nodes(&bc->freed_nonpcpu) != bc->nr_freed_nonpcpu);

Fixes: c65c13f0ea ("bcachefs: Run btree key cache shrinker less aggressively")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-07 13:40:35 -04:00
Christian Brauner
210a03c9d5
fs: claw back a few FMODE_* bits
There's a bunch of flags that are purely based on what the file
operations support while also never being conditionally set or unset.
IOW, they're not subject to change for individual files. Imho, such
flags don't need to live in f_mode they might as well live in the fops
structs itself. And the fops struct already has that lonely
mmap_supported_flags member. We might as well turn that into a generic
fop_flags member and move a few flags from FMODE_* space into FOP_*
space. That gets us four FMODE_* bits back and the ability for new
static flags that are about file ops to not have to live in FMODE_*
space but in their own FOP_* space. It's not the most beautiful thing
ever but it gets the job done. Yes, there'll be an additional pointer
chase but hopefully that won't matter for these flags.

I suspect there's a few more we can move into there and that we can also
redirect a bunch of new flag suggestions that follow this pattern into
the fop_flags field instead of f_mode.

Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-07 13:49:02 +02:00
Kent Overstreet
30e615a2ce bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()
Multiple bug fixes for journal iters:

 - When the journal keys gap buffer is resized, we have to adjust the
   iterators for moving the gap to the end
 - We don't want to rewind iterators to point to the key we just
   inserted if it's not for the correct btree/level

Also, add some new assertions.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-07 02:22:28 -04:00
Thorsten Blum
2d793e9315 bcachefs: Rename struct field swap to prevent macro naming collision
The struct field swap can collide with the swap() macro defined in
linux/minmax.h. Rename the struct field to prevent such collisions.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-06 17:39:12 -04:00
Kent Overstreet
6088234ce8 bcachefs: JOURNAL_SPACE_LOW
"bcachefs; Fix deadlock in bch2_btree_update_start()" was a significant
performance regression (nearly 50%) on multithreaded random writes with
fio.

The reason is that the journal watermark checks multiple things,
including the state of the btree write buffer, and on multithreaded
update heavy workloads we're bottleneked on write buffer flushing - we
don't want kicknig off btree updates to depend on the state of the write
buffer.

This isn't strictly correct; the interior btree update path does do
write buffer updates, but it's a tiny fraction of total accounting
updates and we're more concerned with space in the journal itself.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-06 13:50:26 -04:00
Kent Overstreet
05801b6526 bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE
BCH_IOCTL_FSCK_OFFLINE allows the userspace fsck tool to use the kernel
implementation of fsck - primarily when the kernel version is a better
version match.

It should look and act exactly like the normal userspace fsck that the
user expected to be invoking, so errors should never result in a kernel
panic.

We may want to consider further restricting errors=panic - it's only
intended for debugging in controlled test environments, it should have
no purpose it normal usage.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-06 13:50:25 -04:00
Kent Overstreet
374b3d38fe bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems
To open an encrypted filesystem, we use request_key() to get the
encryption key from the user's keyring - but request_key() needs to
happen in the context of the process that invoked the ioctl.

This easily fixed by using bch2_fs_open() in nostart mode.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-06 13:50:22 -04:00
Linus Torvalds
f2f80ac809 nfsd-6.9 fixes:
- Address a slow memory leak with RPC-over-TCP
 - Prevent another NFS4ERR_DELAY loop during CREATE_SESSION
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmYReWEACgkQM2qzM29m
 f5fsfxAAhVkcd5Om9iBI7/Ib2QtJdeyn9+Q6hOJi9ITDPpdbSrd1Fmd8ufyKNuxH
 dwGLyV0+ELbUl1RRNfdnl+TkzYHMTURuvDEgUyhYA28GOJVd9GWXwX2KZR7J+AP5
 HtpSGLXt+XvuO7uB+SFS85wwF0DJL39Qy4jCVYCOuN2Z8zqfTg5TwstOQ8X794QN
 b5JzLkUlxQfd6kGRvU+BZHNf7R/yBfjUQWVybyhqzdjnCbbnPH+cl0hTlEIQTYJH
 G31Gty1J/RGt1ZeURuF4OG4lFocRJW/SqoruneweBAOksN9PVcwsoMf6m16l3+AD
 ZMnBt7FInQc/mAqRqIoLTsmYT8OyDa3a6qjubqWCYicCXvj1FxxOd7IaYytXxv/2
 Z8ZvKSSvyXRwM3mUt+3E5DTM8NnsxPxnO9iSGIMUeH7n96LU0X39b/Ll6in6+eu2
 /go8cLe59uuYDF9n2srX/LLWHj5wAWxVi+OgiSsAbsDFYTtJXK+syT2CpsEFXiUZ
 5AYUbfGVqQ8uNtfGaaJd71CNCuEKC5qYpeC5cS2nnruV6SArfG69DMRAO0pxJYAC
 6X7gm9Se1zyI8r9gR0rKjJ5ojeTPQBLfk6oVavum6CCwHzkKQTLG2jHBq8cdpwoL
 KxXc37fhW9m9c2B3g2dikclM2+XrMyUzJ5Ync9SSiwFJN/956I0=
 =dGcu
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Address a slow memory leak with RPC-over-TCP

 - Prevent another NFS4ERR_DELAY loop during CREATE_SESSION

* tag 'nfsd-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: hold a lighter-weight client reference over CB_RECALL_ANY
  SUNRPC: Fix a slow server-side memory leak with RPC-over-TCP
2024-04-06 09:37:50 -07:00
Linus Torvalds
9520c192e8 Bug fixes for 6.9-rc3:
* Allow creating new links to special files which were not associated with a
    project quota.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZgwRrQAKCRAH7y4RirJu
 9OtyAP4m8cXLi+fjRslGLNhQQXzZHIcpaPiWZ9Ec41Y3uzZNBQD/doS6P4aGcH0m
 taYQ+nyzuavEZiOEg+d65OoUIrDZzg4=
 =bgjU
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Chandan Babu:

 - Allow creating new links to special files which were not associated
   with a project quota

* tag 'xfs-6.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: allow cross-linking special files without project quota
2024-04-06 09:14:18 -07:00
Linus Torvalds
119c289409 17 cifs.ko changesets, most also for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmYQirsACgkQiiy9cAdy
 T1EkRgv+NsMoL0tLx6Ve8wNtJNN6aEFRIXfpIIJn4dzEl6xK5UEiDNM08m8Y2ryg
 GzV4t7Ba1+2kYcKgBF0ANNLC1605XvvWScZNLpco5LggFq/06YLPuKSB4ygQAJpr
 +fvdEWeaDuzKbbJRraB1EAsJCr/4vYRM54q/cfy94uo6l3J1EnWdR467q1fkn5WQ
 ixM8FXUrkFxxOsrlbYoCSRZsgpQukpzTSqlm8QVQ01B7tG4qLwk/GmhqNmdf+1xs
 Y9RNPy1mc+tcvL2UL+Iagz5gipPwqvs+6L/jqw04UFwsS4F9w6mT5rCgevRYST0S
 qhz2WHXYCOHqr+wdrYNegtJ35d6F/XjrUKK54sNBEm/W2stoeukgB4EsIMGLeSE5
 NJtTWNch5B342sq1xUqJ4lL9QwI3MGZSsL4mOUctMJ0xH4l42gQeRa5wecOpSU+C
 Tka6JLJ9+UPVAFAaDvm27xji3K6myPns6JIT2ZLnjlxIsSq4ITUCkOEtghoDQqel
 LZOQZAq9
 =eZXs
 -----END PGP SIGNATURE-----

Merge tag '6.9-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - fix to retry close to avoid potential handle leaks when server
   returns EBUSY

 - DFS fixes including a fix for potential use after free

 - fscache fix

 - minor strncpy cleanup

 - reconnect race fix

 - deal with various possible UAF race conditions tearing sessions down

* tag '6.9-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: fix potential UAF in cifs_signal_cifsd_for_reconnect()
  smb: client: fix potential UAF in smb2_is_network_name_deleted()
  smb: client: fix potential UAF in is_valid_oplock_break()
  smb: client: fix potential UAF in smb2_is_valid_oplock_break()
  smb: client: fix potential UAF in smb2_is_valid_lease_break()
  smb: client: fix potential UAF in cifs_stats_proc_show()
  smb: client: fix potential UAF in cifs_stats_proc_write()
  smb: client: fix potential UAF in cifs_dump_full_key()
  smb: client: fix potential UAF in cifs_debug_files_proc_show()
  smb3: retrying on failed server close
  smb: client: serialise cifs_construct_tcon() with cifs_mount_mutex
  smb: client: handle DFS tcons in cifs_construct_tcon()
  smb: client: refresh referral without acquiring refpath_lock
  smb: client: guarantee refcounted children from parent session
  cifs: Fix caching to try to do open O_WRONLY as rdwr on server
  smb: client: fix UAF in smb2_reconnect_server()
  smb: client: replace deprecated strncpy with strscpy
2024-04-06 09:06:17 -07:00
Kent Overstreet
cf979fca9a bcachefs: fix rand_delete unit test
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-05 16:21:18 -04:00
Dan Carpenter
a6c4162d84 bcachefs: fix ! vs ~ typo in __clear_bit_le64()
The ! was obviously intended to be ~.  As it is, this function does
the equivalent to: "addr[bit / 64] = 0;".

Fixes: 27fcec6c27 ("bcachefs: Clear recovery_passes_required as they complete without errors")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-05 14:42:37 -04:00
Jeff Layton
10396f4df8 nfsd: hold a lighter-weight client reference over CB_RECALL_ANY
Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the
client. While a callback job is technically an RPC that counter is
really more for client-driven RPCs, and this has the effect of
preventing the client from being unhashed until the callback completes.

If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we
can end up in a situation where the callback can't complete on the (now
dead) callback channel, but the new client can't connect because the old
client can't be unhashed. This usually manifests as a NFS4ERR_DELAY
return on the CREATE_SESSION operation.

The job is only holding a reference to the client so it can clear a flag
after the RPC completes. Fix this by having CB_RECALL_ANY instead hold a
reference to the cl_nfsdfs.cl_ref. Typically we only take that sort of
reference when dealing with the nfsdfs info files, but it should work
appropriately here to ensure that the nfs4_client doesn't disappear.

Fixes: 44df6f439a ("NFSD: add delegation reaper to react to low memory condition")
Reported-by: Vladimir Benes <vbenes@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-05 14:05:35 -04:00
Linus Torvalds
405ac6a572 3 ksmbd changesets, all also for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmYPU0MACgkQiiy9cAdy
 T1HmrQv/cw5YUwg3O1Ai4GWDJ07hJAZoYP6IdZGJGZSx2TMEJKh0gxAnUTk/Idnv
 TvK80d9GOOrPTUH7ToDYGnFOUH3T1/chx/R/t2OMs9w1Rc02Kq4+XZhv/1HnflEm
 NbXvYG0ZhZOV331GwydqATpq+IjzNkdf1rzb2Agy1YZGpd2uNU1cgb30FGvqIHAQ
 SyYxX7v9uBLBrBU/IPUxHyUKabiLwz7nFZYDLtGNu88oIZaDbTen/lJju5+Dt0Uz
 x5lL6h3kwwqHNa/1BFe84/h/EKBBUM9ha2VL7lZP3S2imBcspBeY8N+Vb8+z18PS
 7KexqIq9tfkTw2FRne4gqcjxF5fSA4n9hXldFf0t+kz35tosU8akpSpsPkwv0cmT
 dbH1u6vS18WilzqIjCDCNfP/e8/G4HwH0DuAIWbWG/IPasDJLeHJ9fiswzuWPJbG
 Nblqu98I6kHiOwTyOuDHcrrqRUBK9AkspgU3bS3PYh5PwYXMCd8+wHkybm6LhJPH
 dpc2BKHG
 =l/a7
 -----END PGP SIGNATURE-----

Merge tag '6.9-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:
 "Three fixes, all also for stable:

   - encryption fix

   - memory overrun fix

   - oplock break fix"

* tag '6.9-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: do not set SMB2_GLOBAL_CAP_ENCRYPTION for SMB 3.1.1
  ksmbd: validate payload size in ipc response
  ksmbd: don't send oplock break if rename fails
2024-04-05 10:02:09 -07:00
Linus Torvalds
fae0268777 vfs-6.9-rc3.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZg/C8wAKCRCRxhvAZXjc
 oljxAQCneq62ginESgeQLw88fzSBTV4C50xXUA+Qz18AEgA/fgD+J3DlWquEHhMM
 tJmfs3aUn9w7+wDpukcsLjJfJEiSYA8=
 =f2Z6
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "This contains a few small fixes. This comes with some delay because I
  wanted to wait on people running their reproducers and the Easter
  Holidays meant that those replies came in a little later than usual:

   - Fix handling of preventing writes to mounted block devices.

     Since last kernel we allow to prevent writing to mounted block
     devices provided CONFIG_BLK_DEV_WRITE_MOUNTED isn't set and the
     block device is opened with restricted writes. When we switched to
     opening block devices as files we altered the mechanism by which we
     recognize when a block device has been opened with write
     restrictions.

     The detection logic assumed that only read-write mounted
     filesystems would apply write restrictions to their block devices
     from other openers. That of course is not true since it also makes
     sense to apply write restrictions for filesystems that are
     read-only.

     Fix the detection logic using an FMODE_* bit. We still have a few
     left since we freed up a couple a while ago. I also picked up a
     patch to free up four additional FMODE_* bits scheduled for the
     next merge window.

   - Fix counting the number of writers to a block device. This just
     changes the logic to be consistent.

   - Fix a bug in aio causing a NULL pointer derefernce after we
     implemented batched processing in aio.

   - Finally, add the changes we discussed that allows to yield block
     devices early even though file closing itself is deferred.

     This also allows us to remove two holder operations to get and
     release the holder to align lifetime of file and holder of the
     block device"

* tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  aio: Fix null ptr deref in aio_complete() wakeup
  fs,block: yield devices early
  block: count BLK_OPEN_RESTRICT_WRITES openers
  block: handle BLK_OPEN_RESTRICT_WRITES correctly
2024-04-05 09:47:26 -07:00
Gustavo A. R. Silva
68d6f4f3fb
fs: Annotate struct file_handle with __counted_by() and use struct_size()
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

While there, use struct_size() helper, instead of the open-coded
version.

[brauner@kernel.org: contains a fix by Edward for an OOB access]
Reported-by: syzbot+4139435cb1b34cf759c2@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Link: https://lore.kernel.org/r/tencent_A7845DD769577306D813742365E976E3A205@qq.com
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/ZgImCXTdGDTeBvSS@neat
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05 15:53:47 +02:00
Kefeng Wang
16594e60cd
fs: aio: convert to ring_folios and internal_folios
Since aio use folios in most functions, convert ring/internal_pages
to ring/internal_folios, let's directly use folio instead of page
throughout aio to remove hidden calls to compound_head(), eg,
flush_dcache_page().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Link: https://lore.kernel.org/r/20240321131640.948634-4-wangkefeng.wang@huawei.com
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05 15:53:47 +02:00
Kefeng Wang
5ea0a35511
fs: aio: use a folio in aio_free_ring()
Use a folio throughout aio_free_ring() to remove calls to compound_head(),
also move pr_debug after folio check to remove unnecessary print.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Link: https://lore.kernel.org/r/20240321131640.948634-3-wangkefeng.wang@huawei.com
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05 15:53:46 +02:00
Kefeng Wang
75a07b557a
fs: aio: use a folio in aio_setup_ring()
Use a folio throughout aio_setup_ring() to remove calls to compound_head(),
also use folio_end_read() to simultaneously mark the folio uptodate and
unlock it.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Link: https://lore.kernel.org/r/20240321131640.948634-2-wangkefeng.wang@huawei.com
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05 15:53:46 +02:00
Brian Kubisiak
85a6a1aff0
ecryptfs: Fix buffer size for tag 66 packet
The 'TAG 66 Packet Format' description is missing the cipher code and
checksum fields that are packed into the message packet. As a result,
the buffer allocated for the packet is 3 bytes too small and
write_tag_66_packet() will write up to 3 bytes past the end of the
buffer.

Fix this by increasing the size of the allocation so the whole packet
will always fit in the buffer.

This fixes the below kasan slab-out-of-bounds bug:

  BUG: KASAN: slab-out-of-bounds in ecryptfs_generate_key_packet_set+0x7d6/0xde0
  Write of size 1 at addr ffff88800afbb2a5 by task touch/181

  CPU: 0 PID: 181 Comm: touch Not tainted 6.6.13-gnu #1 4c9534092be820851bb687b82d1f92a426598dc6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2/GNU Guix 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x4c/0x70
   print_report+0xc5/0x610
   ? ecryptfs_generate_key_packet_set+0x7d6/0xde0
   ? kasan_complete_mode_report_info+0x44/0x210
   ? ecryptfs_generate_key_packet_set+0x7d6/0xde0
   kasan_report+0xc2/0x110
   ? ecryptfs_generate_key_packet_set+0x7d6/0xde0
   __asan_store1+0x62/0x80
   ecryptfs_generate_key_packet_set+0x7d6/0xde0
   ? __pfx_ecryptfs_generate_key_packet_set+0x10/0x10
   ? __alloc_pages+0x2e2/0x540
   ? __pfx_ovl_open+0x10/0x10 [overlay 30837f11141636a8e1793533a02e6e2e885dad1d]
   ? dentry_open+0x8f/0xd0
   ecryptfs_write_metadata+0x30a/0x550
   ? __pfx_ecryptfs_write_metadata+0x10/0x10
   ? ecryptfs_get_lower_file+0x6b/0x190
   ecryptfs_initialize_file+0x77/0x150
   ecryptfs_create+0x1c2/0x2f0
   path_openat+0x17cf/0x1ba0
   ? __pfx_path_openat+0x10/0x10
   do_filp_open+0x15e/0x290
   ? __pfx_do_filp_open+0x10/0x10
   ? __kasan_check_write+0x18/0x30
   ? _raw_spin_lock+0x86/0xf0
   ? __pfx__raw_spin_lock+0x10/0x10
   ? __kasan_check_write+0x18/0x30
   ? alloc_fd+0xf4/0x330
   do_sys_openat2+0x122/0x160
   ? __pfx_do_sys_openat2+0x10/0x10
   __x64_sys_openat+0xef/0x170
   ? __pfx___x64_sys_openat+0x10/0x10
   do_syscall_64+0x60/0xd0
   entry_SYSCALL_64_after_hwframe+0x6e/0xd8
  RIP: 0033:0x7f00a703fd67
  Code: 25 00 00 41 00 3d 00 00 41 00 74 37 64 8b 04 25 18 00 00 00 85 c0 75 5b 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 85 00 00 00 48 83 c4 68 5d 41 5c c3 0f 1f
  RSP: 002b:00007ffc088e30b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
  RAX: ffffffffffffffda RBX: 00007ffc088e3368 RCX: 00007f00a703fd67
  RDX: 0000000000000941 RSI: 00007ffc088e48d7 RDI: 00000000ffffff9c
  RBP: 00007ffc088e48d7 R08: 0000000000000001 R09: 0000000000000000
  R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
  R13: 0000000000000000 R14: 00007ffc088e48d7 R15: 00007f00a7180040
   </TASK>

  Allocated by task 181:
   kasan_save_stack+0x2f/0x60
   kasan_set_track+0x29/0x40
   kasan_save_alloc_info+0x25/0x40
   __kasan_kmalloc+0xc5/0xd0
   __kmalloc+0x66/0x160
   ecryptfs_generate_key_packet_set+0x6d2/0xde0
   ecryptfs_write_metadata+0x30a/0x550
   ecryptfs_initialize_file+0x77/0x150
   ecryptfs_create+0x1c2/0x2f0
   path_openat+0x17cf/0x1ba0
   do_filp_open+0x15e/0x290
   do_sys_openat2+0x122/0x160
   __x64_sys_openat+0xef/0x170
   do_syscall_64+0x60/0xd0
   entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Fixes: dddfa461fc ("[PATCH] eCryptfs: Public key; packet management")
Signed-off-by: Brian Kubisiak <brian@kubisiak.com>
Link: https://lore.kernel.org/r/5j2q56p6qkhezva6b2yuqfrsurmvrrqtxxzrnp3wqu7xrz22i7@hoecdztoplbl
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05 15:53:46 +02:00
Kemeng Shi
6a1ee87176
fs/writeback: remove unnecessary return in writeback_inodes_sb
writeback_inodes_sb doesn't have return value, just remove unnecessary
return in it.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240228091958.288260-7-shikemeng@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05 15:53:45 +02:00
Kemeng Shi
ba679de9c3
fs/writeback: correct comment of __wakeup_flusher_threads_bdi
Commit e8e8a0c6c9 ("writeback: move nr_pages == 0 logic to one
location") removed parameter nr_pages of __wakeup_flusher_threads_bdi
and we try to writeback all dirty pages in __wakeup_flusher_threads_bdi
now. Just correct stale comment.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240228091958.288260-6-shikemeng@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05 15:53:45 +02:00
Kemeng Shi
639924abc1
fs/writeback: only calculate dirtied_before when b_io is empty
The dirtied_before is only used when b_io is not empty, so only calculate
when b_io is not empty.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240228091958.288260-5-shikemeng@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05 15:53:45 +02:00
Kemeng Shi
2ddc934612
fs/writeback: remove unused parameter wb of finish_writeback_work
Remove unused parameter wb of finish_writeback_work.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240228091958.288260-4-shikemeng@huaweicloud.com
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05 15:53:45 +02:00
Kemeng Shi
d92109891f
fs/writeback: bail out if there is no more inodes for IO and queued once
For case there is no more inodes for IO in io list from last wb_writeback,
We may bail out early even there is inode in dirty list should be written
back. Only bail out when we queued once to avoid missing dirtied inode.

This is from code reading...

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240228091958.288260-3-shikemeng@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
[brauner@kernel.org: fold in memory corruption fix from Jan in [1]]
Link: https://lore.kernel.org/r/20240405132346.bid7gibby3lxxhez@quack3 [1]
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05 15:52:17 +02:00
Kent Overstreet
caeb4b0a11
aio: Fix null ptr deref in aio_complete() wakeup
list_del_init_careful() needs to be the last access to the wait queue
entry - it effectively unlocks access.

Previously, finish_wait() would see the empty list head and skip taking
the lock, and then we'd return - but the completion path would still
attempt to do the wakeup after the task_struct pointer had been
overwritten.

Fixes: 71eb6b6b0b ("fs/aio: obey min_nr when doing wakeups")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-fsdevel/CAHTA-ubfwwB51A5Wg5M6H_rPEQK9pNf8FkAGH=vr=FEkyRrtqw@mail.gmail.com/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/stable/20240331215212.522544-1-kent.overstreet%40linux.dev
Link: https://lore.kernel.org/r/20240331215212.522544-1-kent.overstreet@linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05 11:20:28 +02:00
Kent Overstreet
5957e0a28b bcachefs: Fix rebalance from durability=0 device
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-05 03:05:30 -04:00
Jakub Kicinski
cf1ca1f66d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/ipv4/ip_gre.c
  17af420545 ("erspan: make sure erspan_base_hdr is present in skb->head")
  5832c4a77d ("ip_tunnel: convert __be16 tunnel flags to bitmaps")
https://lore.kernel.org/all/20240402103253.3b54a1cf@canb.auug.org.au/

Adjacent changes:

net/ipv6/ip6_fib.c
  d21d40605b ("ipv6: Fix infinite recursion in fib6_dump_done().")
  5fc68320c1 ("ipv6: remove RTNL protection from inet6_dump_fib()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 18:01:07 -07:00
Kuniyuki Iwashima
24457f1be2 nfs: Handle error of rpc_proc_register() in nfs_net_init().
syzkaller reported a warning [0] triggered while destroying immature
netns.

rpc_proc_register() was called in init_nfs_fs(), but its error
has been ignored since at least the initial commit 1da177e4c3
("Linux-2.6.12-rc2").

Recently, commit d47151b79e ("nfs: expose /proc/net/sunrpc/nfs
in net namespaces") converted the procfs to per-netns and made
the problem more visible.

Even when rpc_proc_register() fails, nfs_net_init() could succeed,
and thus nfs_net_exit() will be called while destroying the netns.

Then, remove_proc_entry() will be called for non-existing proc
directory and trigger the warning below.

Let's handle the error of rpc_proc_register() properly in nfs_net_init().

[0]:
name 'nfs'
WARNING: CPU: 1 PID: 1710 at fs/proc/generic.c:711 remove_proc_entry+0x1bb/0x2d0 fs/proc/generic.c:711
Modules linked in:
CPU: 1 PID: 1710 Comm: syz-executor.2 Not tainted 6.8.0-12822-gcd51db110a7e #12
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:remove_proc_entry+0x1bb/0x2d0 fs/proc/generic.c:711
Code: 41 5d 41 5e c3 e8 85 09 b5 ff 48 c7 c7 88 58 64 86 e8 09 0e 71 02 e8 74 09 b5 ff 4c 89 e6 48 c7 c7 de 1b 80 84 e8 c5 ad 97 ff <0f> 0b eb b1 e8 5c 09 b5 ff 48 c7 c7 88 58 64 86 e8 e0 0d 71 02 eb
RSP: 0018:ffffc9000c6d7ce0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8880422b8b00 RCX: ffffffff8110503c
RDX: ffff888030652f00 RSI: ffffffff81105045 RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: ffffffff81bb62cb R12: ffffffff84807ffc
R13: ffff88804ad6fcc0 R14: ffffffff84807ffc R15: ffffffff85741ff8
FS:  00007f30cfba8640(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff51afe8000 CR3: 000000005a60a005 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 rpc_proc_unregister+0x64/0x70 net/sunrpc/stats.c:310
 nfs_net_exit+0x1c/0x30 fs/nfs/inode.c:2438
 ops_exit_list+0x62/0xb0 net/core/net_namespace.c:170
 setup_net+0x46c/0x660 net/core/net_namespace.c:372
 copy_net_ns+0x244/0x590 net/core/net_namespace.c:505
 create_new_namespaces+0x2ed/0x770 kernel/nsproxy.c:110
 unshare_nsproxy_namespaces+0xae/0x160 kernel/nsproxy.c:228
 ksys_unshare+0x342/0x760 kernel/fork.c:3322
 __do_sys_unshare kernel/fork.c:3393 [inline]
 __se_sys_unshare kernel/fork.c:3391 [inline]
 __x64_sys_unshare+0x1f/0x30 kernel/fork.c:3391
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x46/0x4e
RIP: 0033:0x7f30d0febe5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007f30cfba7cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00000000004bbf80 RCX: 00007f30d0febe5d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000006c020600
RBP: 00000000004bbf80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
R13: 000000000000000b R14: 00007f30d104c530 R15: 0000000000000000
 </TASK>

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-04-04 18:27:13 -04:00
Linus Torvalds
ec25bd8d98 bcachefs repair code for 6.9-rc3
A couple more small fixes, and new repair code.
 
 We can now automatically recover from arbitrary corrupted interior btree
 nodes by scanning, and we can reconstruct metadata as needed to bring a
 filesystem back into a working, consistent, read-write state and
 preserve access to whatevver wasn't corrupted.
 
 Meaning - you can blow away all metadata except for extents and dirents
 leaf nodes, and repair will reconstruct everything else and give you
 your data, and under the correct paths. If inodes are missing i_size
 will be slightly off and permissions/ownership/timestamps will be gone,
 and we do still need the snapshots btree if snapshots were in use - in
 the future we'll be able to guess the snapshot tree structure in some
 situations.
 
 IOW - aside from shaking out remaining bugs (fuzz testing is still
 coming), repair code should be complete and if repair ever doesn't work
 that's the highest priority bug that I want to know about immediately.
 
 This patchset was kindly tested by a user from India who accidentally
 wiped one drive out of a three drive filesystem with no replication on
 the family computer - it took a couple weeks but we got everything
 important back.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmYNq9IACgkQE6szbY3K
 bnaG9w/+Od0iq4Nqx62Mf8+O5DLnZZNu3c9aUOEiuzdXlNrpUr4S9j4WwDxTb/EN
 2a3ldXY5AhauZqEW7Qv+WBZvVVbm3GYH+oOYQo8V+yf1oGNB3+AGxBCCmruHJGLk
 5nmwsRyVm1ihAKxn1oxwrDDPtOlxbGOlc4peR+nCY/b5QnlXegGkGfRAHO/z9bul
 4JdBYBqR4KBGdevIV8EG2WVa6ASA6mF1QOboeB6INekD4klDpm41gK/0S9Uf2oXm
 q1PiN655YHquXbJTT9k/HtVX4WhlcaHv+R4KeZ5TEReJjB57ot/M8Rx57lgsYHP6
 TeyR4Y5VYGLYqlwMK5RiKyGLB92qNFcSlg5inASyTCUNi1KKu12SpqS3+Nel6+tF
 gu4F4ElSvAcsmJ6LrfsfP9B8u0ULDkIyq9xBFFbLTIpLuDOqz8FcgFpZrpiO445w
 F6FcYXqt2/fP7gxA3GzdFjeUojIjWNMJapgpsePg/HGNArBsoAsBL8rAhAyetG3Z
 EOJlrJ8m59/QoPgXBpScfbS7cxk3JgrUzfSI/oKaEr2lS0YNlYjQANYHoEHTFaxA
 bMWKXwMkvqz49MMm5WLaMIOYDJRDtrt0qpnW7x+qU7ik/VkHeUTJr07bSRIKT0z1
 yNCynYtdbeQVfekZQS6JwsyTs/ehbI1OVN8MGwVRCrQTonYz+BA=
 =7/rR
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-04-03' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs repair code from Kent Overstreet:
 "A couple more small fixes, and new repair code.

  We can now automatically recover from arbitrary corrupted interior
  btree nodes by scanning, and we can reconstruct metadata as needed to
  bring a filesystem back into a working, consistent, read-write state
  and preserve access to whatevver wasn't corrupted.

  Meaning - you can blow away all metadata except for extents and
  dirents leaf nodes, and repair will reconstruct everything else and
  give you your data, and under the correct paths. If inodes are missing
  i_size will be slightly off and permissions/ownership/timestamps will
  be gone, and we do still need the snapshots btree if snapshots were in
  use - in the future we'll be able to guess the snapshot tree structure
  in some situations.

  IOW - aside from shaking out remaining bugs (fuzz testing is still
  coming), repair code should be complete and if repair ever doesn't
  work that's the highest priority bug that I want to know about
  immediately.

  This patchset was kindly tested by a user from India who accidentally
  wiped one drive out of a three drive filesystem with no replication on
  the family computer - it took a couple weeks but we got everything
  important back"

* tag 'bcachefs-2024-04-03' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: reconstruct_inode()
  bcachefs: Subvolume reconstruction
  bcachefs: Check for extents that point to same space
  bcachefs: Reconstruct missing snapshot nodes
  bcachefs: Flag btrees with missing data
  bcachefs: Topology repair now uses nodes found by scanning to fill holes
  bcachefs: Repair pass for scanning for btree nodes
  bcachefs: Don't skip fake btree roots in fsck
  bcachefs: bch2_btree_root_alloc() -> bch2_btree_root_alloc_fake()
  bcachefs: Etyzinger cleanups
  bcachefs: bch2_shoot_down_journal_keys()
  bcachefs: Clear recovery_passes_required as they complete without errors
  bcachefs: ratelimit informational fsck errors
  bcachefs: Check for bad needs_discard before doing discard
  bcachefs: Improve bch2_btree_update_to_text()
  mean_and_variance: Drop always failing tests
  bcachefs: fix nocow lock deadlock
  bcachefs: BCH_WATERMARK_interior_updates
  bcachefs: Fix btree node reserve
2024-04-04 14:36:32 -07:00
Kent Overstreet
9802ff48f3 bcachefs: Print shutdown journal sequence number
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-04 16:56:44 -04:00
Kent Overstreet
d880a43836 bcachefs: Further improve btree_update_to_text()
Print start and end level of the btree update; also a bit of cleanup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-04 16:56:44 -04:00
Kent Overstreet
9fb3036fe3 bcachefs: Move btree_updates to debugfs
sysfs is limited to PAGE_SIZE, and when we're debugging strange
deadlocks/priority inversions we need to see the full list.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-04 16:56:44 -04:00
Kent Overstreet
be42e4a621 bcachefs: Bump limit in btree_trans_too_many_iters()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-04 16:53:12 -04:00
Kent Overstreet
01e5f4fc0f bcachefs: Make snapshot_is_ancestor() safe
Snapshot table accesses generally need to be checking for invalid
snapshot ID now, fix one that was missed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-04 16:52:42 -04:00
Thomas Bertschinger
e60aa47240 bcachefs: create debugfs dir for each btree
This creates a subdirectory for each individual btree under the btrees/
debugfs directory.

Directory structure, before:

/sys/kernel/debug/bcachefs/$FS_ID/btrees/
├── alloc
├── alloc-bfloat-failed
├── alloc-formats
├── backpointers
├── backpointers-bfloat-failed
├── backpointers-formats
...

Directory structure, after:

/sys/kernel/debug/bcachefs/$FS_ID/btrees/
├── alloc
│   ├── bfloat-failed
│   ├── formats
│   └── keys
├── backpointers
│   ├── bfloat-failed
│   ├── formats
│   └── keys
...

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 20:32:10 -04:00
Paulo Alcantara
e0e50401cc smb: client: fix potential UAF in cifs_signal_cifsd_for_reconnect()
Skip sessions that are being teared down (status == SES_EXITING) to
avoid UAF.

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-03 14:45:15 -05:00
Paulo Alcantara
63981561ff smb: client: fix potential UAF in smb2_is_network_name_deleted()
Skip sessions that are being teared down (status == SES_EXITING) to
avoid UAF.

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-03 14:45:06 -05:00
Paulo Alcantara
69ccf040ac smb: client: fix potential UAF in is_valid_oplock_break()
Skip sessions that are being teared down (status == SES_EXITING) to
avoid UAF.

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-03 14:44:42 -05:00
Paulo Alcantara
22863485a4 smb: client: fix potential UAF in smb2_is_valid_oplock_break()
Skip sessions that are being teared down (status == SES_EXITING) to
avoid UAF.

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-03 14:43:45 -05:00
Paulo Alcantara
705c76fbf7 smb: client: fix potential UAF in smb2_is_valid_lease_break()
Skip sessions that are being teared down (status == SES_EXITING) to
avoid UAF.

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-03 14:43:45 -05:00
Paulo Alcantara
0865ffefea smb: client: fix potential UAF in cifs_stats_proc_show()
Skip sessions that are being teared down (status == SES_EXITING) to
avoid UAF.

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-03 14:43:45 -05:00
Paulo Alcantara
d3da25c5ac smb: client: fix potential UAF in cifs_stats_proc_write()
Skip sessions that are being teared down (status == SES_EXITING) to
avoid UAF.

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-03 14:43:45 -05:00
Paulo Alcantara
58acd1f497 smb: client: fix potential UAF in cifs_dump_full_key()
Skip sessions that are being teared down (status == SES_EXITING) to
avoid UAF.

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-03 14:43:45 -05:00
Paulo Alcantara
ca545b7f08 smb: client: fix potential UAF in cifs_debug_files_proc_show()
Skip sessions that are being teared down (status == SES_EXITING) to
avoid UAF.

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-03 14:43:34 -05:00
Ritvik Budhiraja
173217bd73 smb3: retrying on failed server close
In the current implementation, CIFS close sends a close to the
server and does not check for the success of the server close.
This patch adds functionality to check for server close return
status and retries in case of an EBUSY or EAGAIN error.

This can help avoid handle leaks

Cc: stable@vger.kernel.org
Signed-off-by: Ritvik Budhiraja <rbudhiraja@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-03 14:43:24 -05:00
Kent Overstreet
09d4c2acbf bcachefs: reconstruct_inode()
If an inode is missing, but corresponding extents and dirent still
exist, it's well worth recreating it - this does so.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:46:51 -04:00
Kent Overstreet
cc0532900b bcachefs: Subvolume reconstruction
We can now recreate missing subvolumes from dirents and/or inodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:46:51 -04:00
Kent Overstreet
4c02e63dad bcachefs: Check for extents that point to same space
In backpointer repair, if we get a missing backpointer - but there's
already a backpointer that points to an existing extent - we've got
multiple extents that point to the same space and need to decide which
to keep.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:46:51 -04:00
Kent Overstreet
a292be3b68 bcachefs: Reconstruct missing snapshot nodes
When the snapshots btree is going, we'll have to delete huge amounts of
data - unless we can reconstruct it by looking at the keys that refer to
it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:46:51 -04:00
Kent Overstreet
55936afe11 bcachefs: Flag btrees with missing data
We need this to know when we should attempt to reconstruct the snapshots
btree

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:46:51 -04:00
Kent Overstreet
43f5ea4646 bcachefs: Topology repair now uses nodes found by scanning to fill holes
With the new btree node scan code, we can now recover from corrupt btree
roots - simply create a new fake root at depth 1, and then insert all
the leaves we found.

If the root wasn't corrupt but there's corruption elsewhere in the
btree, we can fill in holes as needed with the newest version of a given
node(s) from the scan; we also check if a given btree node is older than
what we found from the scan.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:45:30 -04:00
Kent Overstreet
4409b8081d bcachefs: Repair pass for scanning for btree nodes
If a btree root or interior btree node goes bad, we're going to lose a
lot of data, unless we can recover the nodes that it pointed to by
scanning.

Fortunately btree node headers are fully self describing, and
additionally the magic number is xored with the filesytem UUID, so we
can do so safely.

This implements the scanning - next patch will rework topology repair to
make use of the found nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:44:18 -04:00
Kent Overstreet
b268aa4e7f bcachefs: Don't skip fake btree roots in fsck
When a btree root is unreadable, we might still have keys fro the
journal to walk and mark.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:44:18 -04:00
Kent Overstreet
f2f61f4192 bcachefs: bch2_btree_root_alloc() -> bch2_btree_root_alloc_fake()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:44:18 -04:00
Kent Overstreet
ca1e02f7e9 bcachefs: Etyzinger cleanups
Pull out eytzinger.c and kill eytzinger_cmp_fn. We now provide
eytzinger0_sort and eytzinger0_sort_r, which use the standard cmp_func_t
and cmp_r_func_t callbacks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:44:18 -04:00
Kent Overstreet
bdbf953b3c bcachefs: bch2_shoot_down_journal_keys()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:44:18 -04:00
Kent Overstreet
27fcec6c27 bcachefs: Clear recovery_passes_required as they complete without errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:44:18 -04:00
Linus Torvalds
c85af715ca vboxsf fixes for v6.9-1
Highlights:
 - Compiler warning fixes
 - Explicitly deny setlease attempts
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEEuvA7XScYQRpenhd+kuxHeUQDJ9wFAmYNemEUHGhkZWdvZWRl
 QHJlZGhhdC5jb20ACgkQkuxHeUQDJ9wc1Af/fqfUUusaYW408D3PukjUaOVF+0wo
 6wluwCxy/DEMBxIQbGACwYoQuULHkgyK5chcEZvdB56vullqePCwOKeJUeKs75MR
 HzG9NLs2qIN9WJ6cSHTQlBzvVIK7WV64BDtauD8FH3Afa5c5ojr1JqEAxebnlonI
 cmFUm5x1TlMQryXcY8rPU9sdeaowlNiE/g7qRNqRfsjCGz2zWJdtjskf8YjOY5yB
 KqulZnye04dEb6Wp8fGuNWauUAJ6gTwSJxlcPU0oHv+fRaYebnqTZZaJrg5kKF4a
 SF4llaPM3d714udHOZP3Ro2K+SRoj5jUNSfO7jxNNk6DZ4xB47iXqNJ/Sw==
 =HDgT
 -----END PGP SIGNATURE-----

Merge tag 'vboxsf-v6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux

Pull vboxsf fixes from Hans de Goede:

 - Compiler warning fixes

 - Explicitly deny setlease attempts

* tag 'vboxsf-v6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux:
  vboxsf: explicitly deny setlease attempts
  vboxsf: Remove usage of the deprecated ida_simple_xx() API
  vboxsf: Avoid an spurious warning if load_nls_xxx() fails
  vboxsf: remove redundant variable out_len
2024-04-03 10:30:52 -07:00
Roberto Sassu
701b38995e security: Place security_path_post_mknod() where the original IMA call was
Commit 08abce60d6 ("security: Introduce path_post_mknod hook")
introduced security_path_post_mknod(), to replace the IMA-specific call
to ima_post_path_mknod().

For symmetry with security_path_mknod(), security_path_post_mknod() was
called after a successful mknod operation, for any file type, rather
than only for regular files at the time there was the IMA call.

However, as reported by VFS maintainers, successful mknod operation does
not mean that the dentry always has an inode attached to it (for
example, not for FIFOs on a SAMBA mount).

If that condition happens, the kernel crashes when
security_path_post_mknod() attempts to verify if the inode associated to
the dentry is private.

Move security_path_post_mknod() where the ima_post_path_mknod() call was,
which is obviously correct from IMA/EVM perspective. IMA/EVM are the only
in-kernel users, and only need to inspect regular files.

Reported-by: Steve French <smfrench@gmail.com>
Closes: https://lore.kernel.org/linux-kernel/CAH2r5msAVzxCUHHG8VKrMPUKQHmBpE6K9_vjhgDa1uAvwx4ppw@mail.gmail.com/
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 08abce60d6 ("security: Introduce path_post_mknod hook")
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-04-03 10:21:32 -07:00
Jeff Layton
1ece2c43b8 vboxsf: explicitly deny setlease attempts
vboxsf does not break leases on its own, so it can't properly handle the
case where the hypervisor changes the data. Don't allow file leases on
vboxsf.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240319-setlease-v1-1-5997d67e04b3@kernel.org
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2024-04-03 16:06:39 +02:00
Christophe JAILLET
0141d68f86 vboxsf: Remove usage of the deprecated ida_simple_xx() API
ida_alloc() and ida_free() should be preferred to the deprecated
ida_simple_get() and ida_simple_remove().

This is less verbose.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/b3c057c86b73f0309a6362031d21f4d7ebb60587.1698835730.git.christophe.jaillet@wanadoo.fr
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2024-04-03 16:06:11 +02:00
Christophe JAILLET
de3f64b738 vboxsf: Avoid an spurious warning if load_nls_xxx() fails
If an load_nls_xxx() function fails a few lines above, the 'sbi->bdi_id' is
still 0.
So, in the error handling path, we will call ida_simple_remove(..., 0)
which is not allocated yet.

In order to prevent a spurious "ida_free called for id=0 which is not
allocated." message, tweak the error handling path and add a new label.

Fixes: 0fd1695766 ("fs: Add VirtualBox guest shared folder (vboxsf) support")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/d09eaaa4e2e08206c58a1a27ca9b3e81dc168773.1698835730.git.christophe.jaillet@wanadoo.fr
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2024-04-03 16:05:51 +02:00
Colin Ian King
0200ceed30 vboxsf: remove redundant variable out_len
The variable out_len is being used to accumulate the number of
bytes but it is not being used for any other purpose. The variable
is redundant and can be removed.

Cleans up clang scan build warning:
fs/vboxsf/utils.c:443:9: warning: variable 'out_len' set but not
used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20240229225138.351909-1-colin.i.king@gmail.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2024-04-03 15:55:33 +02:00
Kent Overstreet
fa14b50460 bcachefs: ratelimit informational fsck errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-02 20:24:00 -04:00
Kent Overstreet
7ee88737ab bcachefs: Check for bad needs_discard before doing discard
In the discard worker, we were failing to validate the bucket state -
meaning a corrupt needs_discard btree could cause us to discard a bucket
that we shouldn't.

If check_alloc_info hasn't run yet we just want to bail out, otherwise
it's a filesystem inconsistent error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-02 20:24:00 -04:00
Kent Overstreet
e0319af2b6 bcachefs: Improve bch2_btree_update_to_text()
Print out the mode as a string, and also print out the btree and
watermark.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-02 17:13:46 -04:00
Kunwu Chan
ca0dcef7cf dlm: Simplify the allocation of slab caches in dlm_lowcomms_msg_cache_create
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Acked-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-02 15:24:40 -05:00
Guenter Roeck
97ca7c1f93 mean_and_variance: Drop always failing tests
mean_and_variance_test_2 and mean_and_variance_test_4 always fail.
The input parameters to those tests are identical to the input parameters
to tests 1 and 3, yet the expected result for tests 2 and 4 is different
for the mean and stddev tests. That will always fail.

     Expected mean_and_variance_get_mean(mv) == mean[i], but
        mean_and_variance_get_mean(mv) == 22 (0x16)
        mean[i] == 10 (0xa)

Drop the bad tests.

Fixes: 65bc410907 ("mean and variance: More tests")
Closes: https://lore.kernel.org/lkml/065b94eb-6a24-4248-b7d7-d3212efb4787@roeck-us.net/
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-02 14:45:08 -04:00
Boris Burkov
6e68de0bb0 btrfs: always clear PERTRANS metadata during commit
It is possible to clear a root's IN_TRANS tag from the radix tree, but
not clear its PERTRANS, if there is some error in between. Eliminate
that possibility by moving the free up to where we clear the tag.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:19:13 +02:00
Boris Burkov
3c6f0c5ecc btrfs: make btrfs_clear_delalloc_extent() free delalloc reserve
Currently, this call site in btrfs_clear_delalloc_extent() only converts
the reservation. We are marking it not delalloc, so I don't think it
makes sense to keep the rsv around.  This is a path where we are not
sure to join a transaction, so it leads to incorrect free-ing during
umount.

Helps with the pass rate of generic/269 and generic/475.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:19:08 +02:00
Boris Burkov
211de93367 btrfs: qgroup: convert PREALLOC to PERTRANS after record_root_in_trans
The transaction is only able to free PERTRANS reservations for a root
once that root has been recorded with the TRANS tag on the roots radix
tree. Therefore, until we are sure that this root will get tagged, it
isn't safe to convert. Generally, this is not an issue as *some*
transaction will likely tag the root before long and this reservation
will get freed in that transaction, but technically it could stick
around until unmount and result in a warning about leaked metadata
reservation space.

This path is most exercised by running the generic/269 fstest with
CONFIG_BTRFS_DEBUG.

Fixes: a649684967 ("btrfs: fix start transaction qgroup rsv double free")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:18:47 +02:00
Boris Burkov
71537e35c3 btrfs: record delayed inode root in transaction
When running delayed inode updates, we do not record the inode's root in
the transaction, but we do allocate PREALLOC and thus converted PERTRANS
space for it. To be sure we free that PERTRANS meta rsv, we must ensure
that we record the root in the transaction.

Fixes: 4f5427ccce ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:18:33 +02:00
Boris Burkov
74e9795812 btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations
Create subvolume, create snapshot and delete subvolume all use
btrfs_subvolume_reserve_metadata() to reserve metadata for the changes
done to the parent subvolume's fs tree, which cannot be mediated in the
normal way via start_transaction. When quota groups (squota or qgroups)
are enabled, this reserves qgroup metadata of type PREALLOC. Once the
operation is associated to a transaction, we convert PREALLOC to
PERTRANS, which gets cleared in bulk at the end of the transaction.

However, the error paths of these three operations were not implementing
this lifecycle correctly. They unconditionally converted the PREALLOC to
PERTRANS in a generic cleanup step regardless of errors or whether the
operation was fully associated to a transaction or not. This resulted in
error paths occasionally converting this rsv to PERTRANS without calling
record_root_in_trans successfully, which meant that unless that root got
recorded in the transaction by some other thread, the end of the
transaction would not free that root's PERTRANS, leaking it. Ultimately,
this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount
for the leaked reservation.

The fix is to ensure that every qgroup PREALLOC reservation observes the
following properties:

1. any failure before record_root_in_trans is called successfully
   results in freeing the PREALLOC reservation.
2. after record_root_in_trans, we convert to PERTRANS, and now the
   transaction owns freeing the reservation.

This patch enforces those properties on the three operations. Without
it, generic/269 with squotas enabled at mkfs time would fail in ~5-10
runs on my system. With this patch, it ran successfully 1000 times in a
row.

Fixes: e85fde5162 ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:18:23 +02:00
Boris Burkov
141fb8cd20 btrfs: qgroup: correctly model root qgroup rsv in convert
We use add_root_meta_rsv and sub_root_meta_rsv to track prealloc and
pertrans reservations for subvolumes when quotas are enabled. The
convert function does not properly increment pertrans after decrementing
prealloc, so the count is not accurate.

Note: we check that the fs is not read-only to mirror the logic in
qgroup_convert_meta, which checks that before adding to the pertrans rsv.

Fixes: 8287475a20 ("btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:18:04 +02:00
Paulo Alcantara
93cee45ccf smb: client: serialise cifs_construct_tcon() with cifs_mount_mutex
Serialise cifs_construct_tcon() with cifs_mount_mutex to handle
parallel mounts that may end up reusing the session and tcon created
by it.

Cc: stable@vger.kernel.org # 6.4+
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 10:12:22 -05:00
Paulo Alcantara
4a5ba0e0bf smb: client: handle DFS tcons in cifs_construct_tcon()
The tcons created by cifs_construct_tcon() on multiuser mounts must
also be able to failover and refresh DFS referrals, so set the
appropriate fields in order to get a full DFS tcon.  They could be
shared among different superblocks later, too.

Cc: stable@vger.kernel.org # 6.4+
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404021518.3Xu2VU4s-lkp@intel.com/
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 10:12:13 -05:00
Paulo Alcantara
0a05ad21d7 smb: client: refresh referral without acquiring refpath_lock
Avoid refreshing DFS referral with refpath_lock acquired as the I/O
could block for a while due to a potentially disconnected or slow DFS
root server and then making other threads - that use same @server and
don't require a DFS root server - unable to make any progress.

Cc: stable@vger.kernel.org # 6.4+
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 10:11:43 -05:00
Paulo Alcantara
062a7f0ff4 smb: client: guarantee refcounted children from parent session
Avoid potential use-after-free bugs when walking DFS referrals,
mounting and performing DFS failover by ensuring that all children
from parent @tcon->ses are also refcounted.  They're all needed across
the entire DFS mount.  Get rid of @tcon->dfs_ses_list while we're at
it, too.

Cc: stable@vger.kernel.org # 6.4+
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404021527.ZlRkIxgv-lkp@intel.com/
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 10:09:57 -05:00
David Howells
e9e62243a3 cifs: Fix caching to try to do open O_WRONLY as rdwr on server
When we're engaged in local caching of a cifs filesystem, we cannot perform
caching of a partially written cache granule unless we can read the rest of
the granule.  This can result in unexpected access errors being reported to
the user.

Fix this by the following: if a file is opened O_WRONLY locally, but the
mount was given the "-o fsc" flag, try first opening the remote file with
GENERIC_READ|GENERIC_WRITE and if that returns -EACCES, try dropping the
GENERIC_READ and doing the open again.  If that last succeeds, invalidate
the cache for that file as for O_DIRECT.

Fixes: 70431bfd82 ("cifs: Support fscache indexing rewrite")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 09:29:55 -05:00
Namjae Jeon
5ed11af19e ksmbd: do not set SMB2_GLOBAL_CAP_ENCRYPTION for SMB 3.1.1
SMB2_GLOBAL_CAP_ENCRYPTION flag should be used only for 3.0 and
3.0.2 dialects. This flags set cause compatibility problems with
other SMB clients.

Reported-by: James Christopher Adduono <jc@adduono.com>
Tested-by: James Christopher Adduono <jc@adduono.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 09:21:25 -05:00
Namjae Jeon
a677ebd8ca ksmbd: validate payload size in ipc response
If installing malicious ksmbd-tools, ksmbd.mountd can return invalid ipc
response to ksmbd kernel server. ksmbd should validate payload size of
ipc response from ksmbd.mountd to avoid memory overrun or
slab-out-of-bounds. This patch validate 3 ipc response that has payload.

Cc: stable@vger.kernel.org
Reported-by: Chao Ma <machao2019@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 09:21:01 -05:00
Namjae Jeon
c1832f6703 ksmbd: don't send oplock break if rename fails
Don't send oplock break if rename fails. This patch fix
smb2.oplock.batch20 test.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 09:20:35 -05:00
Kent Overstreet
c42cd606e4 bcachefs: fix nocow lock deadlock
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-02 01:04:10 -04:00
Kent Overstreet
e2a316b3cc bcachefs: BCH_WATERMARK_interior_updates
This adds a new watermark, higher priority than BCH_WATERMARK_reclaim,
for interior btree updates. We've seen a deadlock where journal replay
triggers a ton of btree node merges, and these use up all available open
buckets and then interior updates get stuck.

One cause of this is that we're currently lacking btree node merging on
write buffer btrees - that needs to be fixed as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-01 21:14:02 -04:00
Kent Overstreet
ba947ecd39 bcachefs: Fix btree node reserve
Sign error when checking the watermark - oops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-01 21:14:02 -04:00
Paulo Alcantara
24a9799aa8 smb: client: fix UAF in smb2_reconnect_server()
The UAF bug is due to smb2_reconnect_server() accessing a session that
is already being teared down by another thread that is executing
__cifs_put_smb_ses().  This can happen when (a) the client has
connection to the server but no session or (b) another thread ends up
setting @ses->ses_status again to something different than
SES_EXITING.

To fix this, we need to make sure to unconditionally set
@ses->ses_status to SES_EXITING and prevent any other threads from
setting a new status while we're still tearing it down.

The following can be reproduced by adding some delay to right after
the ipc is freed in __cifs_put_smb_ses() - which will give
smb2_reconnect_server() worker a chance to run and then accessing
@ses->ipc:

kinit ...
mount.cifs //srv/share /mnt/1 -o sec=krb5,nohandlecache,echo_interval=10
[disconnect srv]
ls /mnt/1 &>/dev/null
sleep 30
kdestroy
[reconnect srv]
sleep 10
umount /mnt/1
...
CIFS: VFS: Verify user has a krb5 ticket and keyutils is installed
CIFS: VFS: \\srv Send error in SessSetup = -126
CIFS: VFS: Verify user has a krb5 ticket and keyutils is installed
CIFS: VFS: \\srv Send error in SessSetup = -126
general protection fault, probably for non-canonical address
0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP NOPTI
CPU: 3 PID: 50 Comm: kworker/3:1 Not tainted 6.9.0-rc2 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
04/01/2014
Workqueue: cifsiod smb2_reconnect_server [cifs]
RIP: 0010:__list_del_entry_valid_or_report+0x33/0xf0
Code: 4f 08 48 85 d2 74 42 48 85 c9 74 59 48 b8 00 01 00 00 00 00 ad
de 48 39 c2 74 61 48 b8 22 01 00 00 00 00 74 69 <48> 8b 01 48 39 f8 75
7b 48 8b 72 08 48 39 c6 0f 85 88 00 00 00 b8
RSP: 0018:ffffc900001bfd70 EFLAGS: 00010a83
RAX: dead000000000122 RBX: ffff88810da53838 RCX: 6b6b6b6b6b6b6b6b
RDX: 6b6b6b6b6b6b6b6b RSI: ffffffffc02f6878 RDI: ffff88810da53800
RBP: ffff88810da53800 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88810c064000
R13: 0000000000000001 R14: ffff88810c064000 R15: ffff8881039cc000
FS: 0000000000000000(0000) GS:ffff888157c00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe3728b1000 CR3: 000000010caa4000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? die_addr+0x36/0x90
 ? exc_general_protection+0x1c1/0x3f0
 ? asm_exc_general_protection+0x26/0x30
 ? __list_del_entry_valid_or_report+0x33/0xf0
 __cifs_put_smb_ses+0x1ae/0x500 [cifs]
 smb2_reconnect_server+0x4ed/0x710 [cifs]
 process_one_work+0x205/0x6b0
 worker_thread+0x191/0x360
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xe2/0x110
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x34/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-01 13:39:20 -05:00
Alexander Aring
2bec1bbd55 dlm: remove callback reference counting
Get rid of the unnecessary refcounting on callback structs.
Copy interesting callback info into the lkb struct rather
than maintaining pointers to callback structs from the lkb.
This goes back to the way things were done prior to
commit 61bed0baa4 ("fs: dlm: use a non-static queue for callbacks").

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01 13:31:13 -05:00
Alexander Aring
986ae3c2a8 dlm: fix race between final callback and remove
This patch fixes the following issue:

node 1 is dir
node 2 is master
node 3 is other

1->2: unlock
2: put final lkb, rsb moved to toss
2->1: unlock_reply
1: queue lkb callback with EUNLOCK
2->1: remove
1: receive_remove ignored (rsb on keep because of queued lkb callback)
1: complete lkb callback, put_lkb, move rsb to toss
3->1: lookup
1->3: lookup_reply master=2
3->2: request
2->3: request_reply EBADR

In summary:
An unexpected lkb reference causes the rsb to remain on the wrong list.
The rsb being on the wrong list causes receive_remove to be ignored.
An ignored receive_remove causes inconsistent dir and master state.

This sequence requires an unusually long delay in delivering the unlock
callback, because the remove message from 2->1 usually happens after
some seconds.  So, it's not known exactly how frequently this sequence
occurs in pratice.  It's possible that the same end result could also
have another unknown cause.

The solution for this issue is to further separate callback state
from the lkb, so that an lkb reference (and from that, an rsb ref)
are not held while a callback remains queued.  Then, within the
unlock_reply, the lkb will be freed and the rsb moved to the toss
list. So, the receive_remove will not be ignored.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01 13:31:12 -05:00
Alexander Aring
0175e51b51 dlm: combine switch case fail and default statements
This patch combines the failure and default cases for enqueue and
dequeue a callback to the lkb callback queue that should end in both
cases as it should never happen.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01 13:31:12 -05:00
Alexander Aring
4ed4242809 dlm: save callback debug info earlier
Save lkb callback info when queueing the callback so that the
lkb struct is not needed in the callback workqueue processing.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01 13:31:12 -05:00
Alexander Aring
16e98462b7 dlm: remove callback queue debugfs functionality
Remove the ability to dump pending lkb callbacks from debugfs.
The prepares for separating lkb structs from callbacks.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01 13:31:12 -05:00
Alexander Aring
1131f33908 dlm: remove lkb from callback tracepoints
Stop using lkb structs in the callback tracepoints so that lkb
references are not needed. This prepares for separating lkb
structs from callbacks.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01 13:31:12 -05:00
Kunwu Chan
609ed5bde2 dlm: Simplify the allocation of slab caches in dlm_midcomms_cache_create
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01 13:31:12 -05:00
Alexander Aring
ad191e0eee dlm: fix user space lock decision to copy lvb
This patch fixes the copy lvb decision for user space lock requests.
Checking dlm_lvb_operations is done earlier, where granted/requested
lock modes are available to use in the matrix.

The decision had been moved to the wrong location, where granted mode
and requested mode where the same, which causes the dlm_lvb_operations
matix to produce the wrong copy decision. For PW or EX requests, the
caller could get invalid lvb data.

Fixes: 61bed0baa4 ("fs: dlm: use a non-static queue for callbacks")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01 13:31:12 -05:00
Christoph Hellwig
fa1af65bf8 btrfs use bio_list_merge_init
Use bio_list_merge_init instead of open coding bio_list_merge and
bio_list_init.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240328084147.2954434-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-01 11:53:37 -06:00
Alexander Lobakin
4ca532d646 btrfs: rename bitmap_set_bits() -> btrfs_bitmap_set_bits()
bitmap_set_bits() does not start with the FS' prefix and may collide
with a new generic helper one day. It operates with the FS-specific
types, so there's no change those two could do the same thing.
Just add the prefix to exclude such possible conflict.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 10:49:28 +01:00
Alexander Lobakin
3f5ef5109f fs/ntfs3: add prefix to bitmap_size() and use BITS_TO_U64()
bitmap_size() is a pretty generic name and one may want to use it for
a generic bitmap API function. At the same time, its logic is
NTFS-specific, as it aligns to the sizeof(u64), not the sizeof(long)
(although it uses ideologically right ALIGN() instead of division).
Add the prefix 'ntfs3_' used for that FS (not just 'ntfs_' to not mix
it with the legacy module) and use generic BITS_TO_U64() while at it.

Suggested-by: Yury Norov <yury.norov@gmail.com> # BITS_TO_U64()
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 10:49:27 +01:00
Andrey Albershteyn
e23d7e82b7 xfs: allow cross-linking special files without project quota
There's an issue that if special files is created before quota
project is enabled, then it's not possible to link this file. This
works fine for normal files. This happens because xfs_quota skips
special files (no ioctls to set necessary flags). The check for
having the same project ID for source and destination then fails as
source file doesn't have any ID.

mkfs.xfs -f /dev/sda
mount -o prjquota /dev/sda /mnt/test

mkdir /mnt/test/foo
mkfifo /mnt/test/foo/fifo1

xfs_quota -xc "project -sp /mnt/test/foo 9" /mnt/test
> Setting up project 9 (path /mnt/test/foo)...
> xfs_quota: skipping special file /mnt/test/foo/fifo1
> Processed 1 (/etc/projects and cmdline) paths for project 9 with recursion depth infinite (-1).

ln /mnt/test/foo/fifo1 /mnt/test/foo/fifo1_link
> ln: failed to create hard link '/mnt/test/testdir/fifo1_link' => '/mnt/test/testdir/fifo1': Invalid cross-device link

mkfifo /mnt/test/foo/fifo2
ln /mnt/test/foo/fifo2 /mnt/test/foo/fifo2_link

Fix this by allowing linking of special files to the project quota
if special files doesn't have any ID set (ID = 0).

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-01 11:55:49 +05:30
Kent Overstreet
b3c7fd35c0 bcachefs: On emergency shutdown, print out current journal sequence number
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-01 01:07:24 -04:00
Kent Overstreet
eab3a3ce2d bcachefs: Fix overlapping extent repair
overlapping extent repair was colliding with extent past end of inode
checks - don't update "extent ends at" until we know we have an extent.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-01 01:05:50 -04:00
Kent Overstreet
8ce1db8091 bcachefs: Fix remove_dirent()
We were missing an iter_traverse().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-01 00:52:32 -04:00
Kent Overstreet
cecfed9b44 bcachefs: Logged op errors should be ignored
If something is wrong with a logged op, we just want to delete it -
there's nothing to repair.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-01 00:04:10 -04:00
Kent Overstreet
13c1e583f9 bcachefs: Improve -o norecovery; opts.recovery_pass_limit
This adds opts.recovery_pass_limit, and redoes -o norecovery to make use
of it; this fixes some issues with -o norecovery so it can be safely
used for data recovery.

Norecovery means "don't do journal replay"; it's an important data
recovery tool when we're getting stuck in journal replay.

When using it this way we need to make sure we don't free journal keys
after startup, so we continue to overlay them: thus it needs to imply
retain_recovery_info, as well as nochanges.

recovery_pass_limit is an explicit option for telling recovery to exit
after a specific recovery pass; this is a much cleaner way of
implementing -o norecovery, as well as being a useful debug feature in
its own right.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:12 -04:00
Kent Overstreet
060ff30a85 bcachefs: bch2_run_explicit_recovery_pass_persistent()
Flag that we need to run a recovery pass and run it - persistenly, so if
we crash it'll still get run.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:12 -04:00
Kent Overstreet
0a34c058fc bcachefs: Ensure bch_sb_field_ext always exists
This makes bch_sb_field_ext more consistent with the rest of -o
nochanges - we don't want to be varying other codepaths based on -o
nochanges, since it's used for testing in dry run mode; also fixes some
potential null ptr derefs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:12 -04:00
Kent Overstreet
4fe0eeeae4 bcachefs: Flush journal immediately after replay if we did early repair
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:12 -04:00
Kent Overstreet
af855a5f5e bcachefs: Resume logged ops after fsck
Finishing logged ops requires the filesystem to be in a reasonably
consistent state - and other fsck passes don't require it to have
completed, so just run it last.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:12 -04:00
Kent Overstreet
e5aa804641 bcachefs: Add error messages to logged ops fns
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
d2554263ad bcachefs: Split out recovery_passes.c
We've grown a fair amount of code for managing recovery passes; tracking
which ones we're running, which ones need to be run, and flagging in the
superblock which ones need to be run on the next recovery.

So it's worth splitting out into its own file, this code is pretty
different from the code in recovery.c.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
11d5568d3e bcachefs: fix backpointer for missing alloc key msg
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
7f9e508036 bcachefs: Fix bch2_btree_increase_depth()
When we haven't yet allocated any btree nodes for a given btree, we
first need to call the regular split path to allocate one.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
47d2080e30 bcachefs: Kill bch2_bkey_ptr_data_type()
Remove some duplication, and inconsistency between check_fix_ptrs and
the main ptr marking paths

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
dcc1c04587 bcachefs: Fix use after free in check_root_trans()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
83bb585390 bcachefs: Fix repair path for missing indirect extents
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
6f5869ffd9 bcachefs: Fix use after free in bch2_check_fix_ptrs()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
812a929793 bcachefs: Fix btree node keys accounting in topology repair path
When dropping keys now outside a now because we're changing the node
min/max, we need to redo the node's accounting as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
805b535a8a bcachefs: Check btree ptr min_key in .invalid
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
zhuxiaohui
bb66009958 bcachefs: add REQ_SYNC and REQ_IDLE in write dio
when writing file with direct_IO on bcachefs, then performance is
much lower than other fs due to write back throttle in block layer:

        wbt_wait+1
        __rq_qos_throttle+32
        blk_mq_submit_bio+394
        submit_bio_noacct_nocheck+649
        bch2_submit_wbio_replicas+538
        __bch2_write+2539
        bch2_direct_write+1663
        bch2_write_iter+318
        aio_write+355
        io_submit_one+1224
        __x64_sys_io_submit+169
        do_syscall_64+134
        entry_SYSCALL_64_after_hwframe+110

add set REQ_SYNC and REQ_IDLE in bio->bi_opf as standard dirct-io

Signed-off-by: zhuxiaohui <zhuxiaohui.400@bytedance.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
79032b0781 bcachefs: Improved topology repair checks
Consolidate bch2_gc_check_topology() and btree_node_interior_verify(),
and replace them with an improved version,
bch2_btree_node_check_topology().

This checks that children of an interior node correctly span the full
range of the parent node with no overlaps.

Also, ensure that topology repairs at runtime are always a fatal error;
in particular, this adds a check in btree_iter_down() - if we don't find
a key while walking down the btree that's indicative of a topology error
and should be flagged as such, not a null ptr deref.

Some checks in btree_update_interior.c remaining BUG_ONS(), because we
already checked the node for topology errors when starting the update,
and the assertions indicate that we _just_ corrupted the btree node -
i.e. the problem can't be that existing on disk corruption, they
indicate an actual algorithmic bug.

In the future, we'll be annotating the fsck errors list with which
recovery pass corrects them; the open coded "run explicit recovery pass
or fatal error" in bch2_btree_node_check_topology() will in the future
be done for every fsck_err() call.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
40cb26233a bcachefs: Be careful about btree node splits during journal replay
Don't pick a pivot that's going to be deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
048f47e83f bcachefs: btree_and_journal_iter now respects trans->journal_replay_not_finished
btree_and_journal_iter is now safe to use at runtime, not just during
recovery before journal keys have been freed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Hongbo Li
36f9ef109b bcachefs: fix trans->mem realloc in __bch2_trans_kmalloc
The old code doesn't consider the mem alloced from mempool when call
krealloc on trans->mem. Also in bch2_trans_put, using mempool_free to
free trans->mem by condition "trans->mem_bytes == BTREE_TRANS_MEM_MAX"
is inaccurate when trans->mem was allocated by krealloc function.
Instead, we use used_mempool stuff to record the situation, and realloc
or free the trans->mem in elegant way.

Also, after krealloc failed in __bch2_trans_kmalloc, the old data
should be copied to the new buffer when alloc from mempool_alloc.

Fixes: 31403dca5b ("bcachefs: optimize __bch2_trans_get(), kill DEBUG_TRANSACTIONS")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
57339b24a0 bcachefs: Don't do extent merging before journal replay is finished
We don't normally do extent updates this early in recovery, but some of
the repair paths have to and when we do, we don't want to do anything
that requires the snapshots table.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
ec9cc18fc2 bcachefs: Add checks for invalid snapshot IDs
Previously, we assumed that keys were consistent with the snapshots
btree - but that's not correct as fsck may not have been run or may not
be complete.

This adds checks and error handling when using the in-memory snapshots
table (that mirrors the snapshots btree).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
63332394c7 bcachefs: Move snapshot table size to struct snapshot_table
We need to add bounds checking for snapshot table accesses - it turns
out there are cases where we do need to use the snapshots table before
fsck checks have completed (and indeed, fsck may not have been run).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
aa6e130e3c bcachefs: Add an assertion for trying to evict btree root
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
4bd02d3fb3 bcachefs: fix mount error path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:10 -04:00
Thomas Bertschinger
688d750d10 bcachefs: fix misplaced newline in __bch2_inode_unpacked_to_text()
before:

u64s 18 type inode_v3 0:1879048192:U32_MAX len 0 ver 0:   mode=40700
  flags= (15300000)
  journal_seq=4
  bi_size=0
  bi_sectors=0

  bi_version=0bi_atime=227064388944
  ...

after:

u64s 18 type inode_v3 0:1879048192:U32_MAX len 0 ver 0:   mode=40700
  flags= (15300000)
  journal_seq=4
  bi_size=0
  bi_sectors=0
  bi_version=0
  bi_atime=227064388944
  ...

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:10 -04:00
Kent Overstreet
8aad8e1f65 bcachefs: Fix journal pins in btree write buffer
btree write buffer flush has two phases
 - in natural key order, which is more efficient but may fail
 - then in journal order

The journal order flush was assuming that keys were still correctly
ordered by journal sequence number - but due to coalescing by the
previous phase, we need an additional sort.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:10 -04:00
Kent Overstreet
a5e3dce493 bcachefs: Fix assert in bch2_backpointer_invalid()
Backpointers that point to invalid devices are caught by fsck, not
.key_invalid; so .key_invalid needs to check for them instead of hitting
asserts.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:10 -04:00
Justin Stitt
ebd9779683 smb: client: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

In cifssmb.c:
Using strncpy with a length argument equal to strlen(src) is generally
dangerous because it can cause string buffers to not be NUL-terminated.
In this case, however, there was extra effort made to ensure the buffer
was NUL-terminated via a manual NUL-byte assignment. In an effort to rid
the kernel of strncpy() use, let's swap over to using strscpy() which
guarantees NUL-termination on the destination buffer.

To handle the case where ea_name is NULL, let's use the ?: operator to
substitute in an empty string, thereby allowing strscpy to still
NUL-terminate the destintation string.

Interesting note: this flex array buffer may go on to also have some
value encoded after the NUL-termination:
|	if (ea_value_len)
|		memcpy(parm_data->list.name + name_len + 1,
|			ea_value, ea_value_len);

Now for smb2ops.c and smb2transport.c:
Both of these cases are simple, strncpy() is used to copy string
literals which have a length less than the destination buffer's size. We
can simply swap in the new 2-argument version of strscpy() introduced in
Commit e6584c3964 ("string: Allow 2-argument strscpy()").

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-31 17:35:14 -05:00
Linus Torvalds
7e40c2100c Kbuild fixes for v6.9
- Deduplicate Kconfig entries for CONFIG_CXL_PMU
 
  - Fix unselectable choice entry in MIPS Kconfig, and forbid this
    structure
 
  - Remove unused include/asm-generic/export.h
 
  - Fix a NULL pointer dereference bug in modpost
 
  - Enable -Woverride-init warning consistently with W=1
 
  - Drop KCSAN flags from *.mod.c files
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmYJVK0VHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGf/sP/3GOk//cQGwPyWCgtCEUo6T4yyD7
 1m2TTR0JQk/lcohSFtYk0I20rhKRqU6yAMAERmyehI66D2QY7lhiYVc16ram5y04
 x0nWxd9IqerIlGJtaWePOvNqKdCw2EP9fS9NKz58rEDMGlsSf0Rd3NEdSsWoH8td
 dECtt8yCawENAMStb/rAfsnL6kn2JIhVMyqwo0RdQfiaVT5Zk6Qgpko0Oq0ncRP2
 qdNgHbvnJdKMy81FHSBAi0QEZOYvhFNX+E+6lFfWEsX6xT+wvXddCNQzJf/YV3Cw
 Klw1tGveV7UGzlZ4fsnFrv4V6g1KO2AD3342efdDo++ypBEBpImVODc+Rp0jE9Nk
 OgdOQRe2k9a5keH0LWY0ehvDbQlSbfNxk0wNtAfo5Kk5e41nHmHJBWCwGG+cXrjJ
 mPJjSrTpuNVSaGV0kt3EskHbDBeBmIIg+5QPbldmW2qcC88kWoavkyLD3WPFsg/a
 CAuR/HqH7MDfxzvsqTCjonlVcyDKX6aW66LrQ1NCtmphI4F8mdKp746CzGlziuIm
 gjYJL/UWVlx0VebMo8dwDpaHvez4/4s6xAJcyqtA+TS5HbrQWKQuwFkiv4iWQxNd
 MvyVdzgKhcMdoXhfFpUZ0LlFvHGefJ+Z6N1FQLoQJkTirt5aqRbEAjP0VXwQB4eH
 zYygkhvvtiH9/STu
 =tx+2
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Deduplicate Kconfig entries for CONFIG_CXL_PMU

 - Fix unselectable choice entry in MIPS Kconfig, and forbid this
   structure

 - Remove unused include/asm-generic/export.h

 - Fix a NULL pointer dereference bug in modpost

 - Enable -Woverride-init warning consistently with W=1

 - Drop KCSAN flags from *.mod.c files

* tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: Fix typo HEIGTH to HEIGHT
  Documentation/llvm: Note s390 LLVM=1 support with LLVM 18.1.0 and newer
  kbuild: Disable KCSAN for autogenerated *.mod.c intermediaries
  kbuild: make -Woverride-init warnings more consistent
  modpost: do not make find_tosym() return NULL
  export.h: remove include/asm-generic/export.h
  kconfig: do not reparent the menu inside a choice block
  MIPS: move unselectable FIT_IMAGE_FDT_EPM5 out of the "System type" choice
  cxl: remove CONFIG_CXL_PMU entry in drivers/cxl/Kconfig
2024-03-31 11:23:51 -07:00
Yuezhang Mo
d7ed5232f0 exfat: fix timing of synchronizing bitmap and inode
Commit(f55c096f62 exfat: do not zero the extended part) changed
the timing of synchronizing bitmap and inode in exfat_cont_expand().
The change caused xfstests generic/013 to fail if 'dirsync' or 'sync'
is enabled. So this commit restores the timing.

Fixes: f55c096f62 ("exfat: do not zero the extended part")
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-03-31 23:39:29 +09:00
Arnd Bergmann
c40845e319 kbuild: make -Woverride-init warnings more consistent
The -Woverride-init warn about code that may be intentional or not,
but the inintentional ones tend to be real bugs, so there is a bit of
disagreement on whether this warning option should be enabled by default
and we have multiple settings in scripts/Makefile.extrawarn as well as
individual subsystems.

Older versions of clang only supported -Wno-initializer-overrides with
the same meaning as gcc's -Woverride-init, though all supported versions
now work with both. Because of this difference, an earlier cleanup of
mine accidentally turned the clang warning off for W=1 builds and only
left it on for W=2, while it's still enabled for gcc with W=1.

There is also one driver that only turns the warning off for newer
versions of gcc but not other compilers, and some but not all the
Makefiles still use a cc-disable-warning conditional that is no
longer needed with supported compilers here.

Address all of the above by removing the special cases for clang
and always turning the warning off unconditionally where it got
in the way, using the syntax that is supported by both compilers.

Fixes: 2cd3271b7a ("kbuild: avoid duplicate warning options")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-03-31 11:32:26 +09:00
Linus Torvalds
712e14250d Bug fixes for 6.9-rc2:
* Allow stripe unit/width value passed via mount option to be written over
    existing values in the super block.
  * Do not set current->journal_info to avoid its value from being miused by
    another filesystem context.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZgKa+AAKCRAH7y4RirJu
 9IL1APwPBMzSowijBI/rCD5BGlISn7mCRlZwvyXE1avmRmbQPAEApU5yRhBHWi62
 629azfSr1I5m678xM7WQKh6X3/VUDAg=
 =pqNH
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.9-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Chandan Babu:

 - Allow stripe unit/width value passed via mount option to be written
   over existing values in the super block

 - Do not set current->journal_info to avoid its value from being miused
   by another filesystem context

* tag 'xfs-6.9-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: don't use current->journal_info
  xfs: allow sunit mount option to repair bad primary sb stripe values
2024-03-30 13:51:58 -07:00
Linus Torvalds
091619baac 2 cifs.ko changesets
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmYG6B4ACgkQiiy9cAdy
 T1H8vAv/ULRl/PQreMq6FaXSeW0S5w7RiMeeZ1+314C7W/iLLbd8n6fmOshAS1nn
 d8vC6qx5ulTNRwFR7eYexolWgqG7bisKTiGzp+UZelo01LlPU90cJQvPZfaETVPR
 XpqRUwrQE/sU9n48PGvcGARqd2sz1WC3l/oeN1P0QXJ7xWKJJNcLqtzT0eeNgAto
 u/EKw9SZ5D20V1GrGYxHY57M9L0rRcndUoWZDn9s0NCkOGnibOOiTxpQV8sepJ35
 6CRKxLTatzAJahUaQ/C5uvM0cSfRcTQdp3u4B01+gCSPhP7mwIS0o5EDzw2lWk6W
 9zQz9de2PPw98Q0sbZn1fuherbLSKDXZIIUKEH+oZ2WXJfva5jIVBpp5EIR92VY4
 A86DznmRJC5br3q+FuBOc1NPyslV01l80ZmhWppeX6/YA8egkm4KWgFB/02n7Agh
 Ug2Fe8MNGVPiwmijXBswf9CSjn3ctAH9cRvqs0QXj3aPkduvu+jDVGwbO/zWdvlY
 c1gs/eF7
 =ai8C
 -----END PGP SIGNATURE-----

Merge tag '6.9-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - Add missing trace point (noticed when debugging the recent mknod LSM
   regression)

 - fscache fix

* tag '6.9-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix duplicate fscache cookie warnings
  smb3: add trace event for mknod
2024-03-29 12:06:09 -07:00
Linus Torvalds
d8e8fbec00 nfsd-6.9 fixes:
- Address three recently introduced regressions
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmYFtQkACgkQM2qzM29m
 f5fSbA//S0YiPCU+bVwN+mK47vhai1Dzaw1kd2rzlfnvTCyBFVocfQU85bFV8uns
 ZMcHhO6jPjRg1o/qkCzeNf+n6AVGwvgOn1jC628jTP1wJOS8WPsZtdwz8vHX/xw5
 qnMwEUtfFPiDn85VyK6z6mJBWRcF/vStPc022Ie5ba5ydxD0rFmaeIc6zYX8csUe
 8tbHlulYtRR97mXjxgtXC/H0scvag07MA1fRKiZRNk0jwUdtcAsHktiBfWPV2KXO
 ew8saA+BLyBBdE01cq1CoMH/3wt/a5rpaiV5ePPtYkmlPVT5IEmSLZUwUVZAjMuu
 qlvKF9xrH7x8/7g/SwOJdH8po0IjiRdW9EKID+6PPqXz+YYKyTe7JqZtvEzt/4x9
 tGbIfyHxpUPPTZ6VedlXYW95uWG+lrKzur+zHQwsNf+U0d5Xila8euEKqvB4kDtk
 riHUlvs8O6WroeHcOQ5pK72l7x8gELdVqHP6n+E0t5+2VKp/Vcqf8WlWV0suoDfa
 D2EYGIzxn8NrNuivGESHfzafmHKKMn51UGrm1Sl0vF5w2LlqMfL4Yo4NIT+e69VR
 E3h6wIhK7Jor9px5/tBD3Y6o9k8wT9upjA+qsrOUhxyU3T9r+s88I64gvGyABTIM
 kZJHLJYROoNO6A5FSRySSH+bk/UU+wJ0s6avgHL1A2yYnyrQUzU=
 =YV5O
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Address three recently introduced regressions

* tag 'nfsd-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: CREATE_SESSION must never cache NFS4ERR_DELAY replies
  SUNRPC: Revert 561141dd49
  nfsd: Fix error cleanup path in nfsd_rename()
2024-03-28 14:35:32 -07:00
Jeff Layton
7a84602297
9p: explicitly deny setlease attempts
9p is a remote network protocol, and it doesn't support asynchronous
notifications from the server. Ensure that we don't hand out any leases
since we can't guarantee they'll be broken when a file's contents
change.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-28 19:52:55 +00:00
Joakim Sindholt
4e5d208cc9
fs/9p: fix the cache always being enabled on files with qid flags
I'm not sure why this check was ever here. After updating to 6.6 I
suddenly found caching had been turned on by default and neither
cache=none nor the new directio would turn it off. After walking through
the new code very manually I realized that it's because the caching has
to be, in effect, turned off explicitly by setting P9L_DIRECT and
whenever a file has a flag, in my case QTAPPEND, it doesn't get set.

Setting aside QTDIR which seems to ignore the new fid->mode entirely,
the rest of these either should be subject to the same cache rules as
every other QTFILE or perhaps very explicitly not cached in the case of
QTAUTH.

Signed-off-by: Joakim Sindholt <opensource@zhasha.com>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-28 15:10:29 +00:00
Joakim Sindholt
87de39e705
fs/9p: translate O_TRUNC into OTRUNC
This one hits both 9P2000 and .u as it appears v9fs has never translated
the O_TRUNC flag.

Signed-off-by: Joakim Sindholt <opensource@zhasha.com>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-28 15:10:28 +00:00
Joakim Sindholt
cd25e15e57
fs/9p: only translate RWX permissions for plain 9P2000
Garbage in plain 9P2000's perm bits is allowed through, which causes it
to be able to set (among others) the suid bit. This was presumably not
the intent since the unix extended bits are handled explicitly and
conditionally on .u.

Signed-off-by: Joakim Sindholt <opensource@zhasha.com>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-28 13:59:23 +00:00
Linus Torvalds
8d025e2092 Changes since last update:
- Add a new reviewer Sandeep Dhavale to build a healthier community;
 
  - Drop experimental warning for FSDAX.
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmYE1ucRHHhpYW5nQGtl
 cm5lbC5vcmcACgkQUXZn5Zlu5qoSnw//Q35Sc7fIQb6YfuOOQ2VQX7TVE4jTrLej
 +3hIQUz3BA8wRwrBYdeJLcvjda4y+0gNxlpw2ycNS6S4vxAiVEAuwPJYO6oAC0Il
 ETa6opc1vpTMeXgwwtbXC7ACnWas9EQC61Z4E8W5zeVcNqQZnZyInMJ9Rkjqs/iJ
 VjJH2wWR5MIgWJdEPqchPx/28nqbBOcztc7ARJqpujyZEvu+OBVIPv8P/7N0a5yG
 bpHkDrzoelBMkpktpuvrkv84ymyCDC7LH9mq+Wk6dY5wRFxa2BtoZVh6YcpcjrpR
 75GZVg3BN3Ph41JCwYHqxyRGpoLO11dSYi6DNDVngxOGtkTNRVGrJ1FYEapWV+o8
 1MnEbl0vZSHUkjrIFbfZTFSqpvW2XSfEOa3heNDFknmzT/ISobSGENULp9cggcYI
 jhS6wtVG4bl5bCiCKCZluByr8/J8TCQc/5t5f5bQLy2MWqlyjaSx82uWuDpzO1Hh
 +q+p+MB+ketMUwxaIUAuTNgzhFPFT4Na/ni9WqP7Ri3GJY6pdjDUUtrtoIBK4oPQ
 ajUWhPlOk5zMwLq9Jl4MiG1ostBI9P37ZerjsdaLDZYElGhTjwjPu/xlh7p26Inq
 Ufq3QaQH2wai+oAVS6Sli3dJfb399XVJhmT2WFMH+0DmQW6JzvsGTdUX2fMgv5sb
 I7dVfuceTs0=
 =+ZW9
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:

 - Add a new reviewer Sandeep Dhavale to build a healthier community

 - Drop experimental warning for FSDAX

* tag 'erofs-for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  MAINTAINERS: erofs: add myself as reviewer
  erofs: drop experimental warning for FSDAX
2024-03-27 20:24:09 -07:00
Linus Torvalds
4076fa1612 fs/9p: fix errors found in 6.9-rc1
Two of these fix syzbot reported issues, and
 the other fixes a unused variable in some configurations.
 
 Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmYEjyAACgkQiP/V+0pf
 /5jrDhAAkrEkFry9Zp7txMNG7fyRVRW8BPkvCUbORBdmLm01lswTnvHDT5pbSYcf
 Z3KxSpo0DvO2G3XEHQHQN24Sbx78Xg8XssXGFc0j1/hQbGpwIyXM/NTa3VVnfmwH
 lV2ysLa5zCR81k0hu8QzGe8DSOFsyE9oz3ABE7nCjmQsOI0zgyOvn7Uy1rwT5B11
 lBYN4rotl2Ie1wPoVTf8PNbeAWtuVR3FN9GTSvZuNDFbYjYEv0zFangxXIBFhn2g
 pUJ22C6CYenhLzTKBRVW/CcSxPVVS4jFok1nws07MmlXozrbklPYHuN6XQYFetLt
 wtnZxmUmy1u4pmpG2xPziUOGb7wdm6cDC1aIrcYPlbk/9U2iYHuVgQz0WOcEIbA3
 g5LrgCpUZO8UPxEvtFYDq9oUAhxhjf6BI7/rR0fNWp8JgJ0scz3d/e+Q/srWWr23
 Pej+10N5JijTEuD39BUnWy/yTz0WVBg1nP/C44buB0zSWJbxiPOJd0imSW6ipoIk
 Onr3FmhnRSiIfuUKEH/jI+QsqKzZTdBe9e3+SYBEg4KXvM3e1ltJgDs+BIdWbo00
 ih/jkZL7iX+uzkOdVR3o/1KagkwHTcU8P+4i8wELD7+CquQXvsedfqeTmz5JvNIR
 9Df1Gu1xCeINn6jnlSaVAonnCi1awJoaxFW+azCb+q5MP8lP6IU=
 =1ev6
 -----END PGP SIGNATURE-----

Merge tag '9p-fixes-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9p fixes from Eric Van Hensbergen:
 "Two of these fix syzbot reported issues, and the other fixes a unused
  variable in some configurations"

* tag '9p-fixes-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: fix uninitialized values during inode evict
  fs/9p: remove redundant pointer v9ses
  fs/9p: fix uaf in in v9fs_stat2inode_dotl
2024-03-27 14:53:56 -07:00
Linus Torvalds
400dd456bd for-6.9-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYEfJYACgkQxWXV+ddt
 WDucIg/+IupuqdLKnj6bepxX/VufnFAjecD3sRgZQQLIMfm3MQX3TzbNoPYEiAGU
 tNG6jxYgkGRoyhN3aIQnsJmRFje5epYjNA5+ueUNT2/KfyKonnS2TIKQt6u7XBls
 fl4SCTSNRX7w/QUNUWwyY5/86yzV4F8w19X5nVOKcp7Nz3hUBdeDZWAmMlYyHuFW
 N2YRyNdCxB4Y0U9g1vgI63wFjOac0F+7RTHGsDH7ueOZ2dbtDM38lHBoCbdX3jmy
 5nG7wVJZp2H/zCmzrVQJ897CMfr3h9r9Kxx8EE3JDJaJ5sMMaRh361rgsZTaGsjz
 SwUzT6Z0u0hsBANSTOUZixhfX5sqArmemG+XpFu6Rq+732DqS+c4vWRSu7c8Rc8i
 +4HIQNsjJqm/d1u2IyxXfuqSbaULLnyYQ8rdEx3o2AM37JnuTvOWoB+v/JqPb9TI
 aG+bOPvg7GM9Sl3IoM5sR+j3bEebranZbUF+UiDEujZJiY+uiw3vMbFvyOBRWaUU
 ODTpNoyCmz94mWg79hyosOjM9A/NCEkRH4oSc+YeqOvzTIBG3V+D3HxN/DX4FVTy
 VDxMdptu0aPIEkUQ3nvsj4t3OKgj1w9rxZFpRYH33zJVvRqZ8VrgtC9V6zgPv3h7
 suQL4s4i4EiIgAk2Z0OR23wDwg1TwXVWLGErfQVHslvhl/a2Qb4=
 =Lhhp
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix race when reading extent buffer and 'uptodate' status is missed
   by one thread (introduced in 6.5)

 - do additional validation of devices using major:minor numbers

 - zoned mode fixes:
     - use zone-aware super block access during scrub
     - fix use-after-free during device replace (found by KASAN)
     - also delete zones that are 100% unusable to reclaim space

 - extent unpinning fixes:
     - fix extent map leak after error handling
     - print correct range in error message

 - error code and message updates

* tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix race in read_extent_buffer_pages()
  btrfs: return accurate error code on open failure in open_fs_devices()
  btrfs: zoned: don't skip block groups with 100% zone unusable
  btrfs: use btrfs_warn() to log message at btrfs_add_extent_mapping()
  btrfs: fix message not properly printing interval when adding extent map
  btrfs: fix warning messages not printing interval at unpin_extent_range()
  btrfs: fix extent map leak in unexpected scenario at unpin_extent_cache()
  btrfs: validate device maj:min during open
  btrfs: zoned: fix use-after-free in do_zone_finish()
  btrfs: zoned: use zone aware sb location for scrub
2024-03-27 13:56:41 -07:00
Chuck Lever
99dc2ef039 NFSD: CREATE_SESSION must never cache NFS4ERR_DELAY replies
There are one or two cases where CREATE_SESSION returns
NFS4ERR_DELAY in order to force the client to wait a bit and try
CREATE_SESSION again. However, after commit e4469c6cc6 ("NFSD: Fix
the NFSv4.1 CREATE_SESSION operation"), NFSD caches that response in
the CREATE_SESSION slot. Thus, when the client resends the
CREATE_SESSION, the server always returns the cached NFS4ERR_DELAY
response rather than actually executing the request and properly
recording its outcome. This blocks the client from making further
progress.

RFC 8881 Section 15.1.1.3 says:
> If NFS4ERR_DELAY is returned on an operation other than SEQUENCE
> that validly appears as the first operation of a request ... [t]he
> request can be retried in full without modification. In this case
> as well, the replier MUST avoid returning a response containing
> NFS4ERR_DELAY as the response to an initial operation of a request
> solely on the basis of its presence in the reply cache.

Neither the original NFSD code nor the discussion in section 18.36.4
refer explicitly to this important requirement, so I missed it.

Note also that not only must the server not cache NFS4ERR_DELAY, but
it has to not advance the CREATE_SESSION slot sequence number so
that it can properly recognize and accept the client's retry.

Reported-by: Dai Ngo <dai.ngo@oracle.com>
Fixes: e4469c6cc6 ("NFSD: Fix the NFSv4.1 CREATE_SESSION operation")
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-27 13:19:47 -04:00
David Howells
8876a37277 cifs: Fix duplicate fscache cookie warnings
fscache emits a lot of duplicate cookie warnings with cifs because the
index key for the fscache cookies does not include everything that the
cifs_find_inode() function does.  The latter is used with iget5_locked() to
distinguish between inodes in the local inode cache.

Fix this by adding the creation time and file type to the fscache cookie
key.

Additionally, add a couple of comments to note that if one is changed the
other must be also.

Signed-off-by: David Howells <dhowells@redhat.com>
Fixes: 70431bfd82 ("cifs: Support fscache indexing rewrite")
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-27 12:04:06 -05:00
Linus Torvalds
f4a432914a execve fixes for v6.9-rc2
- Fix selftests to conform to the TAP output format (Muhammad Usama Anjum)
 
 - Fix NOMMU linux_binprm::exec pointer in auxv (Max Filippov)
 
 - Replace deprecated strncpy usage (Justin Stitt)
 
 - Replace another /bin/sh instance in selftests
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmYDT3sWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJjxlD/49PYpA4hMReEJ/01UkMn7IT2DP
 QWV9IfaPTodj9tjQngalhcF7r6O5guRR7MRfZxyaXriq4aJNzOLm2STmwSG1cOgP
 hP9D0HnMSc5CrqMJ2kSTr3ETK0a2mTivWl375TUgGdW+QJo7YYInHYaH2THhme1Z
 MkLHqSkruHw6YVvSvzoWiwZ4taiia7op8HbAEvJQiwnJdiVeCLIYbf2AxXNop2xv
 xcmoGkSh6KSiQ0XQ7VXs4LC3v/ElHBINSbChoXPBDY5kBWZybyxRwYCVt8mJftgF
 mVGXBFFpnaLU/gDayPg/Pyq9sW1bLpi8w0BBu419BVfAQ475K+YZ/V8nj4fm95e3
 gIWm3x1O48r0OxdzmPb5re/s7lG5uNLzzFEWIus18NmqgA8S1CyFveRB3Zh8LlXB
 9UEt4mlcgp/CLAo1Zv6IBe6UDcAf4AR4Tq+d+etmORTqHmM7n399XivNuft9myyB
 9ObLCfKvOa71uF0n714XLHc5STk2KTK70Me2L/H5gitSqjIEKFNQ5SOaSbsGImDv
 i4YPnptCJFTQumE0Tu5hna8uyjOXFIxq/zkfDmzc1wP8FcijwRx3UPoO6WlQsdfx
 5cmJSaIX1bhFC+4gxAoEHUDWPh/f4kLeDpIXX6NPH28Do1wxLnri3ryvkfgkw5Vj
 /1E03LXfcnnSbjQAPQ==
 =Siss
 -----END PGP SIGNATURE-----

Merge tag 'execve-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve fixes from Kees Cook:

 - Fix selftests to conform to the TAP output format (Muhammad Usama
   Anjum)

 - Fix NOMMU linux_binprm::exec pointer in auxv (Max Filippov)

 - Replace deprecated strncpy usage (Justin Stitt)

 - Replace another /bin/sh instance in selftests

* tag 'execve-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt: replace deprecated strncpy
  exec: Fix NOMMU linux_binprm::exec in transfer_args_to_stack()
  selftests/exec: Convert remaining /bin/sh to /bin/bash
  selftests/exec: execveat: Improve debug reporting
  selftests/exec: recursion-depth: conform test to TAP format output
  selftests/exec: load_address: conform test to TAP format output
  selftests/exec: binfmt_script: Add the overall result line according to TAP
2024-03-27 09:57:30 -07:00
Christian Brauner
22650a9982
fs,block: yield devices early
Currently a device is only really released once the umount returns to
userspace due to how file closing works. That ultimately could cause
an old umount assumption to be violated that concurrent umount and mount
don't fail. So an exclusively held device with a temporary holder should
be yielded before the filesystem is gone. Add a helper that allows
callers to do that. This also allows us to remove the two holder ops
that Linus wasn't excited about.

Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org
Fixes: f3a608827d ("bdev: open block device as files") # mainline only
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-27 13:17:15 +01:00
Steve French
e9e9fbeb83 smb3: add trace event for mknod
Add trace points to help debug mknod and mkfifo:

   smb3_mknod_done
   smb3_mknod_enter
   smb3_mknod_err

Example output:

      TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
         | |         |   |||||     |         |
    mkfifo-6163    [003] .....   960.425558: smb3_mknod_enter: xid=12 sid=0xb55130f6 tid=0x46e6241c path=\fifo1
    mkfifo-6163    [003] .....   960.432719: smb3_mknod_done: xid=12 sid=0xb55130f6 tid=0x46e6241c

Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Meetakshi Setiya <msetiya@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-26 14:56:36 -05:00
Tavian Barnes
ef1e68236b btrfs: fix race in read_extent_buffer_pages()
There are reports from tree-checker that detects corrupted nodes,
without any obvious pattern so possibly an overwrite in memory.
After some debugging it turns out there's a race when reading an extent
buffer the uptodate status can be missed.

To prevent concurrent reads for the same extent buffer,
read_extent_buffer_pages() performs these checks:

    /* (1) */
    if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
        return 0;

    /* (2) */
    if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
        goto done;

At this point, it seems safe to start the actual read operation. Once
that completes, end_bbio_meta_read() does

    /* (3) */
    set_extent_buffer_uptodate(eb);

    /* (4) */
    clear_bit(EXTENT_BUFFER_READING, &eb->bflags);

Normally, this is enough to ensure only one read happens, and all other
callers wait for it to finish before returning.  Unfortunately, there is
a racey interleaving:

    Thread A | Thread B | Thread C
    ---------+----------+---------
       (1)   |          |
             |    (1)   |
       (2)   |          |
       (3)   |          |
       (4)   |          |
             |    (2)   |
             |          |    (1)

When this happens, thread B kicks of an unnecessary read. Worse, thread
C will see UPTODATE set and return immediately, while the read from
thread B is still in progress.  This race could result in tree-checker
errors like this as the extent buffer is concurrently modified:

    BTRFS critical (device dm-0): corrupted node, root=256
    block=8550954455682405139 owner mismatch, have 11858205567642294356
    expect [256, 18446744073709551360]

Fix it by testing UPTODATE again after setting the READING bit, and if
it's been set, skip the unnecessary read.

Fixes: d7172f52e9 ("btrfs: use per-buffer locking for extent_buffer reading")
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Anand Jain
2f1aeab9fc btrfs: return accurate error code on open failure in open_fs_devices()
When attempting to exclusive open a device which has no exclusive open
permission, such as a physical device associated with the flakey dm
device, the open operation will fail, resulting in a mount failure.

In this particular scenario, we erroneously return -EINVAL instead of the
correct error code provided by the bdev_open_by_path() function, which is
-EBUSY.

Fix this, by returning error code from the bdev_open_by_path() function.
With this correction, the mount error message will align with that of
ext4 and xfs.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Johannes Thumshirn
a8b70c7f86 btrfs: zoned: don't skip block groups with 100% zone unusable
Commit f4a9f21941 ("btrfs: do not delete unused block group if it may be
used soon") changed the behaviour of deleting unused block-groups on zoned
filesystems. Starting with this commit, we're using
btrfs_space_info_used() to calculate the number of used bytes in a
space_info. But btrfs_space_info_used() also accounts
btrfs_space_info::bytes_zone_unusable as used bytes.

So if a block group is 100% zone_unusable it is skipped from the deletion
step.

In order not to skip fully zone_unusable block-groups, also check if the
block-group has bytes left that can be used on a zoned filesystem.

Fixes: f4a9f21941 ("btrfs: do not delete unused block group if it may be used soon")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Filipe Manana
2133460061 btrfs: use btrfs_warn() to log message at btrfs_add_extent_mapping()
At btrfs_add_extent_mapping(), if we failed to merge the extent map, which
is unexpected and theoretically should never happen, we use WARN_ONCE() to
log a message which is not great because we don't get information about
which filesystem it relates to in case we have multiple btrfs filesystems
mounted. So change this to use btrfs_warn() and surround the error check
with WARN_ON() so we always get a useful stack trace and the condition is
flagged as "unlikely" since it's not expected to ever happen.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Filipe Manana
379c872393 btrfs: fix message not properly printing interval when adding extent map
At btrfs_add_extent_mapping(), if we are unable to merge the existing
extent map, we print a warning message that suggests interval ranges in
the form "[X, Y)", where the first element is the inclusive start offset
of a range and the second element is the exclusive end offset. However
we end up printing the length of the ranges instead of the exclusive end
offsets. So fix this by printing the range end offsets.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Filipe Manana
4dc1d69c2b btrfs: fix warning messages not printing interval at unpin_extent_range()
At unpin_extent_range() we print warning messages that are supposed to
print an interval in the form "[X, Y)", with the first element being an
inclusive start offset and the second element being the exclusive end
offset of a range. However we end up printing the range's length instead
of the range's exclusive end offset, so fix that to avoid having confusing
and non-sense messages in case we hit one of these unexpected scenarios.

Fixes: 00deaf04df ("btrfs: log messages at unpin_extent_range() during unexpected cases")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:38 +01:00
Filipe Manana
8a565ec04d btrfs: fix extent map leak in unexpected scenario at unpin_extent_cache()
At unpin_extent_cache() if we happen to find an extent map with an
unexpected start offset, we jump to the 'out' label and never release the
reference we added to the extent map through the call to
lookup_extent_mapping(), therefore resulting in a leak. So fix this by
moving the free_extent_map() under the 'out' label.

Fixes: c03c89f821 ("btrfs: handle errors returned from unpin_extent_cache()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:38 +01:00
Anand Jain
9f7eb8405d btrfs: validate device maj:min during open
Boris managed to create a device capable of changing its maj:min without
altering its device path.

Only multi-devices can be scanned. A device that gets scanned and remains
in the btrfs kernel cache might end up with an incorrect maj:min.

Despite the temp-fsid feature patch did not introduce this bug, it could
lead to issues if the above multi-device is converted to a single device
with a stale maj:min. Subsequently, attempting to mount the same device
with the correct maj:min might mistake it for another device with the same
fsid, potentially resulting in wrongly auto-enabling the temp-fsid feature.

To address this, this patch validates the device's maj:min at the time of
device open and updates it if it has changed since the last scan.

CC: stable@vger.kernel.org # 6.7+
Fixes: a5b8a5f9f8 ("btrfs: support cloned-device mount capability")
Reported-by: Boris Burkov <boris@bur.io>
Co-developed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Boris Burkov <boris@bur.io>#
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:38 +01:00
Johannes Thumshirn
1ec17ef591 btrfs: zoned: fix use-after-free in do_zone_finish()
Shinichiro reported the following use-after-free triggered by the device
replace operation in fstests btrfs/070.

 BTRFS info (device nullb1): scrub: finished on devid 1 with status: 0
 ==================================================================
 BUG: KASAN: slab-use-after-free in do_zone_finish+0x91a/0xb90 [btrfs]
 Read of size 8 at addr ffff8881543c8060 by task btrfs-cleaner/3494007

 CPU: 0 PID: 3494007 Comm: btrfs-cleaner Tainted: G        W          6.8.0-rc5-kts #1
 Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
 Call Trace:
  <TASK>
  dump_stack_lvl+0x5b/0x90
  print_report+0xcf/0x670
  ? __virt_addr_valid+0x200/0x3e0
  kasan_report+0xd8/0x110
  ? do_zone_finish+0x91a/0xb90 [btrfs]
  ? do_zone_finish+0x91a/0xb90 [btrfs]
  do_zone_finish+0x91a/0xb90 [btrfs]
  btrfs_delete_unused_bgs+0x5e1/0x1750 [btrfs]
  ? __pfx_btrfs_delete_unused_bgs+0x10/0x10 [btrfs]
  ? btrfs_put_root+0x2d/0x220 [btrfs]
  ? btrfs_clean_one_deleted_snapshot+0x299/0x430 [btrfs]
  cleaner_kthread+0x21e/0x380 [btrfs]
  ? __pfx_cleaner_kthread+0x10/0x10 [btrfs]
  kthread+0x2e3/0x3c0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x31/0x70
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  </TASK>

 Allocated by task 3493983:
  kasan_save_stack+0x33/0x60
  kasan_save_track+0x14/0x30
  __kasan_kmalloc+0xaa/0xb0
  btrfs_alloc_device+0xb3/0x4e0 [btrfs]
  device_list_add.constprop.0+0x993/0x1630 [btrfs]
  btrfs_scan_one_device+0x219/0x3d0 [btrfs]
  btrfs_control_ioctl+0x26e/0x310 [btrfs]
  __x64_sys_ioctl+0x134/0x1b0
  do_syscall_64+0x99/0x190
  entry_SYSCALL_64_after_hwframe+0x6e/0x76

 Freed by task 3494056:
  kasan_save_stack+0x33/0x60
  kasan_save_track+0x14/0x30
  kasan_save_free_info+0x3f/0x60
  poison_slab_object+0x102/0x170
  __kasan_slab_free+0x32/0x70
  kfree+0x11b/0x320
  btrfs_rm_dev_replace_free_srcdev+0xca/0x280 [btrfs]
  btrfs_dev_replace_finishing+0xd7e/0x14f0 [btrfs]
  btrfs_dev_replace_by_ioctl+0x1286/0x25a0 [btrfs]
  btrfs_ioctl+0xb27/0x57d0 [btrfs]
  __x64_sys_ioctl+0x134/0x1b0
  do_syscall_64+0x99/0x190
  entry_SYSCALL_64_after_hwframe+0x6e/0x76

 The buggy address belongs to the object at ffff8881543c8000
  which belongs to the cache kmalloc-1k of size 1024
 The buggy address is located 96 bytes inside of
  freed 1024-byte region [ffff8881543c8000, ffff8881543c8400)

 The buggy address belongs to the physical page:
 page:00000000fe2c1285 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1543c8
 head:00000000fe2c1285 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
 flags: 0x17ffffc0000840(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
 page_type: 0xffffffff()
 raw: 0017ffffc0000840 ffff888100042dc0 ffffea0019e8f200 dead000000000002
 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff8881543c7f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff8881543c7f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffff8881543c8000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                        ^
  ffff8881543c8080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8881543c8100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

This UAF happens because we're accessing stale zone information of a
already removed btrfs_device in do_zone_finish().

The sequence of events is as follows:

btrfs_dev_replace_start
  btrfs_scrub_dev
   btrfs_dev_replace_finishing
    btrfs_dev_replace_update_device_in_mapping_tree <-- devices replaced
    btrfs_rm_dev_replace_free_srcdev
     btrfs_free_device                              <-- device freed

cleaner_kthread
 btrfs_delete_unused_bgs
  btrfs_zone_finish
   do_zone_finish              <-- refers the freed device

The reason for this is that we're using a cached pointer to the chunk_map
from the block group, but on device replace this cached pointer can
contain stale device entries.

The staleness comes from the fact, that btrfs_block_group::physical_map is
not a pointer to a btrfs_chunk_map but a memory copy of it.

Also take the fs_info::dev_replace::rwsem to prevent
btrfs_dev_replace_update_device_in_mapping_tree() from changing the device
underneath us again.

Note: btrfs_dev_replace_update_device_in_mapping_tree() is holding
fs_info::mapping_tree_lock, but as this is a spinning read/write lock we
cannot take it as the call to blkdev_zone_mgmt() requires a memory
allocation which may not sleep.
But btrfs_dev_replace_update_device_in_mapping_tree() is always called with
the fs_info::dev_replace::rwsem held in write mode.

Many thanks to Shinichiro for analyzing the bug.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
CC: stable@vger.kernel.org # 6.8
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:41:01 +01:00
Bill O'Donnell
7cd7bfe593
minix: convert minix to use the new mount api
Convert the minix filesystem to use the new mount API.

Tested using mount and remount on minix device.

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Link: https://lore.kernel.org/r/20240307163325.998723-1-bodonnel@redhat.com
Acked-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26 09:04:55 +01:00
David Howells
78ff640819
vfs: Convert tracefs to use the new mount API
Convert the tracefs filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: forward port to modern kernel, fix remounting]
Link: https://lore.kernel.org/r/536e99d3-345c-448b-adee-a21389d7ab4b@redhat.com
cc: Steven Rostedt <rostedt@goodmis.org>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26 09:04:54 +01:00
David Howells
a20971c187
vfs: Convert debugfs to use the new mount API
Convert the debugfs filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: forward port to modern kernel, fix remounting]
Link: https://lore.kernel.org/r/49d1f108-46e3-443f-85a3-6dd730c5d076@redhat.com
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26 09:04:54 +01:00
Eric Sandeen
8f27829974
openpromfs: finish conversion to the new mount API
The original mount API conversion inexplicably left out the change
from ->remount_fs to ->reconfigure; do that now.

Fixes: 7ab2fa7693 ("vfs: Convert openpromfs to use the new mount API")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Link: https://lore.kernel.org/r/90b968aa-c979-420f-ba37-5acc3391b28f@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26 09:04:54 +01:00
Eric Sandeen
b01e1a6359
freevxfs: Convert freevxfs to the new mount API.
Convert the freevxfs filesystem to the new mount API.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Link: https://lore.kernel.org/r/b0d1a423-4b8e-4bc1-a021-a1078aee915f@redhat.com
Tested-by: Krzysztof Błaszkowski <kb@sysmikro.com.pl>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26 09:04:53 +01:00
Bill O'Donnell
491681d44b
qnx6: convert qnx6 to use the new mount api
Convert the qnx6 filesystem to use the new mount API.

Mostly untested, since there is no qnx6 fs image readily available.
Testing did include parsing of the mmi_fs option.

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Link: https://lore.kernel.org/r/20240302165714.859504-1-bodonnel@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26 09:04:53 +01:00
Kemeng Shi
ac0c18f2c6
fs/writeback: avoid to writeback non-expired inode in kupdate writeback
In kupdate writeback, only expired inode (have been dirty for longer than
dirty_expire_interval) is supposed to be written back. However, kupdate
writeback will writeback non-expired inode left in b_io or b_more_io from
last wb_writeback. As a result, writeback will keep being triggered
unexpected when we keep dirtying pages even dirty memory is under
threshold and inode is not expired. To be more specific:
Assume dirty background threshold is > 1G and dirty_expire_centisecs is
> 60s. When we running fio -size=1G -invalidate=0 -ioengine=libaio
--time_based -runtime=60... (keep dirtying), the writeback will keep
being triggered as following:
wb_workfn
  wb_do_writeback
    wb_check_background_flush
      /*
       * Wb dirty background threshold starts at 0 if device was idle and
       * grows up when bandwidth of wb is updated. So a background
       * writeback is triggered.
       */
      wb_over_bg_thresh
      /*
       * Dirtied inode will be written back and added to b_more_io list
       * after slice used up (because we keep dirtying the inode).
       */
      wb_writeback

Writeback is triggered per dirty_writeback_centisecs as following:
wb_workfn
  wb_do_writeback
    wb_check_old_data_flush
      /*
       * Write back inode left in b_io and b_more_io from last wb_writeback
       * even the inode is non-expired and it will be added to b_more_io
       * again as slice will be used up (because we keep dirtying the
       * inode)
       */
      wb_writeback

Fix this by moving non-expired inode to dirty list instead of more io
list for kupdate writeback in requeue_inode.

Test as following:
/* make it more easier to observe the issue */
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 100 > /proc/sys/vm/dirty_writeback_centisecs
/* create a idle device */
mkfs.ext4 -F /dev/vdb
mount /dev/vdb /bdi1/
/* run buffer write with fio */
fio -name test -filename=/bdi1/file -size=800M -ioengine=libaio -bs=4K \
-iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0

Fio result before fix (run three tests):
1360MB/s
1329MB/s
1455MB/s

Fio result after fix (run three tests):
1737MB/s
1729MB/s
1789MB/s

Writeback for non-expired inode is gone as expeted. Observe this with trace
writeback_start and writeback_written as following:
echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_start/enab
echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_written/enable
cat /sys/kernel/tracing/trace_pipe

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240228091958.288260-2-shikemeng@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26 09:01:18 +01:00
Yang Li
fc253215f8
fs: Add kernel-doc comments to proc_create_net_data_write()
This commit adds kernel-doc style comments with complete parameter
descriptions for the function proc_create_net_data_write.

Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240315073805.77463-1-yang.lee@linux.alibaba.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26 09:01:18 +01:00
Luis Henriques (SUSE)
7b30851a70
fs_parser: move fsparam_string_empty() helper into header
Since both ext4 and overlayfs define the same macro to specify string
parameters that may allow empty values, define it in an header file so
that this helper can be shared.

Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Link: https://lore.kernel.org/r/20240312104757.27333-1-luis.henriques@linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26 09:01:18 +01:00
Kent Overstreet
2a82bb0294
statx: stx_subvol
Add a new statx field for (sub)volume identifiers, as implemented by
btrfs and bcachefs.

This includes bcachefs support; we'll definitely want btrfs support as
well.

Link: https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20240308022914.196982-1-kent.overstreet@linux.dev
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26 09:01:18 +01:00