Commit Graph

78696 Commits

Author SHA1 Message Date
Jakub Kicinski
9c5d03d362 genetlink: start to validate reserved header bytes
We had historically not checked that genlmsghdr.reserved
is 0 on input which prevents us from using those precious
bytes in the future.

One use case would be to extend the cmd field, which is
currently just 8 bits wide and 256 is not a lot of commands
for some core families.

To make sure that new families do the right thing by default
put the onus of opting out of validation on existing families.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com> (NetLabel)
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-29 12:47:15 +01:00
David Howells
9c8b7a293f smb3: fix temporary data corruption in insert range
insert range doesn't discard the affected cached region
so can risk temporarily corrupting file data.

Also includes some minor cleanup (avoiding rereading
inode size repeatedly unnecessarily) to make it clearer.

Cc: stable@vger.kernel.org
Fixes: 7fe6fe95b9 ("cifs: add FALLOC_FL_INSERT_RANGE support")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-28 22:34:08 -05:00
Steve French
fa30a81f25 smb3: fix temporary data corruption in collapse range
collapse range doesn't discard the affected cached region
so can risk temporarily corrupting the file data. This
fixes xfstest generic/031

I also decided to merge a minor cleanup to this into the same patch
(avoiding rereading inode size repeatedly unnecessarily) to make it
clearer.

Cc: stable@vger.kernel.org
Fixes: 5476b5dd82 ("cifs: add support for FALLOC_FL_COLLAPSE_RANGE")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
cc: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-28 22:34:07 -05:00
David Howells
c3a72bb213 smb3: Move the flush out of smb2_copychunk_range() into its callers
Move the flush out of smb2_copychunk_range() into its callers.  This will
allow the pagecache to be invalidated between the flush and the operation
in smb3_collapse_range() and smb3_insert_range().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-28 22:34:07 -05:00
Linus Torvalds
b467192ec7 Merge tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more hotfixes from Andrew Morton:
 "Seventeen hotfixes.  Mostly memory management things.

  Ten patches are cc:stable, addressing pre-6.0 issues"

* tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  .mailmap: update Luca Ceresoli's e-mail address
  mm/mprotect: only reference swap pfn page if type match
  squashfs: don't call kmalloc in decompressors
  mm/damon/dbgfs: avoid duplicate context directory creation
  mailmap: update email address for Colin King
  asm-generic: sections: refactor memory_intersects
  bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem
  ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown
  Revert "memcg: cleanup racy sum avoidance code"
  mm/zsmalloc: do not attempt to free IS_ERR handle
  binder_alloc: add missing mmap_lock calls when using the VMA
  mm: re-allow pinning of zero pfns (again)
  vmcoreinfo: add kallsyms_num_syms symbol
  mailmap: update Guilherme G. Piccoli's email addresses
  writeback: avoid use-after-free after removing device
  shmem: update folio if shmem_replace_page() updates the page
  mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte
2022-08-28 14:49:59 -07:00
Phillip Lougher
1f13dff09f squashfs: don't call kmalloc in decompressors
The decompressors may be called while in an atomic section.  So move the
kmalloc() out of this path, and into the "page actor" init function.

This fixes a regression introduced by commit
f268eedddf ("squashfs: extend "page actor" to handle missing pages")

Link: https://lkml.kernel.org/r/20220822215430.15933-1-phillip@squashfs.org.uk
Fixes: f268eedddf ("squashfs: extend "page actor" to handle missing pages")
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-28 14:02:45 -07:00
Heming Zhao
550842cc60 ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown
After commit 0737e01de9 ("ocfs2: ocfs2_mount_volume does cleanup job
before return error"), any procedure after ocfs2_dlm_init() fails will
trigger crash when calling ocfs2_dlm_shutdown().

ie: On local mount mode, no dlm resource is initialized.  If
ocfs2_mount_volume() fails in ocfs2_find_slot(), error handling will call
ocfs2_dlm_shutdown(), then does dlm resource cleanup job, which will
trigger kernel crash.

This solution should bypass uninitialized resources in
ocfs2_dlm_shutdown().

Link: https://lkml.kernel.org/r/20220815085754.20417-1-heming.zhao@suse.com
Fixes: 0737e01de9 ("ocfs2: ocfs2_mount_volume does cleanup job before return error")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-28 14:02:45 -07:00
Khazhismel Kumykov
f87904c075 writeback: avoid use-after-free after removing device
When a disk is removed, bdi_unregister gets called to stop further
writeback and wait for associated delayed work to complete.  However,
wb_inode_writeback_end() may schedule bandwidth estimation dwork after
this has completed, which can result in the timer attempting to access the
just freed bdi_writeback.

Fix this by checking if the bdi_writeback is alive, similar to when
scheduling writeback work.

Since this requires wb->work_lock, and wb_inode_writeback_end() may get
called from interrupt, switch wb->work_lock to an irqsafe lock.

Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com
Fixes: 45a2966fd6 ("writeback: fix bandwidth estimate for spiky workload")
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-28 14:02:43 -07:00
Linus Torvalds
8379c0b31f Merge tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "Fixes:

   - check that subvolume is writable when changing xattrs from security
     namespace

   - fix memory leak in device lookup helper

   - update generation of hole file extent item when merging holes

   - fix space cache corruption and potential double allocations; this
     is a rare bug but can be serious once it happens, stable backports
     and analysis tool will be provided

   - fix error handling when deleting root references

   - fix crash due to assert when attempting to cancel suspended device
     replace, add message what to do if mount fails due to missing
     replace item

  Regressions:

   - don't merge pages into bio if their page offset is not contiguous

   - don't allow large NOWAIT direct reads, this could lead to short
     reads eg. in io_uring"

* tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: add info when mount fails due to stale replace target
  btrfs: replace: drop assert for suspended replace
  btrfs: fix silent failure when deleting root reference
  btrfs: fix space cache corruption and potential double allocations
  btrfs: don't allow large NOWAIT direct reads
  btrfs: don't merge pages into bio if their page offset is not contiguous
  btrfs: update generation of hole file extent item when merging holes
  btrfs: fix possible memory leak in btrfs_get_dev_args_from_path()
  btrfs: check if root is readonly while setting security xattr
2022-08-28 10:44:04 -07:00
Linus Torvalds
c7bb3fbc1b Merge tag '6.0-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cfis fixes from Steve French:

 - two locking fixes (zero range, punch hole)

 - DFS 9 fix (padding), affecting some servers

 - three minor cleanup changes

* tag '6.0-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Add helper function to check smb1+ server
  cifs: Use help macro to get the mid header size
  cifs: Use help macro to get the header preamble size
  cifs: skip extra NULL byte in filenames
  smb3: missing inode locks in punch hole
  smb3: missing inode locks in zero range
2022-08-28 10:35:16 -07:00
Andreas Gruenbacher
204c0300c4 gfs2: Switch from strlcpy to strscpy
Switch from strlcpy to strscpy and make sure that @count is the size of
the smaller of the source and destination buffers.  This prevents
reading beyond the end of the source buffer when the source string isn't
null terminated.

Found by a modified version of syzkaller.

Suggested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-08-26 15:27:06 +02:00
Bob Peterson
86934198ee gfs2: Clear flags when withdraw prevents xmote
There are a couple places in function do_xmote where normal processing
is circumvented due to withdraws in progress. However, since we bypass
most of do_xmote() we bypass telling dlm to lock the dlm lock, which
means dlm will never respond with a completion callback. Since the
completion callback ordinarily clears GLF_LOCK, this patch changes
function do_xmote to handle those situations more gracefully so the
file system may be unmounted after withdraw.

A very similar situation happens with the GLF_DEMOTE_IN_PROGRESS flag,
which is cleared by function finish_xmote(). Since the withdraw causes
us to skip the majority of do_xmote, it therefore also skips the call
to finish_xmote() so the DEMOTE_IN_PROGRESS flag needs to be cleared
manually.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-08-25 17:11:39 +02:00
Bob Peterson
053640a738 gfs2: Dequeue waiters when withdrawn
When a withdraw occurs, ordinary (not system) glocks may not be granted
anymore. Later, when the file system is unmounted, gfs2_gl_hash_clear()
tries to clear out all the glocks, but these un-grantable pending
waiters prevent some glocks from being freed. So the unmount hangs, at
least for its ten-minute timeout period.

This patch takes measures to remove any pending waiters from
the glocks that will never be granted. This allows the unmount to
proceed in a reasonable period of time.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-08-25 17:11:14 +02:00
Bob Peterson
04133b607a gfs2: Prevent double iput for journal on error
When a gfs2 file system is withdrawn it does iput on its journal to
allow recovery from another cluster node. If it's unable to get a
replacement inode for whatever reason, the journal descriptor would
still be pointing at the evicted inode. So when unmount clears out the
list of journals, it would do a second iput referencing the pointer.
To avoid this, set the inode pointer to NULL.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-08-25 17:10:59 +02:00
Bob Peterson
c412a97cf6 gfs2: Use TRY lock in gfs2_inode_lookup for UNLINKED inodes
Before this patch, delete_work_func() would check for the GLF_DEMOTE
flag on the iopen glock and if set, it would perform special processing.
However, there was a race whereby the GLF_DEMOTE flag could be set by
another process after the check. Then when it called
gfs2_lookup_by_inum() which calls gfs2_inode_lookup(), it tried to lock
the iopen glock in SH mode, but the GLF_DEMOTE flag prevented the
request from being granted. But the iopen glock could never be demoted
because that happens when the inode is evicted, and the evict was never
completed because of the failed lookup.

To fix that, change function gfs2_inode_lookup() so that when
GFS2_BLKST_UNLINKED inodes are searched, it uses the LM_FLAG_TRY flag
for the iopen glock.  If the locking request fails, fail
gfs2_inode_lookup() with -EAGAIN so that delete_work_func() can retry
the operation later.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-08-25 15:25:16 +02:00
Zhang Xiaoxu
d291e703f4 cifs: Add helper function to check smb1+ server
SMB1 server's header_preamble_size is not 0, add use is_smb1 function
to simplify the code, no actual functional changes.

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-24 22:30:09 -05:00
Zhang Xiaoxu
b6b3624d01 cifs: Use help macro to get the mid header size
It's better to use MID_HEADER_SIZE because the unfolded expression
too long. No actual functional changes, minor readability improvement.

Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-24 22:30:04 -05:00
Zhang Xiaoxu
9789de8bdc cifs: Use help macro to get the header preamble size
It's better to use HEADER_PREAMBLE_SIZE because the unfolded expression
too long. No actual functional changes, minor readability improvement.

Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-24 22:29:59 -05:00
Yosry Ahmed
ebc97a52b5 mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.
We keep track of several kernel memory stats (total kernel memory, page
tables, stack, vmalloc, etc) on multiple levels (global, per-node,
per-memcg, etc). These stats give insights to users to how much memory
is used by the kernel and for what purposes.

Currently, memory used by KVM mmu is not accounted in any of those
kernel memory stats. This patch series accounts the memory pages
used by KVM for page tables in those stats in a new
NR_SECONDARY_PAGETABLE stat. This stat can be later extended to account
for other types of secondary pages tables (e.g. iommu page tables).

KVM has a decent number of large allocations that aren't for page
tables, but for most of them, the number/size of those allocations
scales linearly with either the number of vCPUs or the amount of memory
assigned to the VM. KVM's secondary page table allocations do not scale
linearly, especially when nested virtualization is in use.

From a KVM perspective, NR_SECONDARY_PAGETABLE will scale with KVM's
per-VM pages_{4k,2m,1g} stats unless the guest is doing something
bizarre (e.g. accessing only 4kb chunks of 2mb pages so that KVM is
forced to allocate a large number of page tables even though the guest
isn't accessing that much memory). However, someone would need to either
understand how KVM works to make that connection, or know (or be told) to
go look at KVM's stats if they're running VMs to better decipher the stats.

Furthermore, having NR_PAGETABLE side-by-side with NR_SECONDARY_PAGETABLE
is informative. For example, when backing a VM with THP vs. HugeTLB,
NR_SECONDARY_PAGETABLE is roughly the same, but NR_PAGETABLE is an order
of magnitude higher with THP. So having this stat will at the very least
prove to be useful for understanding tradeoffs between VM backing types,
and likely even steer folks towards potential optimizations.

The original discussion with more details about the rationale:
https://lore.kernel.org/all/87ilqoi77b.wl-maz@kernel.org

This stat will be used by subsequent patches to count KVM mmu
memory usage.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220823004639.2387269-2-yosryahmed@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-08-24 13:51:42 -07:00
Paulo Alcantara
a1d2eb51f0 cifs: skip extra NULL byte in filenames
Since commit:
 cifs: alloc_path_with_tree_prefix: do not append sep. if the path is empty
alloc_path_with_tree_prefix() function was no longer including the
trailing separator when @path is empty, although @out_len was still
assuming a path separator thus adding an extra byte to the final
filename.

This has caused mount issues in some Synology servers due to the extra
NULL byte in filenames when sending SMB2_CREATE requests with
SMB2_FLAGS_DFS_OPERATIONS set.

Fix this by checking if @path is not empty and then add extra byte for
separator.  Also, do not include any trailing NULL bytes in filename
as MS-SMB2 requires it to be 8-byte aligned and not NULL terminated.

Cc: stable@vger.kernel.org
Fixes: 7eacba3b00 ("cifs: alloc_path_with_tree_prefix: do not append sep. if the path is empty")
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-24 12:22:24 -05:00
Linus Torvalds
062d26ad0b Merge tag 'fs.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull file_remove_privs() fix from Christian Brauner:
 "As part of Stefan's and Jens' work to add async buffered write
  support to xfs we refactored file_remove_privs() and added
  __file_remove_privs() to avoid calling __remove_privs() when
  IOCB_NOWAIT is passed.

  While debugging a recent performance regression report I found that
  during review we missed that commit faf99b5635 ("fs: add
  __remove_file_privs() with flags parameter") accidently changed
  behavior when dentry_needs_remove_privs() returns zero.

  Before the commit it would still call inode_has_no_xattr() setting
  the S_NOSEC bit and thereby avoiding even calling into
  dentry_needs_remove_privs() the next time this function is called.
  After that commit inode_has_no_xattr() would only be called if
  __remove_privs() had to be called.

  Restore the old behavior. This is likely the cause of the performance
  regression"

* tag 'fs.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  fs: __file_remove_privs(): restore call to inode_has_no_xattr()
2022-08-23 19:17:26 -07:00
Linus Torvalds
95607ad99b Merge tag 'mm-hotfixes-stable-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "Thirteen fixes, almost all for MM.

  Seven of these are cc:stable and the remainder fix up the changes
  which went into this -rc cycle"

* tag 'mm-hotfixes-stable-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  kprobes: don't call disarm_kprobe() for disabled kprobes
  mm/shmem: shmem_replace_page() remember NR_SHMEM
  mm/shmem: tmpfs fallocate use file_modified()
  mm/shmem: fix chattr fsflags support in tmpfs
  mm/hugetlb: support write-faults in shared mappings
  mm/hugetlb: fix hugetlb not supporting softdirty tracking
  mm/uffd: reset write protection when unregister with wp-mode
  mm/smaps: don't access young/dirty bit if pte unpresent
  mm: add DEVICE_ZONE to FOR_ALL_ZONES
  kernel/sys_ni: add compat entry for fadvise64_64
  mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  Revert "zram: remove double compression logic"
  get_maintainer: add Alan to .get_maintainer.ignore
2022-08-23 13:33:08 -07:00
Anand Jain
f2c3bec215 btrfs: add info when mount fails due to stale replace target
If the replace target device reappears after the suspended replace is
cancelled, it blocks the mount operation as it can't find the matching
replace-item in the metadata. As shown below,

   BTRFS error (device sda5): replace devid present without an active replace item

To overcome this situation, the user can run the command

   btrfs device scan --forget <replace target device>

and try the mount command again. And also, to avoid repeating the issue,
superblock on the devid=0 must be wiped.

   wipefs -a device-path-to-devid=0.

This patch adds some info when this situation occurs.

Reported-by: Samuel Greiner <samuel@balkonien.org>
Link: https://lore.kernel.org/linux-btrfs/b4f62b10-b295-26ea-71f9-9a5c9299d42c@balkonien.org/T/
CC: stable@vger.kernel.org # 5.0+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 22:15:21 +02:00
Anand Jain
59a3991984 btrfs: replace: drop assert for suspended replace
If the filesystem mounts with the replace-operation in a suspended state
and try to cancel the suspended replace-operation, we hit the assert. The
assert came from the commit fe97e2e173 ("btrfs: dev-replace: replace's
scrub must not be running in suspended state") that was actually not
required. So just remove it.

 $ mount /dev/sda5 /btrfs

    BTRFS info (device sda5): cannot continue dev_replace, tgtdev is missing
    BTRFS info (device sda5): you may cancel the operation after 'mount -o degraded'

 $ mount -o degraded /dev/sda5 /btrfs <-- success.

 $ btrfs replace cancel /btrfs

    kernel: assertion failed: ret != -ENOTCONN, in fs/btrfs/dev-replace.c:1131
    kernel: ------------[ cut here ]------------
    kernel: kernel BUG at fs/btrfs/ctree.h:3750!

After the patch:

 $ btrfs replace cancel /btrfs

    BTRFS info (device sda5): suspended dev_replace from /dev/sda5 (devid 1) to <missing disk> canceled

Fixes: fe97e2e173 ("btrfs: dev-replace: replace's scrub must not be running in suspended state")
CC: stable@vger.kernel.org # 5.0+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 22:15:21 +02:00
Filipe Manana
47bf225a8d btrfs: fix silent failure when deleting root reference
At btrfs_del_root_ref(), if btrfs_search_slot() returns an error, we end
up returning from the function with a value of 0 (success). This happens
because the function returns the value stored in the variable 'err',
which is 0, while the error value we got from btrfs_search_slot() is
stored in the 'ret' variable.

So fix it by setting 'err' with the error value.

Fixes: 8289ed9f93 ("btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 22:15:21 +02:00
Omar Sandoval
ced8ecf026 btrfs: fix space cache corruption and potential double allocations
When testing space_cache v2 on a large set of machines, we encountered a
few symptoms:

1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free
   space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree
   and also marked as free in the free space tree, ranges that were
   marked as allocated twice in the extent tree, or ranges that were
   marked as free twice in the free space tree. If the latter made it
   onto disk, the next reboot would hit the BUG_ON() in
   add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the
   in-memory space cache (dumped with drgn) disagreed with the free
   space tree.

All of these symptoms have the same underlying cause: a race between
caching the free space for a block group and returning free space to the
in-memory space cache for pinned extents causes us to double-add a free
range to the space cache. This race exists when free space is cached
from the free space tree (space_cache=v2) or the extent tree
(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct
btrfs_block_group::progress are supposed to protect against this race,
but commit d0c2f4fa55 ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") subtly broke this by allowing
multiple transactions to be unpinning extents at the same time.

Specifically, the race is as follows:

1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
   ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
   add_to_free_space_tree() adds the deleted extent back to the free
   space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
   btrfs_cache_block_group() queues up the block group to get cached.
   block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls
   switch_commit_roots(). It sets block_group->last_byte_to_unpin to
   block_group->progress, which is block_group->start because the block
   group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots
   were already switched, load_free_space_tree() sees the deleted extent
   as free and adds it to the space cache. It finishes caching and sets
   block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to
   TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since
   transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
   commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls
    switch_commit_roots(). This time, the block group has already been
    cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls
    btrfs_finish_extent_commit(), which calls unpin_extent_range() for
    the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
    transaction B!), so it adds the deleted extent to the space cache
    again!

This explains all of our symptoms above:

* If the sequence of events is exactly as described above, when the free
  space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7
  and 11, then step 11 will silently re-add that space to the space
  cache as free even though it is actually allocated. Then, if that
  space is allocated *again*, the free space tree will be corrupted
  (namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue
  to get worse as extents are deleted and reallocated.

The v1 space_cache is synchronously loaded when an extent is deleted
(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
with load_cache_only=1), so it is not normally affected by this bug.
However, as noted above, if we fail to load the space cache, we will
fall back to caching from the extent tree and may hit this bug.

The easiest fix for this race is to also make caching from the free
space tree or extent tree synchronous. Josef tested this and found no
performance regressions.

A few extra changes fall out of this change. Namely, this fix does the
following, with step 2 being the crucial fix:

1. Factor btrfs_caching_ctl_wait_done() out of
   btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
   that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of
   btrfs_wait_space_cache_v1_finished() to
   btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
   space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
   space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
   `bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to
   btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not
   used outside of block-group.c anymore.

Fixes: d0c2f4fa55 ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 22:13:54 +02:00
Alexander Aring
56171e0db2 fs: dlm: const void resource name parameter
The resource name parameter should never be changed by DLM so we declare
it as const. At some point it is handled as a char pointer, a resource
name can be a non printable ascii string as well. This patch change it
to handle it as void pointer as it is offered by DLM API.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 15:02:47 -05:00
Alexander Aring
9cb16d4271 fs: dlm: LSFL_CB_DELAY only for kernel lockspaces
This patch only set/clear the LSFL_CB_DELAY bit when it's actually a
kernel lockspace signaled by if ls->ls_callback_wq is set or not set in
this case. User lockspaces will never evaluate this flag.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 15:02:46 -05:00
Alexander Aring
12cda13cfd fs: dlm: remove DLM_LSFL_FS from uapi
The DLM_LSFL_FS flag is set in lockspaces created directly
for a kernel user, as opposed to those lockspaces created
for user space applications.  The user space libdlm allowed
this flag to be set for lockspaces created from user space,
but then used by a kernel user.  No kernel user has ever
used this method, so remove the ability to do it.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 14:54:54 -05:00
Alexander Aring
7a3de7324c fs: dlm: trace user space callbacks
This patch adds trace callbacks for user locks. Unfortenately user locks
are handled in a different way than kernel locks in some cases. User
locks never call the dlm_lock()/dlm_unlock() kernel API and use the next
step internal API of dlm. Adding those traces from user API callers
should make it possible for dlm trace system to see lock handling for
user locks as well.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 14:54:04 -05:00
Alexander Aring
296d9d1e98 fs: dlm: change ls_clear_proc_locks to spinlock
This patch changes the ls_clear_proc_locks to a spinlock because there
is no need to handle it as a mutex as there is no sleepable context when
ls_clear_proc_locks is held. This allows us to call those functionality
in non-sleepable contexts.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 14:54:02 -05:00
Alexander Aring
e152c38dc0 fs: dlm: remove dlm_del_ast prototype
This patch removes dlm_del_ast() prototype which is not being used in
the dlm subsystem because there is not implementation for it.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 14:54:01 -05:00
Alexander Aring
f45307d395 fs: dlm: handle rcom in else if branch
Currently we handle in dlm_receive_buffer() everything else than a
DLM_MSG type as DLM_RCOM message. Although a different message than
DLM_MSG should be a DLM_RCOM we should explicit check on DLM_RCOM and
drop a log_error() if we see something unexpected.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 14:53:59 -05:00
Alexander Aring
b5c9d37c7f fs: dlm: allow lockspaces have zero lvblen
A dlm user may not use the DLM_LKF_VALBLK flag in the DLM API,
so a zero lvblen should be allowed as a lockspace parameter.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 14:53:16 -05:00
Alexander Aring
7175e131eb fs: dlm: fix invalid derefence of sb_lvbptr
I experience issues when putting a lkbsb on the stack and have sb_lvbptr
field to a dangled pointer while not using DLM_LKF_VALBLK. It will crash
with the following kernel message, the dangled pointer is here
0xdeadbeef as example:

[  102.749317] BUG: unable to handle page fault for address: 00000000deadbeef
[  102.749320] #PF: supervisor read access in kernel mode
[  102.749323] #PF: error_code(0x0000) - not-present page
[  102.749325] PGD 0 P4D 0
[  102.749332] Oops: 0000 [#1] PREEMPT SMP PTI
[  102.749336] CPU: 0 PID: 1567 Comm: lock_torture_wr Tainted: G        W         5.19.0-rc3+ #1565
[  102.749343] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-2.module+el8.7.0+15506+033991b0 04/01/2014
[  102.749344] RIP: 0010:memcpy_erms+0x6/0x10
[  102.749353] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[  102.749355] RSP: 0018:ffff97a58145fd08 EFLAGS: 00010202
[  102.749358] RAX: ffff901778b77070 RBX: 0000000000000000 RCX: 0000000000000040
[  102.749360] RDX: 0000000000000040 RSI: 00000000deadbeef RDI: ffff901778b77070
[  102.749362] RBP: ffff97a58145fd10 R08: ffff901760b67a70 R09: 0000000000000001
[  102.749364] R10: ffff9017008e2cb8 R11: 0000000000000001 R12: ffff901760b67a70
[  102.749366] R13: ffff901760b78f00 R14: 0000000000000003 R15: 0000000000000001
[  102.749368] FS:  0000000000000000(0000) GS:ffff901876e00000(0000) knlGS:0000000000000000
[  102.749372] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  102.749374] CR2: 00000000deadbeef CR3: 000000017c49a004 CR4: 0000000000770ef0
[  102.749376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  102.749378] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  102.749379] PKRU: 55555554
[  102.749381] Call Trace:
[  102.749382]  <TASK>
[  102.749383]  ? send_args+0xb2/0xd0
[  102.749389]  send_common+0xb7/0xd0
[  102.749395]  _unlock_lock+0x2c/0x90
[  102.749400]  unlock_lock.isra.56+0x62/0xa0
[  102.749405]  dlm_unlock+0x21e/0x330
[  102.749411]  ? lock_torture_stats+0x80/0x80 [dlm_locktorture]
[  102.749416]  torture_unlock+0x5a/0x90 [dlm_locktorture]
[  102.749419]  ? preempt_count_sub+0xba/0x100
[  102.749427]  lock_torture_writer+0xbd/0x150 [dlm_locktorture]
[  102.786186]  kthread+0x10a/0x130
[  102.786581]  ? kthread_complete_and_exit+0x20/0x20
[  102.787156]  ret_from_fork+0x22/0x30
[  102.787588]  </TASK>
[  102.787855] Modules linked in: dlm_locktorture torture rpcsec_gss_krb5 intel_rapl_msr intel_rapl_common kvm_intel iTCO_wdt iTCO_vendor_support kvm vmw_vsock_virtio_transport qxl irqbypass vmw_vsock_virtio_transport_common drm_ttm_helper crc32_pclmul joydev crc32c_intel ttm vsock virtio_scsi virtio_balloon snd_pcm drm_kms_helper virtio_console snd_timer snd drm soundcore syscopyarea i2c_i801 sysfillrect sysimgblt i2c_smbus pcspkr fb_sys_fops lpc_ich serio_raw
[  102.792536] CR2: 00000000deadbeef
[  102.792930] ---[ end trace 0000000000000000 ]---

This patch fixes the issue by checking also on DLM_LKF_VALBLK on exflags
is set when copying the lvbptr array instead of if it's just null which
fixes for me the issue.

I think this patch can fix other dlm users as well, depending how they
handle the init, freeing memory handling of sb_lvbptr and don't set
DLM_LKF_VALBLK for some dlm_lock() calls. It might a there could be a
hidden issue all the time. However with checking on DLM_LKF_VALBLK the
user always need to provide a sb_lvbptr non-null value. There might be
more intelligent handling between per ls lvblen, DLM_LKF_VALBLK and
non-null to report the user the way how DLM API is used is wrong but can
be added for later, this will only fix the current behaviour.

Cc: stable@vger.kernel.org
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 14:52:26 -05:00
Alexander Aring
9ac8ba46a7 fs: dlm: handle -EINVAL as log_error()
If the user generates -EINVAL it's probably because they are
using DLM incorrectly.  Change the log level to make these
errors more visible.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 14:49:09 -05:00
Alexander Aring
c2d76a62d8 fs: dlm: use __func__ for function name
Avoid hard-coded function names inside message format strings.
(Prevents checkpatch warnings.)

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 14:46:49 -05:00
Alexander Aring
420ba3cd03 fs: dlm: handle -EBUSY first in unlock validation
This patch checks for -EBUSY conditions in dlm_unlock() before
checking for -EINVAL conditions (except for CANCEL and
FORCEUNLOCK calls where a busy condition is expected.)
There are no problems with the current ordering of checks,
but this makes dlm_unlock() consistent with dlm_lock(), and
may avoid future problems if other checks are added.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 14:42:32 -05:00
Alexander Aring
44637ca41d fs: dlm: handle -EBUSY first in lock arg validation
During lock arg validation, first check for -EBUSY cases, then for
-EINVAL cases. The -EINVAL checks look at lkb state variables
which are not stable when an lkb is busy and would cause an
-EBUSY result, e.g. lkb->lkb_grmode.

Cc: stable@vger.kernel.org
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 14:39:51 -05:00
Alexander Aring
eef6ec9bf3 fs: dlm: fix race between test_bit() and queue_work()
This patch fixes a race by using ls_cb_mutex around the bit
operations and conditional code blocks for LSFL_CB_DELAY.

The function dlm_callback_stop() expects to stop all callbacks and
flush all currently queued onces. The set_bit() is not enough because
there can still be queue_work() after the workqueue was flushed.
To avoid queue_work() after set_bit(), surround both by ls_cb_mutex.

Cc: stable@vger.kernel.org
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 14:37:14 -05:00
Alexander Aring
30ea3257e8 fs: dlm: fix race in lowcomms
This patch fixes a race between queue_work() in
_dlm_lowcomms_commit_msg() and srcu_read_unlock(). The queue_work() can
take the final reference of a dlm_msg and so msg->idx can contain
garbage which is signaled by the following warning:

[  676.237050] ------------[ cut here ]------------
[  676.237052] WARNING: CPU: 0 PID: 1060 at include/linux/srcu.h:189 dlm_lowcomms_commit_msg+0x41/0x50
[  676.238945] Modules linked in: dlm_locktorture torture rpcsec_gss_krb5 intel_rapl_msr intel_rapl_common iTCO_wdt iTCO_vendor_support qxl kvm_intel drm_ttm_helper vmw_vsock_virtio_transport kvm vmw_vsock_virtio_transport_common ttm irqbypass crc32_pclmul joydev crc32c_intel serio_raw drm_kms_helper vsock virtio_scsi virtio_console virtio_balloon snd_pcm drm syscopyarea sysfillrect sysimgblt snd_timer fb_sys_fops i2c_i801 lpc_ich snd i2c_smbus soundcore pcspkr
[  676.244227] CPU: 0 PID: 1060 Comm: lock_torture_wr Not tainted 5.19.0-rc3+ #1546
[  676.245216] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-2.module+el8.7.0+15506+033991b0 04/01/2014
[  676.246460] RIP: 0010:dlm_lowcomms_commit_msg+0x41/0x50
[  676.247132] Code: fe ff ff ff 75 24 48 c7 c6 bd 0f 49 bb 48 c7 c7 38 7c 01 bd e8 00 e7 ca ff 89 de 48 c7 c7 60 78 01 bd e8 42 3d cd ff 5b 5d c3 <0f> 0b eb d8 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
[  676.249253] RSP: 0018:ffffa401c18ffc68 EFLAGS: 00010282
[  676.249855] RAX: 0000000000000001 RBX: 00000000ffff8b76 RCX: 0000000000000006
[  676.250713] RDX: 0000000000000000 RSI: ffffffffbccf3a10 RDI: ffffffffbcc7b62e
[  676.251610] RBP: ffffa401c18ffc70 R08: 0000000000000001 R09: 0000000000000001
[  676.252481] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000005
[  676.253421] R13: ffff8b76786ec370 R14: ffff8b76786ec370 R15: ffff8b76786ec480
[  676.254257] FS:  0000000000000000(0000) GS:ffff8b7777800000(0000) knlGS:0000000000000000
[  676.255239] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  676.255897] CR2: 00005590205d88b8 CR3: 000000017656c003 CR4: 0000000000770ee0
[  676.256734] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  676.257567] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  676.258397] PKRU: 55555554
[  676.258729] Call Trace:
[  676.259063]  <TASK>
[  676.259354]  dlm_midcomms_commit_mhandle+0xcc/0x110
[  676.259964]  queue_bast+0x8b/0xb0
[  676.260423]  grant_pending_locks+0x166/0x1b0
[  676.261007]  _unlock_lock+0x75/0x90
[  676.261469]  unlock_lock.isra.57+0x62/0xa0
[  676.262009]  dlm_unlock+0x21e/0x330
[  676.262457]  ? lock_torture_stats+0x80/0x80 [dlm_locktorture]
[  676.263183]  torture_unlock+0x5a/0x90 [dlm_locktorture]
[  676.263815]  ? preempt_count_sub+0xba/0x100
[  676.264361]  ? complete+0x1d/0x60
[  676.264777]  lock_torture_writer+0xb8/0x150 [dlm_locktorture]
[  676.265555]  kthread+0x10a/0x130
[  676.266007]  ? kthread_complete_and_exit+0x20/0x20
[  676.266616]  ret_from_fork+0x22/0x30
[  676.267097]  </TASK>
[  676.267381] irq event stamp: 9579855
[  676.267824] hardirqs last  enabled at (9579863): [<ffffffffbb14e6f8>] __up_console_sem+0x58/0x60
[  676.268896] hardirqs last disabled at (9579872): [<ffffffffbb14e6dd>] __up_console_sem+0x3d/0x60
[  676.270008] softirqs last  enabled at (9579798): [<ffffffffbc200349>] __do_softirq+0x349/0x4c7
[  676.271438] softirqs last disabled at (9579897): [<ffffffffbb0d54c0>] irq_exit_rcu+0xb0/0xf0
[  676.272796] ---[ end trace 0000000000000000 ]---

I reproduced this warning with dlm_locktorture test which is currently
not upstream. However this patch fix the issue by make a additional
refcount between dlm_lowcomms_new_msg() and dlm_lowcomms_commit_msg().
In case of the race the kref_put() in dlm_lowcomms_commit_msg() will be
the final put.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23 14:26:37 -05:00
David Howells
ba0803050d smb3: missing inode locks in punch hole
smb3 fallocate punch hole was not grabbing the inode or filemap_invalidate
locks so could have race with pagemap reinstantiating the page.

Cc: stable@vger.kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-23 03:15:27 -05:00
David Howells
c919c164fc smb3: missing inode locks in zero range
smb3 fallocate zero range was not grabbing the inode or filemap_invalidate
locks so could have race with pagemap reinstantiating the page.

Cc: stable@vger.kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-23 01:09:08 -05:00
Eric Biggers
272ac15003 fscrypt: remove fscrypt_set_test_dummy_encryption()
Now that all its callers have been converted to
fscrypt_parse_test_dummy_encryption() and fscrypt_add_test_dummy_key()
instead, fscrypt_set_test_dummy_encryption() can be removed.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220513231605.175121-6-ebiggers@kernel.org
2022-08-22 13:47:51 -07:00
Linus Torvalds
072e51356c Merge tag 'nfs-for-5.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Stable fixes:
   - NFS: Fix another fsync() issue after a server reboot

  Bugfixes:
   - NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
   - NFS: Fix missing unlock in nfs_unlink()
   - Add sanity checking of the file type used by __nfs42_ssc_open
   - Fix a case where we're failing to set task->tk_rpc_status

  Cleanups:
   - Remove the NFS_CONTEXT_RESEND_WRITES flag that got obsoleted by the
     fsync() fix"

* tag 'nfs-for-5.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: RPC level errors should set task->tk_rpc_status
  NFSv4.2 fix problems with __nfs42_ssc_open
  NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
  NFS: Cleanup to remove unused flag NFS_CONTEXT_RESEND_WRITES
  NFS: Remove a bogus flag setting in pnfs_write_done_resend_to_mds
  NFS: Fix another fsync() issue after a server reboot
  NFS: Fix missing unlock in nfs_unlink()
2022-08-22 11:40:01 -07:00
Linus Torvalds
d3cd67d671 Merge tag 'fs.idmapped.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull idmapping fixes from Christian Brauner:

 - Since Seth joined as co-maintainer for idmapped mounts we decided to
   use a shared git tree. Konstantin suggested we use vfs/idmapping.git
   on kernel.org under the vfs/ namespace. So this updates the tree in
   the maintainers file.

 - Ensure that POSIX ACLs checking, getting, and setting works correctly
   for filesystems mountable with a filesystem idmapping that want to
   support idmapped mounts.

   Since no filesystems mountable with an fs_idmapping do yet support
   idmapped mounts there is no problem. But this could change in the
   future, so add a check to refuse to create idmapped mounts when the
   mounter is not privileged over the mount's idmapping.

 - Check that caller is privileged over the idmapping that will be
   attached to a mount.

   Currently no FS_USERNS_MOUNT filesystems support idmapped mounts,
   thus this is not a problem as only CAP_SYS_ADMIN in init_user_ns is
   allowed to set up idmapped mounts. But this could change in the
   future, so add a check to refuse to create idmapped mounts when the
   mounter is not privileged over the mount's idmapping.

 - Fix POSIX ACLs for ntfs3. While looking at our current POSIX ACL
   handling in the context of some overlayfs work I went through a range
   of other filesystems checking how they handle them currently and
   encountered a few bugs in ntfs3.

   I've sent this some time ago and the fixes haven't been picked up
   even though the pull request for other ntfs3 fixes got sent after.
   This should really be fixed as right now POSIX ACLs are broken in
   certain circumstances for ntfs3.

* tag 'fs.idmapped.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  ntfs: fix acl handling
  fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts
  MAINTAINERS: update idmapping tree
  acl: handle idmapped mounts for idmapped filesystems
2022-08-22 11:33:02 -07:00
Linus Torvalds
b20ee4813f Merge tag 'filelock-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking fix from Jeff Layton:
 "Just a single patch for a bugfix in the flock() codepath, introduced
  by a patch that went in recently"

* tag 'filelock-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  locks: Fix dropped call to ->fl_release_private()
2022-08-22 10:40:09 -07:00
Josef Bacik
79d3d1d12e btrfs: don't allow large NOWAIT direct reads
Dylan and Jens reported a problem where they had an io_uring test that
was returning short reads, and bisected it to ee5b46a353 ("btrfs:
increase direct io read size limit to 256 sectors").

The root cause is their test was doing larger reads via io_uring with
NOWAIT and async.  This was triggering a page fault during the direct
read, however the first page was able to work just fine and thus we
submitted a 4k read for a larger iocb.

Btrfs allows for partial IO's in this case specifically because we don't
allow page faults, and thus we'll attempt to do any io that we can,
submit what we could, come back and fault in the rest of the range and
try to do the remaining IO.

However for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
partial dio is done, which is incorrect.  In the sync case we can exit
the iomap code, submit more io's, and return with the amount of IO we
were able to complete successfully.

We were always doing short reads in this case, but for NOWAIT we were
getting saved by the fact that we were limiting direct reads to
sectorsize, and if we were larger than that we would return EAGAIN.

Fix the regression by simply returning EAGAIN in the NOWAIT case with
larger reads, that way io_uring can retry and get the larger IO and have
the fault logic handle everything properly.

This still leaves the AIO short read case, but that existed before this
change.  The way to properly fix this would be to handle partial iocb
completions, but that's a lot of work, for now deal with the regression
in the most straightforward way possible.

Reported-by: Dylan Yudaken <dylany@fb.com>
Fixes: ee5b46a353 ("btrfs: increase direct io read size limit to 256 sectors")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-22 18:08:07 +02:00
Qu Wenruo
4a445b7b61 btrfs: don't merge pages into bio if their page offset is not contiguous
[BUG]
Zygo reported on latest development branch, he could hit
ASSERT()/BUG_ON() caused crash when doing RAID5 recovery (intentionally
corrupt one disk, and let btrfs to recover the data during read/scrub).

And The following minimal reproducer can cause extent state leakage at
rmmod time:

  mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
  mount $dev1 $mnt
  fsstress -w -d $mnt -n 25 -s 1660807876
  sync
  fssum -A -f -w /tmp/fssum.saved $mnt
  umount $mnt

  # Wipe the dev1 but keeps its super block
  xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
  mount $dev1 $mnt
  fssum -r /tmp/fssum.saved $mnt > /dev/null
  umount $mnt
  rmmod btrfs

This will lead to the following extent states leakage:

  BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1
  BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1
  BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1
  BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1
  BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1
  BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1
  BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1
  BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1

[CAUSE]
Since commit 7aa51232e2 ("btrfs: pass a btrfs_bio to
btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to
determine the file offset of a page.

But that usage assume that, one bio has all its page having a continuous
page offsets.

Unfortunately that's not true, btrfs only requires the logical bytenr
contiguous when assembling its bios.

From above script, we have one bio looks like this:

  fssum-27671  submit_one_bio: bio logical=217739264 len=36864
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=466944 <<<
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=724992 <<<
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=729088
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=733184
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=737280
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=741376
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=745472
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=749568
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=753664

Note that the 1st and the 2nd page has non-contiguous page offsets.

This means, at repair time, we will have completely wrong file offset
passed in:

   kworker/u32:2-19927  btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192

Since the file offset is incorrect, we latter incorrectly set the extent
states, and no way to really release them.

Thus later it causes the leakage.

In fact, this can be even worse, since the file offset is incorrect, we
can hit cases like the incorrect file offset belongs to a HOLE, and
later cause btrfs_num_copies() to trigger error, finally hit
BUG_ON()/ASSERT() later.

[FIX]
Add an extra condition in btrfs_bio_add_page() for uncompressed IO.

Now we will have more strict requirement for bio pages:

- They should all have the same mapping
  (the mapping check is already implied by the call chain)

- Their logical bytenr should be adjacent
  This is the same as the old condition.

- Their page_offset() (file offset) should be adjacent
  This is the new check.
  This would result a slightly increased amount of bios from btrfs
  (needs holes and inside the same stripe boundary to trigger).

  But this would greatly reduce the confusion, as it's pretty common
  to assume a btrfs bio would only contain continuous page cache.

Later we may need extra cleanups, as we no longer needs to handle gaps
between page offsets in endio functions.

Currently this should be the minimal patch to fix commit 7aa51232e2
("btrfs: pass a btrfs_bio to btrfs_repair_one_sector").

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 7aa51232e2 ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-22 18:06:58 +02:00
Filipe Manana
e6e3dec6c3 btrfs: update generation of hole file extent item when merging holes
When punching a hole into a file range that is adjacent with a hole and we
are not using the no-holes feature, we expand the range of the adjacent
file extent item that represents a hole, to save metadata space.

However we don't update the generation of hole file extent item, which
means a full fsync will not log that file extent item if the fsync happens
in a later transaction (since commit 7f30c07288 ("btrfs: stop copying
old file extents when doing a full fsync")).

For example, if we do this:

    $ mkfs.btrfs -f -O ^no-holes /dev/sdb
    $ mount /dev/sdb /mnt
    $ xfs_io -f -c "pwrite -S 0xab 2M 2M" /mnt/foobar
    $ sync

We end up with 2 file extent items in our file:

1) One that represents the hole for the file range [0, 2M), with a
   generation of 7;

2) Another one that represents an extent covering the range [2M, 4M).

After that if we do the following:

    $ xfs_io -c "fpunch 2M 2M" /mnt/foobar

We end up with a single file extent item in the file, which represents a
hole for the range [0, 4M) and with a generation of 7 - because we end
dropping the data extent for range [2M, 4M) and then update the file
extent item that represented the hole at [0, 2M), by increasing
length from 2M to 4M.

Then doing a full fsync and power failing:

    $ xfs_io -c "fsync" /mnt/foobar
    <power failure>

will result in the full fsync not logging the file extent item that
represents the hole for the range [0, 4M), because its generation is 7,
which is lower than the generation of the current transaction (8).
As a consequence, after mounting again the filesystem (after log replay),
the region [2M, 4M) does not have a hole, it still points to the
previous data extent.

So fix this by always updating the generation of existing file extent
items representing holes when we merge/expand them. This solves the
problem and it's the same approach as when we merge prealloc extents that
got written (at btrfs_mark_extent_written()). Setting the generation to
the current transaction's generation is also what we do when merging
the new hole extent map with the previous one or the next one.

A test case for fstests, covering both cases of hole file extent item
merging (to the left and to the right), will be sent soon.

Fixes: 7f30c07288 ("btrfs: stop copying old file extents when doing a full fsync")
CC: stable@vger.kernel.org # 5.18+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-22 18:06:42 +02:00