Commit Graph

91022 Commits

Author SHA1 Message Date
Josef Bacik
7f6af7c434 btrfs: remove the btrfs_delayed_ref_node container helpers
Now that we don't use these helpers anywhere, remove them.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
efc7d5dbf8 btrfs: stop referencing btrfs_delayed_tree_ref directly
We only ever need to use this to get the level of the tree block ref, so
use the btrfs_delayed_ref_owner() helper, which returns the level for
the given reference.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
44cc2e38e6 btrfs: stop referencing btrfs_delayed_data_ref directly
Now that most of our elements are inside of btrfs_delayed_ref_node
directly and we have helpers for the delayed_data_ref bits, go ahead and
remove all direct usage of btrfs_delayed_data_ref and use the helpers
where needed.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
b4b5934ac1 btrfs: make the insert backref helpers take a btrfs_delayed_ref_node
We don't need to pass in all the elements for the backrefs as function
arguments, simply pass through the btrfs_delayed_ref_node and then
extract the values we need from that.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
85bb9f544e btrfs: drop unnecessary arguments from __btrfs_free_extent
We have all the information we need in our btrfs_delayed_ref_node, which
we already pass into __btrfs_free_extent.  Drop the extra arguments and
just extract the values from btrfs_delayed_ref_node.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
a502f112ad btrfs: make __btrfs_inc_extent_ref take a btrfs_delayed_ref_node
We're just extracting the values from btrfs_delayed_ref_node and passing
them through, simply pass the btrfs_delayed_ref_node into
__btrfs_inc_extent_ref and shrink the function arguments.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
5366763446 btrfs: rename btrfs_data_ref->ino to ->objectid
This is how we refer to it in the rest of the extent reference related
code, make it consistent.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
cf4f04325b btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_node
These two members are shared by both the tree refs and data refs, so
move them into btrfs_delayed_ref_node proper.  This allows us to greatly
simplify the comparison code, as the shared refs always only sort on
parent, and the non shared refs always sort first on ref_root, and then
only data refs sort on their specific fields.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
12390e42b6 btrfs: rename ->len to ->num_bytes in btrfs_ref
We consistently use ->num_bytes everywhere through the delayed ref code,
except in btrfs_ref.  Rename btrfs_ref to match all the other code.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
f75464f7bb btrfs: unify the btrfs_add_delayed_*_ref helpers into one helper
Now that these helpers are identical, create a helper function that
handles everything properly and strip the individual helpers down to use
just the common helper. This cleans up a significant amount of
duplicated code.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
1bff6d4f87 btrfs: simplify delayed ref tracepoints
Now that all of the delayed ref information is in the delayed ref node,
drastically simplify the delayed ref tracepoints by simply passing in
the btrfs_delayed_ref_node and populating the tracepoints with the
values from the structure itself.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Josef Bacik
0ea4703cc2 btrfs: move ref specific initialization into init_delayed_ref_common
Now that the btrfs_delayed_ref_node contains a union of the data and
metadata specific information we can move the initialization into
init_delayed_ref_common and just use the btrfs_ref to initialize the
correct fields of the reference.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Josef Bacik
0509cc5661 btrfs: initialize btrfs_delayed_ref_head with btrfs_ref
We are calling init_delayed_ref_head with all of the elements from
btrfs_ref, clean this up to simply pass in the btrfs_ref and initialize
the btrfs_delayed_ref_head with the values from the btrfs_ref directly.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Josef Bacik
da3c548541 btrfs: pass btrfs_ref to init_delayed_ref_common
We're extracting all of these values from the btrfs_ref we passed in
already, just pass the btrfs_ref through to init_delayed_ref_common and
get the values directly from the struct.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Josef Bacik
f2e69a77aa btrfs: move ref_root into btrfs_ref
We have this in both btrfs_tree_ref and btrfs_data_ref, which is just
wasting space and making the code more complicated.  Move this into
btrfs_ref proper and update all the call sites to do the assignment in
btrfs_ref.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Josef Bacik
4d09b4e942 btrfs: do not use a function to initialize btrfs_ref
btrfs_ref currently has ->owning_root, and ->ref_root is shared between
the tree ref and data ref, so in order to move that into btrfs_ref
proper I would need to add another root parameter to the initialization
function.  This function has too many arguments, and adding another root
will make it easy to make mistakes about which root goes where.

Drop the generic ref init function and statically initialize the
btrfs_ref in every usage.  This makes the code easier to read because we
can see what elements we're assigning, and will make the upcoming change
moving the ref_root into the btrfs_ref more clear and less error prone
than adding a new element to the initialization function.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Josef Bacik
d3fbb00f5e btrfs: embed data_ref and tree_ref in btrfs_delayed_ref_node
We have been embedding btrfs_delayed_ref_node in the
btrfs_delayed_data_ref and btrfs_delayed_tree_ref, and then we have two
sets of cachep's and a variety of handling that is awkward because of
this separation.

Instead union these two members inside of btrfs_delayed_ref_node and
make that the first class object.  This allows us to go down to one
cachep for our delayed ref nodes instead of two.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Josef Bacik
0eea355fc0 btrfs: add a helper to get the delayed ref node from the data/tree ref
We have several different ways we refer to references throughout the
code and it's not consistent and there's a bit of duplication.  In order
to clean this up I want to have one structure we use to define reference
information, and one structure we use for the delayed reference
information.  Start this process by adding a helper to get from the
btrfs_delayed_data_ref/btrfs_delayed_tree_ref to the
btrfs_delayed_ref_node so that it'll make moving these structures around
simpler.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Filipe Manana
26c0fae3e7 btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries()
Currently btrfs_prune_dentries() has open code to find the first inode in
a root with a minimum inode number. Remove that code and make it use the
helper btrfs_find_first_inode() for that task.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Filipe Manana
5e485ac6f0 btrfs: export find_next_inode() as btrfs_find_first_inode()
Export the relocation private helper find_next_inode() to inode.c, as this
same logic is also used at btrfs_prune_dentries() and will be used by an
upcoming change that adds an extent map shrinker. The next patch will
change btrfs_prune_dentries() to use this helper.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Filipe Manana
ed48adf83e btrfs: simplify add_extent_mapping() by removing pointless label
The add_extent_mapping() function is short and trivial, there's no need to
have a label for a quick exit in case of an error, even because there's no
error handling needed, we just need to return the error. So remove that
label and return directly.

Also while at it remove the redundant initialization of 'ret', as that may
help avoid some warnings with clang tools such as the one reported/fixed
by commit 966de47ff0 ("btrfs: remove redundant initialization of
variables in log_new_ancestors").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Filipe Manana
071533da5f btrfs: tests: error out on unexpected extent map reference count
In the extent map self tests, when freeing all extent maps from a test
extent map tree we are not expecting to find any extent map with a
reference count different from 1 (the tree reference). If we find any,
we just log a message but we don't fail the test, which makes it very easy
to miss any bug/regression - no one reads the test messages unless a test
fails. So change the behaviour to make a test fail if we find an extent
map in the tree with a reference count different from 1. Make the failure
happen only after removing all extent maps, so that we don't leak memory.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
0a308f8095 btrfs: pass an inode to btrfs_add_extent_mapping()
Instead of passing fs_info and extent map tree arguments to
btrfs_add_extent_mapping(), we can pass an inode instead, as extent maps
are always inserted in the extent map tree of an inode, and the fs_info
can be extracted from the inode (inode->root->fs_info). The only exception
is in the self tests where we allocate an extent map tree and then use it
to insert/update/remove extent maps. However the tests can be changed to
use a test inode and then use the inode's extent map tree.

So change btrfs_add_extent_mapping() to have an inode as an argument
instead of a fs_info and an extent map tree. This reduces the number of
parameters and will also be needed for an upcoming change.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
236e3107fc btrfs: open code csum_exist_in_range()
The csum_exist_in_range() function is now too trivial and is only used in
one place, so open code it in its single caller.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
8d2a83a97f btrfs: make NOCOW checks for existence of checksums in a range more efficient
Before deciding if we can do a NOCOW write into a range, one of the things
we have to do is check if there are checksum items for that range. We do
that through the btrfs_lookup_csums_list() function, which searches for
checksums and adds them to a list supplied by the caller.

But all we need is to check if there is any checksum, we don't need to
look for all of them and collect them into a list, which requires more
search time in the checksums tree, allocating memory for checksums items
to add to the list, copy checksums from a leaf into those list items,
then free that memory, etc. This is all unnecessary overhead, wasting
mostly CPU time, and perhaps some occasional IO if we need to read from
disk any extent buffers.

So change btrfs_lookup_csums_list() to allow to return immediately in
case it finds any checksum, without the need to add it to a list and read
it from a leaf. This is accomplished by allowing a NULL list parameter and
making the function return 1 if it found any checksum, 0 if it didn't
found any, and a negative value in case of an error.

The following test with fio was used to measure performance:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0

  cat <<EOF > /tmp/fio-job.ini
  [global]
  name=fio-rand-write
  filename=$MNT/fio-rand-write
  rw=randwrite
  bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
  direct=1
  numjobs=16
  fallocate=posix
  time_based
  runtime=300

  [file1]
  size=8G
  ioengine=io_uring
  iodepth=16
  EOF

  umount $MNT &> /dev/null
  mkfs.btrfs -f $DEV
  mount -o ssd $DEV $MNT

  fio /tmp/fio-job.ini
  umount $MNT

The test was run on a release kernel (Debian's default kernel config).

The results before this patch:

  WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec

The results after this patch:

  WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
fb90e1caf0 btrfs: simplify error path for btrfs_lookup_csums_list()
In the error path we have this while loop that keeps iterating over the
csums of the list and then delete them from the list and free them,
testing for an error (ret < 0) and list emptyness as the conditions of
the while loop.

Simplify this by using list_for_each_entry_safe() so there's no need to
delete elements from the list and need to test the error condition on
each iteration.

Also rename the 'fail' label to 'out' since the label is not exclusive
to a failure path, as we also end up there when the function succeeds,
and it's also a more common label name.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
c0dce8b6a3 btrfs: remove use of a temporary list at btrfs_lookup_csums_list()
There's no need to use a temporary list to add the checksums, we can just
add them to input list and then on error delete and free any checksums
that were added. So simplify and remove the temporary list.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
afcb80624f btrfs: remove search_commit parameter from btrfs_lookup_csums_list()
All the callers of btrfs_lookup_csums_list() pass a value of 0 as the
"search_commit" parameter. So remove it and make the function behave as
to always search from the regular root.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
d800a9065b btrfs: add function comment to btrfs_lookup_csums_list()
Add a function comment to btrfs_lookup_csums_list() to document it.
With another upcoming change its parameter list and return value will be
less obvious. So add the documentation now so that it can be updated where
needed later.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
0ddefc2a7c btrfs: move btrfs_page_mkwrite() from inode.c into file.c
btrfs_page_mkwrite() is a struct vm_operations_struct callback and we
define that structure in file.c. Currently the function is in inode.c and
has to be exported to be used in file.c, which makes no sense because it's
not used anywhere else. So move btrfs_page_mkwrite() from inode.c and into
file.c.

While at it do a few minor style changes:

1) Capitalize the first word of every comment and end each sentence with
   punctuation;

2) Avoid splitting some statements into two lines when everything fits in
   85 characters or less.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
590e2c4a1e btrfs: remove no longer used btrfs_clone_chunk_map()
There are no more users of btrfs_clone_chunk_map(), the last one (and
only one ever) was removed in commit 1ec17ef591 ("btrfs: zoned: fix
use-after-free in do_zone_finish()"). So remove btrfs_clone_chunk_map().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
606a1c5de1 btrfs: remove list_empty() check at warn_about_uncommitted_trans()
At warn_about_uncommitted_trans(), there's no need to check if the list
is empty and return, because list_for_each_entry_safe() is safe to call
for an empty list, it simply does nothing. So remove the check.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
47f6944877 btrfs: remove pointless return value assignment at btrfs_finish_one_ordered()
At btrfs_finish_one_ordered() it's pointless to assign 0 to the 'ret'
variable because if it has a non-zero value (error), we have already
jumped to the 'out' label. So remove that redundant assignment.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Filipe Manana
2e438442ba btrfs: remove not needed mod_start and mod_len from struct extent_map
The mod_start and mod_len fields of struct extent_map were introduced by
commit 4e2f84e63d ("Btrfs: improve fsync by filtering extents that we
want") in order to avoid too low performance when fsyncing a file that
keeps getting extent maps merge, because it resulted in each fsync logging
again csum ranges that were already merged before.

We don't need this anymore as extent maps in the list of modified extents
are never merged with other extent maps and once we log an extent map we
remove it from the list of modified extent maps, so it's never logged
twice.

So remove the mod_start and mod_len fields from struct extent_map and use
instead the start and len fields when logging checksums in the fast fsync
path. This also makes EXTENT_FLAG_FILLING unused so remove it as well.

Running the reproducer from the commit mentioned before, with a larger
number of extents and against a null block device, so that IO is fast
and we can better see any impact from searching checksums items and
logging them, gave the following results from dd:

Before this change:

   409600000 bytes (410 MB, 391 MiB) copied, 22.948 s, 17.8 MB/s

After this change:

   409600000 bytes (410 MB, 391 MiB) copied, 22.9997 s, 17.8 MB/s

So no changes in throughput.
The test was done in a release kernel (non-debug, Debian's default kernel
config) and its steps are the following:

   $ mkfs.btrfs -f /dev/nullb0
   $ mount /dev/sdb /mnt
   $ dd if=/dev/zero of=/mnt/foobar bs=4k count=100000 oflag=sync
   $ umount /mnt

This also reduces the size of struct extent_map from 128 bytes down to 112
bytes, so now we can have 36 extents maps per 4K page instead of 32.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Boris Burkov
5f2fb819f6 btrfs: free PERTRANS at the end of cleanup_transaction()
Some of the operations after the free might convert more PERTRANS
metadata. Do the freeing as late as possible to eliminate a source of
leaked PERTRANS metadata.

This helps with the pass rate of generic/269 and generic/475.

Reviewed-by: Qu Wenruo <qwu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Qu Wenruo
400b172b8c btrfs: compression: migrate compression/decompression paths to folios
For both compression and decompression paths, we always require a
"struct page **pages" and "unsigned long nr_pages", this involves quite
some part of the btrfs compression paths:

- All the compression entry points

- compressed_bio structure
  This affects both compression and decompression.

- async_extent structure

Unfortunately with all those involved parts, there is no good way to
split the conversion into smaller patches while still passing compiling.
So do this in one big conversion in one go.

Please note this is direct page->folio conversion, no change on the page
sized folio requirement yet.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Qu Wenruo
11e03f2f4b btrfs: introduce btrfs_alloc_folio_array()
The new helper will do the same thing as btrfs_alloc_page_array(), but
with folios.

One extra difference is, there is no extra helper for bulk allocation,
thus it may not be as efficient as the page version.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Qu Wenruo
ae0d22a7fc btrfs: migrate insert_inline_extent() to folio interfaces
Since insert_inline_extent() now only accepts a single page, it's much
easier to convert it to use folio interfaces.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Qu Wenruo
eb1fa9ab47 btrfs: make insert_inline_extent() accept one page directly
Since our inline extent cannot accept anything larger than a sector,
there is really no need to pass all the compressed pages to
insert_inline_extent().

And just in case, expand the ASSERT()s to make sure we only try inline
with compressed size no larger than sectorsize.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Qu Wenruo
98fe01af7e btrfs: compression: convert page allocation to folio interfaces
Currently we have two wrappers to allocate and free a page for
compression usage:

- btrfs_alloc_compr_page()
- btrfs_free_compr_page()

The allocator would try to grab a page from the pool, and only allocate
a new page if the pool is empty.

The reclaimer would check if the pool is full, and if not full it would
put the page into the pool.

This patch converts both helpers to use folio interfaces, and allowing
further conversion of compression path to folios.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Qu Wenruo
6de3595473 btrfs: compression: add error handling for missed page cache
For all the supported compression algorithms, the compression path would
always need to grab the page cache, then do the compression.

Normally we would get a page reference without any problem, since the
write path should have already locked the pages in the write range.
For the sake of error handling, we should handle the page cache miss
case.

Adds a common wrapper, btrfs_compress_find_get_page(), which calls
find_get_page(), and do the error handling along with an error message.

Callers inside compression path would only need to call
btrfs_compress_find_get_page(), and error out if it returned any error.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Filipe Manana
5d6f0e9890 btrfs: stop locking the source extent range during reflink
Nowadays before starting a reflink operation we do this:

1) Take the VFS lock of the inodes in exclusive mode (a rw semaphore);

2) Take the  mmap lock of the inodes (struct btrfs_inode::i_mmap_lock);

3) Flush all delalloc in the source and target ranges;

4) Wait for all ordered extents in the source and target ranges to
   complete;

5) Lock the source and destination ranges in the inodes' io trees.

In step 5 we lock the source range because:

1) We needed to serialize against mmap writes, but that is not needed
   anymore because nowadays we do that through the inode's i_mmap_lock
   (step 2). This happens since commit 8c99516a8c ("btrfs: exclude mmaps
   while doing remap");

2) To serialize against a concurrent relocation and avoid generating
   a delayed ref for an extent that was just dropped by relocation, see
   commit d8b5524242 ("Btrfs: fix race between reflink/dedupe and
   relocation").

Locking the source range however blocks any concurrent reads for that
range and makes test case generic/733 fail.

So instead of locking the source range during reflinks, make relocation
read lock the inode's i_mmap_lock, so that it serializes with a concurrent
reflink while still able to run concurrently with mmap writes and allow
concurrent reads too.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Dan Carpenter
4a43d735a6 btrfs: qgroup: delete unnecessary check in btrfs_qgroup_check_inherit()
This check "if (inherit->num_qgroups > PAGE_SIZE)" is confusing and
unnecessary.

The problem with the check is that static checkers flag it as a
potential mixup of between units of bytes vs number of elements.
Fortunately, the check can safely be deleted because the next check is
correct and applies an even stricter limit:

	if (size != struct_size(inherit, qgroups, inherit->num_qgroups))
		return -EINVAL;

The "inherit" struct ends in a variable array of __u64 and
"inherit->num_qgroups" is the number of elements in the array.  At the
start of the function we check that:

	if (size < sizeof(*inherit) || size > PAGE_SIZE)
		return -EINVAL;

Thus, since we verify that the whole struct fits within one page, that
means that the number of elements in the inherit->qgroups[] array must
be less than PAGE_SIZE.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Goldwyn Rodrigues
01b69bf990 btrfs: convert put_file_data() to folios
Use folio instead of page in put_file_data(). Add a warning in case
higher order folio is found, this will be implemented in the future.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:01 +02:00
Goldwyn Rodrigues
a16c2c48f4 btrfs: convert relocate_one_page() to folios and rename
Convert page references to folios and call the respective folio
functions.  Since find_or_create_page() takes a mask argument, call
__filemap_get_folio() instead of filemap_grab_folio().

The patch assumes folio size is PAGE_SIZE, add a warning in case it's a
higher order that will be implemented in the future.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:01 +02:00
Goldwyn Rodrigues
8d6e5f9a0a btrfs: page to folio conversion: prealloc_file_extent_cluster()
Convert usage of page to folio in prealloc_file_extent_cluster()

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:01 +02:00
Anand Jain
70f1e5b6db btrfs: rename err to ret in btrfs_direct_write()
Unify naming of return value to the preferred way.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:01 +02:00
Anand Jain
aefee7f1d8 btrfs: rename err to ret in prepare_pages()
Unify naming of return value to the preferred way.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:01 +02:00
Anand Jain
35cb2e90f4 btrfs: rename err to ret in btrfs_dirty_pages()
Unify naming of return value to the preferred way.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:01 +02:00
Anand Jain
04e4e189dd btrfs: rename err to ret in create_reloc_inode()
Unify naming of return value to the preferred way.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:01 +02:00
Anand Jain
fdee5e557f btrfs: rename err to ret in __btrfs_end_transaction()
Unify naming of return value to the preferred way.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:01 +02:00
Anand Jain
d5b634ae1f btrfs: rename err to ret in convert_extent_bit()
Unify naming of return value to the preferred way.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:01 +02:00
Anand Jain
cbb6b5d208 btrfs: rename err to ret in __set_extent_bit()
Unify naming of return value to the preferred way.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:01 +02:00
Anand Jain
93bc66f4b6 btrfs: rename err to ret in btrfs_ioctl_snap_destroy()
Unify naming of return value to the preferred way.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:01 +02:00
Anand Jain
5e45b044b7 btrfs: rename err to ret in btrfs_cont_expand()
Unify naming of return value to the preferred way.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Anand Jain
c3a1cc8ff4 btrfs: rename err to ret in btrfs_rmdir()
Unify naming of return value to the preferred way.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Anand Jain
c87b979d9f btrfs: rename err to ret in btrfs_initxattrs()
Unify naming of return value to the preferred way.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Tavian Barnes
f32f20e2bd btrfs: warn if EXTENT_BUFFER_UPTODATE is set while reading
We recently tracked down a race condition that triggered a read for an
extent buffer with EXTENT_BUFFER_UPTODATE already set.  While this read
was in progress, other concurrent readers would see the UPTODATE bit and
return early as if the read was already complete, making accesses to the
extent buffer conflict with the read operation that was overwriting it.

Add a WARN_ON() to end_bbio_meta_read() for this situation to make
similar races easier to spot in the future.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Tavian Barnes
1e2d183709 btrfs: add helper to clear EXTENT_BUFFER_READING
We are clearing the bit and waking up any waiters in two different
places.  Factor that code out into a static helper function.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Filipe Manana
c79f57eafc btrfs: avoid pointless wake ups of drew lock readers
When unlocking a write lock on a drew lock, at btrfs_drew_write_unlock(),
it's pointless to wake up tasks waiting to acquire a read lock if we
didn't decrement the 'writers' counter down to 0, since a read lock can
only be acquired when the counter reaches a value of 0. Doing so is
harmless from a functional point of view, but it's not efficient due to
unnecessarily waking up tasks just for them to sleep again on the
waitqueue.

So change this to wake up readers only if we decremented the 'writers'
counter to 0.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Filipe Manana
c66f2afc71 btrfs: remove pointless writepages callback wrapper
There's no point in having a static writepages callback in inode.c that
does nothing besides calling extent_writepages from extent_io.c.
So just remove the callback at inode.c and rename extent_writepages()
to btrfs_writepages().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Filipe Manana
7938d38b94 btrfs: remove pointless readahead callback wrapper
There's no point in having a static readahead callback in inode.c that
does nothing besides calling extent_readahead() from extent_io.c.
So just remove the callback at inode.c and rename extent_readahead()
to btrfs_readahead().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Filipe Manana
2066bbfccf btrfs: locking: rename __btrfs_tree_lock() and __btrfs_tree_read_lock()
The __btrfs_tree_lock() and __btrfs_tree_read_lock() are using a naming
with a double underscore prefix, which is specially not proper for
exported functions. Remove the double underscore prefix from their name
and add the "_nested" suffix.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Filipe Manana
f40ca9cb58 btrfs: locking: inline btrfs_tree_lock() and btrfs_tree_read_lock()
The functions btrfs_tree_lock() and btrfs_tree_read_lock() are very
trivial so that can be made inline and avoid call overhead, as they
are very often called inside critical sections (when searching a btree
for example, attempting to lock a child node/leaf while holding a lock
on the parent).

So make them static inline, which even reduces the size of the btrfs
module a little bit.

Before this change:

   $ size fs/btrfs/btrfs.ko
      text	   data	    bss	    dec	    hex	filename
   1718786	 156276	  16920	1891982	 1cde8e	fs/btrfs/btrfs.ko

After this change:

   $ size fs/btrfs/btrfs.ko
      text	   data	    bss	    dec	    hex	filename
   1718650	 156260	  16920	1891830	 1cddf6	fs/btrfs/btrfs.ko

Running fs_mark also showed a tiny improvement with this script:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/nullb0
   MNT=/mnt/nullb0
   FILES=100000
   THREADS=$(nproc --all)

   echo "performance" | \
       tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

   umount $DEV &> /dev/null
   mkfs.btrfs -f $DEV
   mount $DEV $MNT

   OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
   for ((i = 1; i <= $THREADS; i++)); do
        OPTS="$OPTS -d $MNT/d$i"
   done

   fs_mark $OPTS

   umount $MNT

Before this change:

   FSUse%        Count         Size    Files/sec     App Overhead
       10      1200000            0     180894.0         10705410
       16      2400000            0     228211.4         10765738
       23      3600000            0     215969.6         11011072
       30      4800000            0     199077.1         11145587
       46      6000000            0     176624.1         11658470

After this change:

   FSUse%        Count         Size    Files/sec     App Overhead
       10      1200000            0     185312.3         10708377
       16      2400000            0     229320.4         10858013
       23      3600000            0     217958.7         11006167
       30      4800000            0     205122.9         11112899
       46      6000000            0     178039.1         11438852

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Filipe Manana
05aa024382 btrfs: remove pointless BUG_ON() when creating snapshot
When creating a snapshot we first check with btrfs_lookup_dir_item() if
there is a name collision in the parent directory and then return an error
if there's a collision. Then later on when trying to insert a dir item for
the snapshot we BUG_ON() if the return value is -EEXIST or -EOVERFLOW:

  static noinline int create_pending_snapshot(...)
  {
     (...)

     /* check if there is a file/dir which has the same name. */
     dir_item = btrfs_lookup_dir_item(...);
     (...)

     ret = btrfs_insert_dir_item(...);
     /* We have check then name at the beginning, so it is impossible. */
     BUG_ON(ret == -EEXIST || ret == -EOVERFLOW);
     if (ret) {
        btrfs_abort_transaction(trans, ret);
        goto fail;
     }

     (...)
  }

It's impossible to get the -EEXIST because we previously checked for a
potential collision with btrfs_lookup_dir_item() and we know that after
that no one could have added a colliding name because at this point the
transaction is in its critical section, state TRANS_STATE_COMMIT_DOING,
so no one can join this transaction to add a colliding name and neither
can anyone start a new transaction to do that.

As for the -EOVERFLOW, that can't happen as long as we have the extended
references feature enabled, which is a mkfs default for many years now.

In either case, the BUG_ON() is excessive as we can properly deal with
any error and can abort the transaction and jump to the 'fail' label,
in which case we'll also get the useful stack trace (just like a BUG_ON())
from the abort if the error is either -EEXIST or -EOVERFLOW.

So remove the BUG_ON().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Kent Overstreet
6e297a73bc bcachefs: Add missing sched_annotate_sleep() in bch2_journal_flush_seq_async()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-07 11:02:37 -04:00
Kent Overstreet
54541c1f78 bcachefs: Fix race in bch2_write_super()
bch2_write_super() was looping over online devices multiple times -
dropping and retaking io_ref each time.

This meant it could race with device removal; it could increment the
sequence number on a device but fail to write it - and then if the
device was re-added, it would get confused the next time around thinking
a superblock write was silently dropped.

Fix this by taking io_ref once, and stashing pointers to online devices
in a darray.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-07 11:02:36 -04:00
Wolfram Sang
c1c53c26e3 gfs2: make timeout values more explicit
'timeout' is a vague name for the return value of wait_event_*_timeout
because it actually returns the time left. Because the variable is never
used later, just drop the return value. Since variable 'timeout' is then
only used to carry a fixed timeout value, drop this in favor of a fixed
function argument as in the other call to wait_event_timeout() above.

Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-05-07 12:42:48 +02:00
Linus Torvalds
dccb07f291 for-6.9-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmY5LaQACgkQxWXV+ddt
 WDs7aQ/8DbxhYTNqHmEv6860w/o7Sb856foIqlZ81v1r55XYIFGhTbvQntjQgvcI
 Kf8+Du6ijpYeAO2Iuj/7EQP/6yA+f5NCogpW8Nsr24riUCvzjhXr49KQbVg1SsdX
 i8iec0UCQlxq7RncbpGiIwgxPRMJhUEG/wRHWnGR3jOJFXSvsLJpywZbn+Yw1d7w
 kcUbHEzZPZqrWPAIcifpv/7qVCd1sPN8P3mMevcWtc1diEhQlHVVF7JnCcHxrwBP
 dIsNSWyt0YmgIt231GW6GKDwuQHyv870yHK9gumvpePsfcZnDBgeMuMvv0TykhgJ
 BHV2gwhIK11bNala1pw1F7CX4oiiHEeI/09/nh7xopcjnULMRFItGus2dkqDagSa
 ex4g48J412crWayZ5uFqAVYeO9MNufvLvCutUj1sD/teh2ymMq82gHzQO0FTu5GL
 NjWLoJXXyU18BgbXTmbm5rSMycDf1BG9Hv+MdxwEFrasF2q6Lhp+EIljUxN7+n49
 i9GrLWptd8sBx/GtZXhsZlWP+vPSuHqdjZe61LD4B3IgBeGDJg6tJmHv8rEFO4Ws
 9nkvaDVF03pHWxWOocDIzbrkpVwOLBaDHGwjH9Cn/lgIHL+zjXVpMaKz4/klpOr8
 4/ehUajrOK6Wmyoi3fKYxZACnWK5HhFHYcB8zc1R8+zt+Pj/mbk=
 =2no9
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two more fixes, both have some visible effects on user space:

   - add check if quotas are enabled when passing qgroup inheritance
     info, this affects snapper that could fail to create a snapshot

   - do check for leaf/node flag WRITTEN earlier so that nodes are
     completely validated before access, this used to be done by
     integrity checker but it's been removed and left an unhandled case"

* tag 'for-6.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: make sure that WRITTEN is set on all metadata blocks
  btrfs: qgroup: do not check qgroup inherit if qgroup is disabled
2024-05-06 13:43:13 -07:00
Kent Overstreet
71dac2482a bcachefs: BCH_SB_LAYOUT_SIZE_BITS_MAX
Define a constant for the max superblock size, to avoid a too-large
shift.

Reported-by: syzbot+a8b0fb419355c91dda7f@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
88ab10186c bcachefs: Add missing skcipher_request_set_callback() call
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
8060bf1d83 bcachefs: Fix snapshot_t() usage in bch2_fs_quota_read_inode()
bch2_fs_quota_read_inode() wasn't entirely updated to the
bch2_snapshot_tree() helper, which takes rcu lock.

Reported-by: syzbot+a3a9a61224ed3b7f0010@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
0ec5b3b7cc bcachefs: Fix shift-by-64 in bformat_needs_redo()
Ancient versions of bcachefs produced packed formats that could
represent keys that our in memory format cannot represent;
bformat_needs_redo() has some tricky shifts to check for this sort of
overflow.

Reported-by: syzbot+594427aebfefeebe91c6@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
2bb9600d5d bcachefs: Guard against unknown k.k->type in __bkey_invalid()
For forwards compatibility we have to allow unknown key types, and only
run the checks that make sense against them.

Fix a missing guard on k.k->type being known.

Reported-by: syzbot+ae4dc916da3ce51f284f@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
f39055220f bcachefs: Add missing validation for superblock section clean
We were forgetting to check for jset entries that overrun the end of the
section - both in validate and to_text(); to_text() needs to be safe for
types that fail to validate.

Reported-by: syzbot+c48865e11e7e893ec4ab@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
6b8cbfc3db bcachefs: Fix assert in bch2_alloc_v4_invalid()
Reported-by: syzbot+10827fa6b176e1acf1d0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Reed Riley
9a0ec04511 bcachefs: fix overflow in fiemap
filefrag (and potentially other utilities that call fiemap) sometimes
pass ULONG_MAX as the length.  fiemap_prep clamps excessively large
lengths - but the calculation of end can overflow if it occurs before
calling fiemap_prep.  When this happens, filefrag assumes it has read to
the end and exits.

Signed-off-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
db42549d40 bcachefs: Add a better limit for maximum number of buckets
The bucket_gens array is a single array allocation (one byte per
bucket), and kernel allocations are still limited to INT_MAX.

Check this limit to avoid failing the bucket_gens array allocation.

Reported-by: syzbot+b29f436493184ea42e2b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
18b4abcead bcachefs: Fix lifetime issue in device iterator helpers
bch2_get_next_dev() and bch2_get_next_online_dev() iterate over devices,
dropping and taking refs as they go; we can't access the previous device
(for ca->dev_idx) after we've dropped our ref to it, unless we take
rcu_read_lock() first.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
3a2d025927 bcachefs: Fix bch2_dev_lookup() refcounting
bch2_dev_lookup() is supposed to take a ref on the device it returns, but
for_each_member_device() takes refs as it iterates,
for_each_member_device_rcu() does not.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
1267df40ac bcachefs: Initialize bch_write_op->failed in inline data path
Normally this is initialized in __bch2_write(), which is executed in a
loop, but the inline data path skips this.

Reported-by: syzbot+fd3ccb331eb21f05d13b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
feb077c177 bcachefs: Fix refcount put in sb_field_resize error path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
4a8521b6bb bcachefs: Inodes need extra padding for varint_decode_fast()
Reported-by: syzbot+66b9b74f6520068596a9@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
b30b70ad8b bcachefs: Fix early error path in bch2_fs_btree_key_cache_exit()
Reported-by: syzbot+a35cdb62ec34d44fb062@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
a2ddaf965f bcachefs: bucket_pos_to_bp_noerror()
We don't want the assert when we're checking if the backpointer is
valid.

Reported-by: syzbot+bf7215c0525098e7747a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
7ffec9ccdc bcachefs: don't free error pointers
Reported-by: syzbot+3333603f569fc2ef258c@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
72e71bf029 bcachefs: Fix a scheduler splat in __bch2_next_write_buffer_flush_journal_buf()
We're using mutex_lock() inside a wait_event() conditional -
prepare_to_wait() has already flipped task state, so potentially
blocking ops need annotation.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:14:13 -04:00
Mike Marshall
53e4efa470 orangefs: fix out-of-bounds fsid access
Arnd Bergmann sent a patch to fsdevel, he says:

"orangefs_statfs() copies two consecutive fields of the superblock into
the statfs structure, which triggers a warning from the string fortification
helpers"

Jan Kara suggested an alternate way to do the patch to make it more readable.

I ran both ideas through xfstests and both seem fine. This patch
is based on Jan Kara's suggestion.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2024-05-06 10:10:36 -04:00
Ryan Roberts
2c7ad9a590 fs/proc/task_mmu: fix uffd-wp confusion in pagemap_scan_pmd_entry()
pagemap_scan_pmd_entry() checks if uffd-wp is set on each pte to avoid
unnecessary if set.  However it was previously checking with
`pte_uffd_wp(ptep_get(pte))` without first confirming that the pte was
present.  It is only valid to call pte_uffd_wp() for present ptes.  For
swap ptes, pte_swp_uffd_wp() must be called because the uffd-wp bit may be
kept in a different position, depending on the arch.

This was leading to test failures in the pagemap_ioctl mm selftest, when
bringing up uffd-wp support on arm64 due to incorrectly interpretting the
uffd-wp status of migration entries.

Let's fix this by using the correct check based on pte_present().  While
we are at it, let's pass the pte to make_uffd_wp_pte() to avoid the
pointless extra ptep_get() which can't be optimized out due to READ_ONCE()
on many arches.

Link: https://lkml.kernel.org/r/20240429114104.182890-1-ryan.roberts@arm.com
Fixes: 12f6b01a0b ("fs/proc/task_mmu: add fast paths to get/clear PAGE_IS_WRITTEN flag")
Closes: https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> 
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:28:07 -07:00
Ryan Roberts
c70dce4982 fs/proc/task_mmu: fix loss of young/dirty bits during pagemap scan
make_uffd_wp_pte() was previously doing:

  pte = ptep_get(ptep);
  ptep_modify_prot_start(ptep);
  pte = pte_mkuffd_wp(pte);
  ptep_modify_prot_commit(ptep, pte);

But if another thread accessed or dirtied the pte between the first 2
calls, this could lead to loss of that information.  Since
ptep_modify_prot_start() gets and clears atomically, the following is the
correct pattern and prevents any possible race.  Any access after the
first call would see an invalid pte and cause a fault:

  pte = ptep_modify_prot_start(ptep);
  pte = pte_mkuffd_wp(pte);
  ptep_modify_prot_commit(ptep, pte);

Link: https://lkml.kernel.org/r/20240429114017.182570-1-ryan.roberts@arm.com
Fixes: 52526ca7fd ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:28:06 -07:00
Peter Xu
c88033efe9 mm/userfaultfd: reset ptes when close() for wr-protected ones
Userfaultfd unregister includes a step to remove wr-protect bits from all
the relevant pgtable entries, but that only covered an explicit
UFFDIO_UNREGISTER ioctl, not a close() on the userfaultfd itself.  Cover
that too.  This fixes a WARN trace.

The only user visible side effect is the user can observe leftover
wr-protect bits even if the user close()ed on an userfaultfd when
releasing the last reference of it.  However hopefully that should be
harmless, and nothing bad should happen even if so.

This change is now more important after the recent page-table-check
patch we merged in mm-unstable (446dd9ad37d0 ("mm/page_table_check:
support userfault wr-protect entries")), as we'll do sanity check on
uffd-wp bits without vma context.  So it's better if we can 100%
guarantee no uffd-wp bit leftovers, to make sure each report will be
valid.

Link: https://lore.kernel.org/all/000000000000ca4df20616a0fe16@google.com/
Fixes: f369b07c86 ("mm/uffd: reset write protection when unregister with wp-mode")
Analyzed-by: David Hildenbrand <david@redhat.com>
Link: https://lkml.kernel.org/r/20240422133311.2987675-1-peterx@redhat.com
Reported-by: syzbot+d8426b591c36b21c750e@syzkaller.appspotmail.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:28:04 -07:00
Linus Torvalds
4efaa5acf0 epoll: be better about file lifetimes
epoll can call out to vfs_poll() with a file pointer that may race with
the last 'fput()'. That would make f_count go down to zero, and while
the ep->mtx locking means that the resulting file pointer tear-down will
be blocked until the poll returns, it means that f_count is already
dead, and any use of it won't actually get a reference to the file any
more: it's dead regardless.

Make sure we have a valid ref on the file pointer before we call down to
vfs_poll() from the epoll routines.

Link: https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/
Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-05-05 14:00:48 -07:00
Linus Torvalds
e92b99ae82 tracing and tracefs fixes for v6.9
- Fix RCU callback of freeing an eventfs_inode.
   The freeing of the eventfs_inode from the kref going to zero
   freed the contents of the eventfs_inode and then used kfree_rcu()
   to free the inode itself. But the contents should also be protected
   by RCU. Switch to a call_rcu() that calls a function to free all
   of the eventfs_inode after the RCU synchronization.
 
 - The tracing subsystem maps its own descriptor to a file represented by
   eventfs. The freeing of this descriptor needs to know when the
   last reference of an eventfs_inode is released, but currently
   there is no interface for that. Add a "release" callback to
   the eventfs_inode entry array that allows for freeing of data
   that can be referenced by the eventfs_inode being opened.
   Then increment the ref counter for this descriptor when the
   eventfs_inode file is created, and decrement/free it when the
   last reference to the eventfs_inode is released and the file
   is removed. This prevents races between freeing the descriptor
   and the opening of the eventfs file.
 
 - Fix the permission processing of eventfs.
   The change to make the permissions of eventfs default to the mount
   point but keep track of when changes were made had a side effect
   that could cause security concerns. When the tracefs is remounted
   with a given gid or uid, all the files within it should inherit
   that gid or uid. But if the admin had changed the permission of
   some file within the tracefs file system, it would not get updated
   by the remount. This caused the kselftest of file permissions
   to fail the second time it is run. The first time, all changes
   would look fine, but the second time, because the changes were
   "saved", the remount did not reset them.
 
   Create a link list of all existing tracefs inodes, and clear the
   saved flags on them on a remount if the remount changes the
   corresponding gid or uid fields.
 
   This also simplifies the code by removing the distinction between the
   toplevel eventfs and an instance eventfs. They should both act the
   same. They were different because of a misconception due to the
   remount not resetting the flags. Now that remount resets all the
   files and directories to default to the root node if a uid/gid is
   specified, it makes the logic simpler to implement.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZjXxzxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqzGAQCX8g7gtngGgwSsWqPW5GmecCifwFja
 k7cVEDhMYPnDeAEAkYi2ZBgJRkPsWPfMRClDK/DXP4woOo58asxtIxfTMgg=
 =mCkt
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing and tracefs fixes from Steven Rostedt:

 - Fix RCU callback of freeing an eventfs_inode.

   The freeing of the eventfs_inode from the kref going to zero freed
   the contents of the eventfs_inode and then used kfree_rcu() to free
   the inode itself. But the contents should also be protected by RCU.
   Switch to a call_rcu() that calls a function to free all of the
   eventfs_inode after the RCU synchronization.

 - The tracing subsystem maps its own descriptor to a file represented
   by eventfs. The freeing of this descriptor needs to know when the
   last reference of an eventfs_inode is released, but currently there
   is no interface for that.

   Add a "release" callback to the eventfs_inode entry array that allows
   for freeing of data that can be referenced by the eventfs_inode being
   opened. Then increment the ref counter for this descriptor when the
   eventfs_inode file is created, and decrement/free it when the last
   reference to the eventfs_inode is released and the file is removed.
   This prevents races between freeing the descriptor and the opening of
   the eventfs file.

 - Fix the permission processing of eventfs.

   The change to make the permissions of eventfs default to the mount
   point but keep track of when changes were made had a side effect that
   could cause security concerns. When the tracefs is remounted with a
   given gid or uid, all the files within it should inherit that gid or
   uid. But if the admin had changed the permission of some file within
   the tracefs file system, it would not get updated by the remount.

   This caused the kselftest of file permissions to fail the second time
   it is run. The first time, all changes would look fine, but the
   second time, because the changes were "saved", the remount did not
   reset them.

   Create a link list of all existing tracefs inodes, and clear the
   saved flags on them on a remount if the remount changes the
   corresponding gid or uid fields.

   This also simplifies the code by removing the distinction between the
   toplevel eventfs and an instance eventfs. They should both act the
   same. They were different because of a misconception due to the
   remount not resetting the flags. Now that remount resets all the
   files and directories to default to the root node if a uid/gid is
   specified, it makes the logic simpler to implement.

* tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  eventfs: Have "events" directory get permissions from its parent
  eventfs: Do not treat events directory different than other directories
  eventfs: Do not differentiate the toplevel events directory
  tracefs: Still use mount point as default permissions for instances
  tracefs: Reset permissions on remount if permissions are options
  eventfs: Free all of the eventfs_inode after RCU
  eventfs/tracing: Add callback for release of an eventfs_inode
2024-05-05 09:53:09 -07:00
Namjae Jeon
691aae4f36 ksmbd: do not grant v2 lease if parent lease key and epoch are not set
This patch fix xfstests generic/070 test with smb2 leases = yes.

cifs.ko doesn't set parent lease key and epoch in create context v2 lease.
ksmbd suppose that parent lease and epoch are vaild if data length is
v2 lease context size and handle directory lease using this values.
ksmbd should hanle it as v1 lease not v2 lease if parent lease key and
epoch are not set in create context v2 lease.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-05-04 23:53:36 -05:00
Namjae Jeon
d1c189c6cb ksmbd: use rwsem instead of rwlock for lease break
lease break wait for lease break acknowledgment.
rwsem is more suitable than unlock while traversing the list for parent
lease break in ->m_op_list.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-05-04 23:53:36 -05:00
Namjae Jeon
97c2ec6466 ksmbd: avoid to send duplicate lease break notifications
This patch fixes generic/011 when enable smb2 leases.

if ksmbd sends multiple notifications for a file, cifs increments
the reference count of the file but it does not decrement the count by
the failure of queue_work.
So even if the file is closed, cifs does not send a SMB2_CLOSE request.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-05-04 23:53:35 -05:00
Namjae Jeon
cc00bc83f2 ksmbd: off ipv6only for both ipv4/ipv6 binding
ΕΛΕΝΗ reported that ksmbd binds to the IPV6 wildcard (::) by default for
ipv4 and ipv6 binding. So IPV4 connections are successful only when
the Linux system parameter bindv6only is set to 0 [default value].
If this parameter is set to 1, then the ipv6 wildcard only represents
any IPV6 address. Samba creates different sockets for ipv4 and ipv6
by default. This patch off sk_ipv6only to support IPV4/IPV6 connections
without creating two sockets.

Cc: stable@vger.kernel.org
Reported-by: ΕΛΕΝΗ ΤΖΑΒΕΛΛΑ <helentzavellas@yahoo.gr>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-05-04 23:53:35 -05:00
Steven Rostedt (Google)
d57cf30c4c eventfs: Have "events" directory get permissions from its parent
The events directory gets its permissions from the root inode. But this
can cause an inconsistency if the instances directory changes its
permissions, as the permissions of the created directories under it should
inherit the permissions of the instances directory when directories under
it are created.

Currently the behavior is:

 # cd /sys/kernel/tracing
 # chgrp 1002 instances
 # mkdir instances/foo
 # ls -l instances/foo
[..]
 -r--r-----  1 root lkp  0 May  1 18:55 buffer_total_size_kb
 -rw-r-----  1 root lkp  0 May  1 18:55 current_tracer
 -rw-r-----  1 root lkp  0 May  1 18:55 error_log
 drwxr-xr-x  1 root root 0 May  1 18:55 events
 --w-------  1 root lkp  0 May  1 18:55 free_buffer
 drwxr-x---  2 root lkp  0 May  1 18:55 options
 drwxr-x--- 10 root lkp  0 May  1 18:55 per_cpu
 -rw-r-----  1 root lkp  0 May  1 18:55 set_event

All the files and directories under "foo" has the "lkp" group except the
"events" directory. That's because its getting its default value from the
mount point instead of its parent.

Have the "events" directory make its default value based on its parent's
permissions. That now gives:

 # ls -l instances/foo
[..]
 -rw-r-----  1 root lkp 0 May  1 21:16 buffer_subbuf_size_kb
 -r--r-----  1 root lkp 0 May  1 21:16 buffer_total_size_kb
 -rw-r-----  1 root lkp 0 May  1 21:16 current_tracer
 -rw-r-----  1 root lkp 0 May  1 21:16 error_log
 drwxr-xr-x  1 root lkp 0 May  1 21:16 events
 --w-------  1 root lkp 0 May  1 21:16 free_buffer
 drwxr-x---  2 root lkp 0 May  1 21:16 options
 drwxr-x--- 10 root lkp 0 May  1 21:16 per_cpu
 -rw-r-----  1 root lkp 0 May  1 21:16 set_event

Link: https://lore.kernel.org/linux-trace-kernel/20240502200906.161887248@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 8186fff7ab ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04 04:25:37 -04:00
Steven Rostedt (Google)
22e61e15af eventfs: Do not treat events directory different than other directories
Treat the events directory the same as other directories when it comes to
permissions. The events directory was considered different because it's
dentry is persistent, whereas the other directory dentries are created
when accessed. But the way tracefs now does its ownership by using the
root dentry's permissions as the default permissions, the events directory
can get out of sync when a remount is performed setting the group and user
permissions.

Remove the special case for the events directory on setting the
attributes. This allows the updates caused by remount to work properly as
well as simplifies the code.

Link: https://lore.kernel.org/linux-trace-kernel/20240502200906.002923579@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 8186fff7ab ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04 04:25:37 -04:00
Steven Rostedt (Google)
d53891d348 eventfs: Do not differentiate the toplevel events directory
The toplevel events directory is really no different than the events
directory of instances. Having the two be different caused
inconsistencies and made it harder to fix the permissions bugs.

Make all events directories act the same.

Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.846448710@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 8186fff7ab ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04 04:25:37 -04:00
Steven Rostedt (Google)
6599bd5517 tracefs: Still use mount point as default permissions for instances
If the instances directory's permissions were never change, then have it
and its children use the mount point permissions as the default.

Currently, the permissions of instance directories are determined by the
instance directory's permissions itself. But if the tracefs file system is
remounted and changes the permissions, the instance directory and its
children should use the new permission.

But because both the instance directory and its children use the instance
directory's inode for permissions, it misses the update.

To demonstrate this:

  # cd /sys/kernel/tracing/
  # mkdir instances/foo
  # ls -ld instances/foo
 drwxr-x--- 5 root root 0 May  1 19:07 instances/foo
  # ls -ld instances
 drwxr-x--- 3 root root 0 May  1 18:57 instances
  # ls -ld current_tracer
 -rw-r----- 1 root root 0 May  1 18:57 current_tracer

  # mount -o remount,gid=1002 .
  # ls -ld instances
 drwxr-x--- 3 root root 0 May  1 18:57 instances
  # ls -ld instances/foo/
 drwxr-x--- 5 root root 0 May  1 19:07 instances/foo/
  # ls -ld current_tracer
 -rw-r----- 1 root lkp 0 May  1 18:57 current_tracer

Notice that changing the group id to that of "lkp" did not affect the
instances directory nor its children. It should have been:

  # ls -ld current_tracer
 -rw-r----- 1 root root 0 May  1 19:19 current_tracer
  # ls -ld instances/foo/
 drwxr-x--- 5 root root 0 May  1 19:25 instances/foo/
  # ls -ld instances
 drwxr-x--- 3 root root 0 May  1 19:19 instances

  # mount -o remount,gid=1002 .
  # ls -ld current_tracer
 -rw-r----- 1 root lkp 0 May  1 19:19 current_tracer
  # ls -ld instances
 drwxr-x--- 3 root lkp 0 May  1 19:19 instances
  # ls -ld instances/foo/
 drwxr-x--- 5 root lkp 0 May  1 19:25 instances/foo/

Where all files were updated by the remount gid update.

Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.686838327@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 8186fff7ab ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04 04:25:37 -04:00
Steven Rostedt (Google)
baa23a8d43 tracefs: Reset permissions on remount if permissions are options
There's an inconsistency with the way permissions are handled in tracefs.
Because the permissions are generated when accessed, they default to the
root inode's permission if they were never set by the user. If the user
sets the permissions, then a flag is set and the permissions are saved via
the inode (for tracefs files) or an internal attribute field (for
eventfs).

But if a remount happens that specify the permissions, all the files that
were not changed by the user gets updated, but the ones that were are not.
If the user were to remount the file system with a given permission, then
all files and directories within that file system should be updated.

This can cause security issues if a file's permission was updated but the
admin forgot about it. They could incorrectly think that remounting with
permissions set would update all files, but miss some.

For example:

 # cd /sys/kernel/tracing
 # chgrp 1002 current_tracer
 # ls -l
[..]
 -rw-r-----  1 root root 0 May  1 21:25 buffer_size_kb
 -rw-r-----  1 root root 0 May  1 21:25 buffer_subbuf_size_kb
 -r--r-----  1 root root 0 May  1 21:25 buffer_total_size_kb
 -rw-r-----  1 root lkp  0 May  1 21:25 current_tracer
 -rw-r-----  1 root root 0 May  1 21:25 dynamic_events
 -r--r-----  1 root root 0 May  1 21:25 dyn_ftrace_total_info
 -r--r-----  1 root root 0 May  1 21:25 enabled_functions

Where current_tracer now has group "lkp".

 # mount -o remount,gid=1001 .
 # ls -l
 -rw-r-----  1 root tracing 0 May  1 21:25 buffer_size_kb
 -rw-r-----  1 root tracing 0 May  1 21:25 buffer_subbuf_size_kb
 -r--r-----  1 root tracing 0 May  1 21:25 buffer_total_size_kb
 -rw-r-----  1 root lkp     0 May  1 21:25 current_tracer
 -rw-r-----  1 root tracing 0 May  1 21:25 dynamic_events
 -r--r-----  1 root tracing 0 May  1 21:25 dyn_ftrace_total_info
 -r--r-----  1 root tracing 0 May  1 21:25 enabled_functions

Everything changed but the "current_tracer".

Add a new link list that keeps track of all the tracefs_inodes which has
the permission flags that tell if the file/dir should use the root inode's
permission or not. Then on remount, clear all the flags so that the
default behavior of using the root inode's permission is done for all
files and directories.

Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.529542160@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 8186fff7ab ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04 04:25:37 -04:00
Steven Rostedt (Google)
ee4e037947 eventfs: Free all of the eventfs_inode after RCU
The freeing of eventfs_inode via a kfree_rcu() callback. But the content
of the eventfs_inode was being freed after the last kref. This is
dangerous, as changes are being made that can access the content of an
eventfs_inode from an RCU loop.

Instead of using kfree_rcu() use call_rcu() that calls a function to do
all the freeing of the eventfs_inode after a RCU grace period has expired.

Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.370261163@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 43aa6f97c2 ("eventfs: Get rid of dentry pointers without refcounts")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04 04:25:37 -04:00
Steven Rostedt (Google)
b63db58e2f eventfs/tracing: Add callback for release of an eventfs_inode
Synthetic events create and destroy tracefs files when they are created
and removed. The tracing subsystem has its own file descriptor
representing the state of the events attached to the tracefs files.
There's a race between the eventfs files and this file descriptor of the
tracing system where the following can cause an issue:

With two scripts 'A' and 'B' doing:

  Script 'A':
    echo "hello int aaa" > /sys/kernel/tracing/synthetic_events
    while :
    do
      echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable
    done

  Script 'B':
    echo > /sys/kernel/tracing/synthetic_events

Script 'A' creates a synthetic event "hello" and then just writes zero
into its enable file.

Script 'B' removes all synthetic events (including the newly created
"hello" event).

What happens is that the opening of the "enable" file has:

 {
	struct trace_event_file *file = inode->i_private;
	int ret;

	ret = tracing_check_open_get_tr(file->tr);
 [..]

But deleting the events frees the "file" descriptor, and a "use after
free" happens with the dereference at "file->tr".

The file descriptor does have a reference counter, but there needs to be a
way to decrement it from the eventfs when the eventfs_inode is removed
that represents this file descriptor.

Add an optional "release" callback to the eventfs_entry array structure,
that gets called when the eventfs file is about to be removed. This allows
for the creating on the eventfs file to increment the tracing file
descriptor ref counter. When the eventfs file is deleted, it can call the
release function that will call the put function for the tracing file
descriptor.

This will protect the tracing file from being freed while a eventfs file
that references it is being opened.

Link: https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-Tze-nan.Wu@mediatek.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240502090315.448cba46@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 5790b1fb3d ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: Tze-nan wu <Tze-nan.Wu@mediatek.com>
Tested-by: Tze-nan Wu (吳澤南) <Tze-nan.Wu@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04 04:25:37 -04:00
Matthew Wilcox (Oracle)
50fabd42cb gfs2: Convert gfs2_aspace_writepage() to use a folio
Convert the incoming struct page to a folio and use it throughout.
Saves six calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-05-03 21:01:02 +02:00
Matthew Wilcox (Oracle)
b844048011 gfs2: Add a migrate_folio operation for journalled files
For journalled data, folio migration currently works by writing the folio
back, freeing the folio and faulting the new folio back in.  We can
bypass that by telling the migration code to migrate the buffer_heads
attached to our folios.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-05-03 21:01:02 +02:00
Eric Biggers
ee5814ddde fsverity: use register_sysctl_init() to avoid kmemleak warning
Since the fsverity sysctl registration runs as a builtin initcall, there
is no corresponding sysctl deregistration and the resulting struct
ctl_table_header is not used.  This can cause a kmemleak warning just
after the system boots up.  (A pointer to the ctl_table_header is stored
in the fsverity_sysctl_header static variable, which kmemleak should
detect; however, the compiler can optimize out that variable.)  Avoid
the kmemleak warning by using register_sysctl_init() which is intended
for use by builtin initcalls and uses kmemleak_not_leak().

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/r/CAHj4cs8DTSvR698UE040rs_pX1k-WVe7aR6N2OoXXuhXJPDC-w@mail.gmail.com
Cc: stable@vger.kernel.org
Reviewed-by: Joel Granados <j.granados@samsung.com>
Link: https://lore.kernel.org/r/20240501025331.594183-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-05-03 08:30:58 -07:00
Josef Bacik
e03418abde btrfs: make sure that WRITTEN is set on all metadata blocks
We previously would call btrfs_check_leaf() if we had the check
integrity code enabled, which meant that we could only run the extended
leaf checks if we had WRITTEN set on the header flags.

This leaves a gap in our checking, because we could end up with
corruption on disk where WRITTEN isn't set on the leaf, and then the
extended leaf checks don't get run which we rely on to validate all of
the item pointers to make sure we don't access memory outside of the
extent buffer.

However, since 732fab95ab ("btrfs: check-integrity: remove
CONFIG_BTRFS_FS_CHECK_INTEGRITY option") we no longer call
btrfs_check_leaf() from btrfs_mark_buffer_dirty(), which means we only
ever call it on blocks that are being written out, and thus have WRITTEN
set, or that are being read in, which should have WRITTEN set.

Add checks to make sure we have WRITTEN set appropriately, and then make
sure __btrfs_check_leaf() always does the item checking.  This will
protect us from file systems that have been corrupted and no longer have
WRITTEN set on some of the blocks.

This was hit on a crafted image tweaking the WRITTEN bit and reported by
KASAN as out-of-bound access in the eb accessors. The example is a dir
item at the end of an eb.

  [2.042] BTRFS warning (device loop1): bad eb member start: ptr 0x3fff start 30572544 member offset 16410 size 2
  [2.040] general protection fault, probably for non-canonical address 0xe0009d1000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI
  [2.537] KASAN: maybe wild-memory-access in range [0x0005088000000018-0x000508800000001f]
  [2.729] CPU: 0 PID: 2587 Comm: mount Not tainted 6.8.2 #1
  [2.729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  [2.621] RIP: 0010:btrfs_get_16+0x34b/0x6d0
  [2.621] RSP: 0018:ffff88810871fab8 EFLAGS: 00000206
  [2.621] RAX: 0000a11000000003 RBX: ffff888104ff8720 RCX: ffff88811b2288c0
  [2.621] RDX: dffffc0000000000 RSI: ffffffff81dd8aca RDI: ffff88810871f748
  [2.621] RBP: 000000000000401a R08: 0000000000000001 R09: ffffed10210e3ee9
  [2.621] R10: ffff88810871f74f R11: 205d323430333737 R12: 000000000000001a
  [2.621] R13: 000508800000001a R14: 1ffff110210e3f5d R15: ffffffff850011e8
  [2.621] FS:  00007f56ea275840(0000) GS:ffff88811b200000(0000) knlGS:0000000000000000
  [2.621] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [2.621] CR2: 00007febd13b75c0 CR3: 000000010bb50000 CR4: 00000000000006f0
  [2.621] Call Trace:
  [2.621]  <TASK>
  [2.621]  ? show_regs+0x74/0x80
  [2.621]  ? die_addr+0x46/0xc0
  [2.621]  ? exc_general_protection+0x161/0x2a0
  [2.621]  ? asm_exc_general_protection+0x26/0x30
  [2.621]  ? btrfs_get_16+0x33a/0x6d0
  [2.621]  ? btrfs_get_16+0x34b/0x6d0
  [2.621]  ? btrfs_get_16+0x33a/0x6d0
  [2.621]  ? __pfx_btrfs_get_16+0x10/0x10
  [2.621]  ? __pfx_mutex_unlock+0x10/0x10
  [2.621]  btrfs_match_dir_item_name+0x101/0x1a0
  [2.621]  btrfs_lookup_dir_item+0x1f3/0x280
  [2.621]  ? __pfx_btrfs_lookup_dir_item+0x10/0x10
  [2.621]  btrfs_get_tree+0xd25/0x1910

Reported-by: lei lu <llfamsec@gmail.com>
CC: stable@vger.kernel.org # 6.7+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy more details from report ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-02 22:11:13 +02:00
Qu Wenruo
b5357cb268 btrfs: qgroup: do not check qgroup inherit if qgroup is disabled
[BUG]
After kernel commit 86211eea8a ("btrfs: qgroup: validate
btrfs_qgroup_inherit parameter"), user space tool snapper will fail to
create snapshot using its timeline feature.

[CAUSE]
It turns out that, if using timeline snapper would unconditionally pass
btrfs_qgroup_inherit parameter (assigning the new snapshot to qgroup 1/0)
for snapshot creation.

In that case, since qgroup is disabled there would be no qgroup 1/0, and
btrfs_qgroup_check_inherit() would return -ENOENT and fail the whole
snapshot creation.

[FIX]
Just skip the check if qgroup is not enabled.
This is to keep the older behavior for user space tools, as if the
kernel behavior changed for user space, it is a regression of kernel.

Thankfully snapper is also fixing the behavior by detecting if qgroup is
running in the first place, so the effect should not be that huge.

Link: https://github.com/openSUSE/snapper/issues/894
Fixes: 86211eea8a ("btrfs: qgroup: validate btrfs_qgroup_inherit parameter")
CC: stable@vger.kernel.org # 6.8+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-02 21:30:14 +02:00
Jakub Kicinski
e958da0ddb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/linux/filter.h
kernel/bpf/core.c
  66e13b615a ("bpf: verifier: prevent userspace memory access")
  d503a04f8b ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02 12:06:25 -07:00
Linus Torvalds
f03359bca0 for-6.9-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYzivoACgkQxWXV+ddt
 WDu4TxAAgK+W1RSvrc2xe6MfHFMi2x2pL2qM0IEcYbmjNZJDQlmGYNj3jILho62/
 /mHyA5skMr9hN58FFUJveiBj3qOds/lZD0640sGGpysFJKzA4/Wdg5xJvpsQtyDM
 jr6BcgZOQ+j7Pqe7zsm/sc0n5yG4P+cydnlCFMNvpRfZjg1kYIV9F92qEPAHtLCx
 BoDJyHhCEqFWWyH2nALu3syTHyvGECUCBEHLFgyGcG/IXT6Oq/BpsDZPm1j72NCt
 9f58OY7/2R9QJYfCjYidFGnr3EYdI5CnCOtR2sQLcRUOISOOQSni52r5tonPdpm2
 7QRPyuXTiVxpM909phGJt5wwyssK/JQgxUjUo3s0U04+qXb3cRoJny3vAcGcnuyk
 W7lYh08QRQa3dzZ/Q+GFxqPPovdZalTHXYMAYP7QGwLuv+fZkqh39oz6LQfw7F7c
 JxEjuSCSd8lJpFyIDkirZF9lELurjgt0Zn3RNe25BLiBpeqFvTQdAYGo5wML3Ug0
 kHSmZVFC2En8Ad2AahpkGToVKGgUumo4RAZDiRGIUaHEoS7XfBbnPOAtC7Z1RKTS
 9N++XVtJ1/uYQiLM5afiZRtUTkA/jqjSNH/v3YYTS18SczKEOWlHnpJeQSWK0rD1
 rzbKZ+2MhBL5CGQnwkhUi0u07QorvMkQhWCHpf9au9rtUggg+nU=
 =zEs6
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - set correct ram_bytes when splitting ordered extent. This can be
   inconsistent on-disk but harmless as it's not used for calculations
   and it's only advisory for compression

 - fix lockdep splat when taking cleaner mutex in qgroups disable ioctl

 - fix missing mutex unlock on error path when looking up sys chunk for
   relocation

* tag 'for-6.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: set correct ram_bytes when splitting ordered extent
  btrfs: take the cleaner_mutex earlier in qgroup disable
  btrfs: add missing mutex_unlock in btrfs_relocate_sys_chunks()
2024-05-02 10:49:12 -07:00
Matthew Wilcox (Oracle)
75377ae754 gfs2: Simplify gfs2_read_super
Use submit_bio_wait() instead of hand-rolling our own synchronous
wait.  Also allocate the BIO on the stack since we're not deep in
the call stack at this point.

There's no need to kmap the page, since it isn't allocated from HIGHMEM.
Turn the GFP_NOFS allocation into GFP_KERNEL; if the page allocator
enters reclaim, we cannot be called as the filesystem has not yet been
initialised and so has no pages to reclaim.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-05-02 19:24:08 +02:00
Christophe JAILLET
e035af9f6e
seq_file: Simplify __seq_puts()
Change the implementation of the out-of-line __seq_puts() to simply be
a seq_write() call instead of duplicating the overflow/memcpy logic.

Suggested-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/7cebc1412d8d1338a7e52cc9291d00f5368c14e4.1713781332.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-02 16:28:20 +02:00
Christophe JAILLET
45751097ae
seq_file: Optimize seq_puts()
Most of seq_puts() usages are done with a string literal. In such cases,
the length of the string car be computed at compile time in order to save
a strlen() call at run-time. seq_putc() or seq_write() can then be used
instead.

This saves a few cycles.

To have an estimation of how often this optimization triggers:
   $ git grep seq_puts.*\" | wc -l
   3436

   $ git grep seq_puts.*\".\" | wc -l
   84

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/a8589bffe4830dafcb9111e22acf06603fea7132.1713781332.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christian Brauner <brauner@kernel.org>
The output for seq_putc() generation has also be checked and works.
2024-05-02 16:28:15 +02:00
Tyler Hicks (Microsoft)
0a960ba498
proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
The following commits loosened the permissions of /proc/<PID>/fdinfo/
directory, as well as the files within it, from 0500 to 0555 while also
introducing a PTRACE_MODE_READ check between the current task and
<PID>'s task:

 - commit 7bc3fa0172 ("procfs: allow reading fdinfo with PTRACE_MODE_READ")
 - commit 1927e498ae ("procfs: prevent unprivileged processes accessing fdinfo dir")

Before those changes, inode based system calls like inotify_add_watch(2)
would fail when the current task didn't have sufficient read permissions:

 [...]
 lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0500, st_size=0, ...}) = 0
 inotify_add_watch(64, "/proc/1/task/1/fdinfo",
		   IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|
		   IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = -1 EACCES (Permission denied)
 [...]

This matches the documented behavior in the inotify_add_watch(2) man
page:

 ERRORS
       EACCES Read access to the given file is not permitted.

After those changes, inotify_add_watch(2) started succeeding despite the
current task not having PTRACE_MODE_READ privileges on the target task:

 [...]
 lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
 inotify_add_watch(64, "/proc/1/task/1/fdinfo",
		   IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|
		   IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = 1757
 openat(AT_FDCWD, "/proc/1/task/1/fdinfo",
	O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 EACCES (Permission denied)
 [...]

This change in behavior broke .NET prior to v7. See the github link
below for the v7 commit that inadvertently/quietly (?) fixed .NET after
the kernel changes mentioned above.

Return to the old behavior by moving the PTRACE_MODE_READ check out of
the file .open operation and into the inode .permission operation:

 [...]
 lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
 inotify_add_watch(64, "/proc/1/task/1/fdinfo",
		   IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|
		   IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = -1 EACCES (Permission denied)
 [...]

Reported-by: Kevin Parsons (Microsoft) <parsonskev@gmail.com>
Link: 89e5469ac5
Link: https://stackoverflow.com/questions/75379065/start-self-contained-net6-build-exe-as-service-on-raspbian-system-unauthorizeda
Fixes: 7bc3fa0172 ("procfs: allow reading fdinfo with PTRACE_MODE_READ")
Cc: stable@vger.kernel.org
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Hardik Garg <hargar@linux.microsoft.com>
Cc: Allen Pais <apais@linux.microsoft.com>
Signed-off-by: Tyler Hicks (Microsoft) <code@tyhicks.com>
Link: https://lore.kernel.org/r/20240501005646.745089-1-code@tyhicks.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-02 11:42:04 +02:00
David Howells
7c1ac89480 cifs: Enable large folio support
Now that cifs is using netfslib for its VM interaction, it only sees I/O in
terms of iov_iter iterators and does not see pages or folios.  This makes
large multipage folios transparent to cifs and so we can turn on multipage
folios on regular files.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:08:22 +01:00
David Howells
b593634424 cifs: Remove some code that's no longer used, part 3
Remove some code that was #if'd out with the netfslib conversion.  This is
split into parts for file.c as the diff generator otherwise produces a hard
to read diff for part of it where a big chunk is cut out.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:08:22 +01:00
David Howells
2f99c0bce6 cifs: Remove some code that's no longer used, part 2
Remove some code that was #if'd out with the netfslib conversion.  This is
split into parts for file.c as the diff generator otherwise produces a hard
to read diff for part of it where a big chunk is cut out.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:08:22 +01:00
David Howells
742b3443e2 cifs: Remove some code that's no longer used, part 1
Remove some code that was #if'd out with the netfslib conversion.  This is
split into parts for file.c as the diff generator otherwise produces a hard
to read diff for part of it where a big chunk is cut out.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:08:21 +01:00
David Howells
3ee1a1fc39 cifs: Cut over to using netfslib
Make the cifs filesystem use netfslib to handle reading and writing on
behalf of cifs.  The changes include:

 (1) Various read_iter/write_iter type functions are turned into wrappers
     around netfslib API functions or are pointed directly at those
     functions:

	cifs_file_direct{,_nobrl}_ops switch to use
	netfs_unbuffered_read_iter and netfs_unbuffered_write_iter.

Large pieces of code that will be removed are #if'd out and will be removed
in subsequent patches.

[?] Why does cifs mark the page dirty in the destination buffer of a DIO
    read?  Should that happen automatically?  Does netfs need to do that?

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:08:21 +01:00
David Howells
69c3c023af cifs: Implement netfslib hooks
Provide implementation of the netfslib hooks that will be used by netfslib
to ask cifs to set up and perform operations.  Of particular note are

 (*) cifs_clamp_length() - This is used to negotiate the size of the next
     subrequest in a read request, taking into account the credit available
     and the rsize.  The credits are attached to the subrequest.

 (*) cifs_req_issue_read() - This is used to issue a subrequest that has
     been set up and clamped.

 (*) cifs_prepare_write() - This prepares to fill a subrequest by picking a
     channel, reopening the file and requesting credits so that we can set
     the maximum size of the subrequest and also sets the maximum number of
     segments if we're doing RDMA.

 (*) cifs_issue_write() - This releases any unneeded credits and issues an
     asynchronous data write for the contiguous slice of file covered by
     the subrequest.  This should possibly be folded in to all
     ->async_writev() ops and that called directly.

 (*) cifs_begin_writeback() - This gets the cached writable handle through
     which we do writeback (this does not affect writethrough, unbuffered
     or direct writes).

At this point, cifs is not wired up to actually *use* netfslib; that will
be done in a subsequent patch.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:08:21 +01:00
David Howells
c20c0d7325 cifs: Make add_credits_and_wake_if() clear deducted credits
Make add_credits_and_wake_if() clear the amount of credits in the
cifs_credits struct after it has returned them to the overall counter.
This allows add_credits_and_wake_if() to be called multiple times during
the error handling and cleanup without accidentally returning the credits
again and again.

Note that the wake_up() in add_credits_and_wake_if() may also be
superfluous as ->add_credits() also does a wake on the request_q.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:08:21 +01:00
David Howells
edea94a697 cifs: Add mempools for cifs_io_request and cifs_io_subrequest structs
Add mempools for the allocation of cifs_io_request and cifs_io_subrequest
structs for netfslib to use so that it can guarantee eventual allocation in
writeback.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:08:20 +01:00
David Howells
3758c485f6 cifs: Set zero_point in the copy_file_range() and remap_file_range()
Set zero_point in the copy_file_range() and remap_file_range()
implementations so that we don't skip reading data modified on a
server-side copy.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:08:20 +01:00
David Howells
1a5b4edd97 cifs: Move cifs_loose_read_iter() and cifs_file_write_iter() to file.c
Move cifs_loose_read_iter() and cifs_file_write_iter() to file.c so that
they are colocated with similar functions rather than being split with
cifsfs.c.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:08:20 +01:00
David Howells
dc5939de82 cifs: Replace the writedata replay bool with a netfs sreq flag
Replace the 'replay' bool in cifs_writedata (now cifs_io_subrequest) with a
flag in the netfs_io_subrequest flags.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:08:19 +01:00
David Howells
56257334e8 cifs: Make wait_mtu_credits take size_t args
Make the wait_mtu_credits functions use size_t for the size and num
arguments rather than unsigned int as netfslib uses size_t/ssize_t for
arguments and return values to allow for extra capacity.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-cachefs@redhat.com
cc: netfs@lists.linux.dev
cc: linux-mm@kvack.org
2024-05-01 18:08:19 +01:00
David Howells
ab58fbdeeb cifs: Use more fields from netfs_io_subrequest
Use more fields from netfs_io_subrequest instead of those incorporated into
cifs_io_subrequest from cifs_readdata and cifs_writedata.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:08:19 +01:00
David Howells
a975a2f22c cifs: Replace cifs_writedata with a wrapper around netfs_io_subrequest
Replace the cifs_writedata struct with the same wrapper around
netfs_io_subrequest that was used to replace cifs_readdata.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:08:18 +01:00
David Howells
753b67eb63 cifs: Replace cifs_readdata with a wrapper around netfs_io_subrequest
Netfslib has a facility whereby the allocation for netfs_io_subrequest can
be increased to so that filesystem-specific data can be tagged on the end.

Prepare to use this by making a struct, cifs_io_subrequest, that wraps
netfs_io_subrequest, and absorb struct cifs_readdata into it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:08:18 +01:00
David Howells
0f7c0f3f51 cifs: Use alternative invalidation to using launder_folio
Use writepages-based flushing invalidation instead of
invalidate_inode_pages2() and ->launder_folio().  This will allow
->launder_folio() to be removed eventually.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:08:18 +01:00
David Howells
1ecb146f7c netfs, afs: Use writeback retry to deal with alternate keys
Use a hook in the new writeback code's retry algorithm to rotate the keys
once all the outstanding subreqs have failed rather than doing it
separately on each subreq.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:38 +01:00
David Howells
d41ca44c20 netfs: Miscellaneous tidy ups
Do a couple of miscellaneous tidy ups:

 (1) Add a qualifier into a file banner comment.

 (2) Put the writeback folio traces back into alphabetical order.

 (3) Remove some unused folio traces.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:38 +01:00
David Howells
c245868524 netfs: Remove the old writeback code
Remove the old writeback code.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:38 +01:00
David Howells
2df86547b2 netfs: Cut over to using new writeback code
Cut over to using the new writeback code.  The old code is #ifdef'd out or
otherwise removed from compilation to avoid conflicts and will be removed
in a future patch.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:37 +01:00
David Howells
64e64e6c18 netfs, cachefiles: Implement helpers for new write code
Implement the helpers for the new write code in cachefiles.  There's now an
optional ->prepare_write() that allows the filesystem to set the parameters
for the next write, such as maximum size and maximum segment count, and an
->issue_write() that is called to initiate an (asynchronous) write
operation.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-erofs@lists.ozlabs.org
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:37 +01:00
David Howells
5fb70e7275 netfs, 9p: Implement helpers for new write code
Implement the helpers for the new write code in 9p.  There's now an
optional ->prepare_write() that allows the filesystem to set the parameters
for the next write, such as maximum size and maximum segment count, and an
->issue_write() that is called to initiate an (asynchronous) write
operation.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: v9fs@lists.linux.dev
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:37 +01:00
David Howells
ed22e1dbf8 netfs, afs: Implement helpers for new write code
Implement the helpers for the new write code in afs.  There's now an
optional ->prepare_write() that allows the filesystem to set the parameters
for the next write, such as maximum size and maximum segment count, and an
->issue_write() that is called to initiate an (asynchronous) write
operation.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:36 +01:00
David Howells
4824e5917f netfs: Add some write-side stats and clean up some stat names
Add some write-side stats to count buffered writes, buffered writethrough,
and writepages calls.

Whilst we're at it, clean up the naming on some of the existing stats
counters and organise the output into two sets.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:36 +01:00
David Howells
288ace2f57 netfs: New writeback implementation
The current netfslib writeback implementation creates writeback requests of
contiguous folio data and then separately tiles subrequests over the space
twice, once for the server and once for the cache.  This creates a few
issues:

 (1) Every time there's a discontiguity or a change between writing to only
     one destination or writing to both, it must create a new request.
     This makes it harder to do vectored writes.

 (2) The folios don't have the writeback mark removed until the end of the
     request - and a request could be hundreds of megabytes.

 (3) In future, I want to support a larger cache granularity, which will
     require aggregation of some folios that contain unmodified data (which
     only need to go to the cache) and some which contain modifications
     (which need to be uploaded and stored to the cache) - but, currently,
     these are treated as discontiguous.

There's also a move to get everyone to use writeback_iter() to extract
writable folios from the pagecache.  That said, currently writeback_iter()
has some issues that make it less than ideal:

 (1) there's no way to cancel the iteration, even if you find a "temporary"
     error that means the current folio and all subsequent folios are going
     to fail;

 (2) there's no way to filter the folios being written back - something
     that will impact Ceph with it's ordered snap system;

 (3) and if you get a folio you can't immediately deal with (say you need
     to flush the preceding writes), you are left with a folio hanging in
     the locked state for the duration, when really we should unlock it and
     relock it later.

In this new implementation, I use writeback_iter() to pump folios,
progressively creating two parallel, but separate streams and cleaning up
the finished folios as the subrequests complete.  Either or both streams
can contain gaps, and the subrequests in each stream can be of variable
size, don't need to align with each other and don't need to align with the
folios.

Indeed, subrequests can cross folio boundaries, may cover several folios or
a folio may be spanned by multiple folios, e.g.:

         +---+---+-----+-----+---+----------+
Folios:  |   |   |     |     |   |          |
         +---+---+-----+-----+---+----------+

           +------+------+     +----+----+
Upload:    |      |      |.....|    |    |
           +------+------+     +----+----+

         +------+------+------+------+------+
Cache:   |      |      |      |      |      |
         +------+------+------+------+------+

The progressive subrequest construction permits the algorithm to be
preparing both the next upload to the server and the next write to the
cache whilst the previous ones are already in progress.  Throttling can be
applied to control the rate of production of subrequests - and, in any
case, we probably want to write them to the server in ascending order,
particularly if the file will be extended.

Content crypto can also be prepared at the same time as the subrequests and
run asynchronously, with the prepped requests being stalled until the
crypto catches up with them.  This might also be useful for transport
crypto, but that happens at a lower layer, so probably would be harder to
pull off.

The algorithm is split into three parts:

 (1) The issuer.  This walks through the data, packaging it up, encrypting
     it and creating subrequests.  The part of this that generates
     subrequests only deals with file positions and spans and so is usable
     for DIO/unbuffered writes as well as buffered writes.

 (2) The collector. This asynchronously collects completed subrequests,
     unlocks folios, frees crypto buffers and performs any retries.  This
     runs in a work queue so that the issuer can return to the caller for
     writeback (so that the VM can have its kswapd thread back) or async
     writes.

 (3) The retryer.  This pauses the issuer, waits for all outstanding
     subrequests to complete and then goes through the failed subrequests
     to reissue them.  This may involve reprepping them (with cifs, the
     credits must be renegotiated, and a subrequest may need splitting),
     and doing RMW for content crypto if there's a conflicting change on
     the server.

[!] Note that some of the functions are prefixed with "new_" to avoid
clashes with existing functions.  These will be renamed in a later patch
that cuts over to the new algorithm.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:36 +01:00
David Howells
7ba167c4c7 netfs: Switch to using unsigned long long rather than loff_t
Switch to using unsigned long long rather than loff_t in netfslib to avoid
problems with the sign flipping in the maths when we're dealing with the
byte at position 0x7fffffffffffffff.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:35 +01:00
David Howells
d9f85a04fb netfs: Use mempools for allocating requests and subrequests
Use mempools for allocating requests and subrequests in an effort to make
sure that allocation always succeeds so that when performing writeback we
can always make progress.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-05-01 18:07:35 +01:00
David Howells
b4ff7b178b netfs: Remove ->launder_folio() support
Remove support for ->launder_folio() from netfslib and expect filesystems
to use filemap_invalidate_inode() instead.  netfs_launder_folio() can then
be got rid of.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <sfrench@samba.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: devel@lists.orangefs.org
2024-05-01 18:07:34 +01:00
David Howells
d73065e60d afs: Use alternative invalidation to using launder_folio
Use writepages-based flushing invalidation instead of
invalidate_inode_pages2() and ->launder_folio().  This will allow
->launder_folio() to be removed eventually.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:34 +01:00
David Howells
40fb4828d5 9p: Use alternative invalidation to using launder_folio
Use writepages-based flushing invalidation instead of
invalidate_inode_pages2() and ->launder_folio().  This will allow
->launder_folio() to be removed eventually.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: v9fs@lists.linux.dev
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:34 +01:00
David Howells
74e797d79c mm: Provide a means of invalidation without using launder_folio
Implement a replacement for launder_folio.  The key feature of
invalidate_inode_pages2() is that it locks each folio individually, unmaps
it to prevent mmap'd accesses interfering and calls the ->launder_folio()
address_space op to flush it.  This has problems: firstly, each folio is
written individually as one or more small writes; secondly, adjacent folios
cannot be added so easily into the laundry; thirdly, it's yet another op to
implement.

Instead, use the invalidate lock to cause anyone wanting to add a folio to
the inode to wait, then unmap all the folios if we have mmaps, then,
conditionally, use ->writepages() to flush any dirty data back and then
discard all pages.

The invalidate lock prevents ->read_iter(), ->write_iter() and faulting
through mmap all from adding pages for the duration.

This is then used from netfslib to handle the flusing in unbuffered and
direct writes.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christian Brauner <brauner@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: devel@lists.orangefs.org
2024-05-01 18:07:06 +01:00
Qu Wenruo
63a6ce5a1a btrfs: set correct ram_bytes when splitting ordered extent
[BUG]
When running generic/287, the following file extent items can be
generated:

        item 16 key (258 EXTENT_DATA 2682880) itemoff 15305 itemsize 53
                generation 9 type 1 (regular)
                extent data disk byte 1378414592 nr 462848
                extent data offset 0 nr 462848 ram 2097152
                extent compression 0 (none)

Note that file extent item is not a compressed one, but its ram_bytes is
way larger than its disk_num_bytes.

According to btrfs on-disk scheme, ram_bytes should match disk_num_bytes
if it's not a compressed one.

[CAUSE]
Since commit b73a6fd1b1 ("btrfs: split partial dio bios before
submit"), for partial dio writes, we would split the ordered extent.

However the function btrfs_split_ordered_extent() doesn't update the
ram_bytes even it has already shrunk the disk_num_bytes.

Originally the function btrfs_split_ordered_extent() is only introduced
for zoned devices in commit d22002fd37 ("btrfs: zoned: split ordered
extent when bio is sent"), but later commit b73a6fd1b1 ("btrfs: split
partial dio bios before submit") makes non-zoned btrfs affected.

Thankfully for un-compressed file extent, we do not really utilize the
ram_bytes member, thus it won't cause any real problem.

[FIX]
Also update btrfs_ordered_extent::ram_bytes inside
btrfs_split_ordered_extent().

Fixes: d22002fd37 ("btrfs: zoned: split ordered extent when bio is sent")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-30 12:03:44 +02:00
Linus Torvalds
a91bae8794 nfsd-6.9 fixes:
- Avoid freeing unallocated memory (v6.7 regression)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmYv/dwACgkQM2qzM29m
 f5dcFQ//R5MRfOwLFqvNO2wwh750ijbR9tlUn9ce+PVQRpM1m1AhcDD9h98ZxNjn
 8rwxMPTJVd0E8GV5BG5B7vaYYQySuU6C9LAbDDFL1mWOSf2VFIY8U6YntqasQ77K
 HGXdFhHKlCLCwIGTvQgJ4Dnu6Ds2mjm83wH9IZ0SXdwHZsr5MeSiOmtpjDZn/Efm
 isx+OBjf4uYEwo02qzMzdX5mmj2eXDwincjK9dEUXYn6yyzIDEwo/wUDtzsaxuui
 TKm1Ni0uul8idukNOH+Clrp+sSDN+bR57mLhBZuToGxBbdMLaHBHOABYSX5BIDKv
 YCHYaIFrya9dtcZdTk5sIEj94hSwl63PP37RZTS/o/djMNWYNkiUmay+UtRpDkk2
 xbzx+C2hgb5OeGiI4ZDBrcIDi5bf5fIAT6yE8hfL853phloH15jlup9dnZQlB4B3
 PsdQEdJZY3uoOQSJ31/okDRFSxR98xClZxZX+N7zdXWcYtP1LXUk+vNTt44riw8C
 i3MBBWF2kxsGQ/r5J0phFRKunWCRjUQD36co4m7hJUYHqdJq2jNnSvk6O6F4R+yv
 IRQOs+qpX2TdMkNJ2jGaV8d4pt0f5bMLPeh3vl3Smf1xzfX+X0E4trtXTsw0nIrD
 m52ZV93Zag3Q4SkQ3ir5r/lfqX8sybJbVqZba85abtiqSSDIsFY=
 =GqdB
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.9-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

 - Avoid freeing unallocated memory (v6.7 regression)

* tag 'nfsd-6.9-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Fix nfsd4_encode_fattr4() crasher
2024-04-29 14:22:24 -07:00
Linus Torvalds
9e4bc4bcae NFS client bugfixes for Linux 6.9
Bugfixes:
  - Fix an Oops in xs_tcp_tls_setup_socket
  - Fix an Oops due to missing error handling in nfs_net_init()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmYv7PEACgkQZwvnipYK
 APKcgA//XE7iII6qQU9IC6jiv44qO7NtB6Zy3PQFCWO2ssMSqXbc4lO2eCmR/1nA
 3Mlf1RwPxp+M+iOqFgANZV7voQ/r6djEyM1ycr+J2G/mfoxmKMVmnvg3lcyAfYNj
 3fm6n0t8ZCkb3URoO4K0ejw007QfN2zpCL2psKucBdahOX7OYHT45o02liKN8ge5
 0h7XKUjDStKsId4y5UVNB+QUeaQaWzKKMCzTzX4CxfHXZpIjbDjkdJ9WxAue2+Th
 5yvNtPKkTi9EYwHjFAgN7ZKAC7Gu+jZXs9ewdqdyaSfYlioGk7ALz2ZZSptKb5zr
 +nwuR+SC4Yzg5uBdjOLXS7/6Z/CVyp2bmgoFAzrP8cC0zfB7wJMNwUTZbUx3d823
 Q7xYwecj1F9PE5CjHJlYpFZiKkiMHB242EFmBrcUR+yoRPHRgEg32UpI1/YWYnVO
 pG+Hto0O8JnlGzkqslKy/qN7OMFgNTli+nrFnJT7TxDk27GpOY3gv271164TgeUt
 MOk7iY6QjDqk6Zpzbg5AMcq6UhB+QUe76XAAFtDvjXLLBbsRckmbNTBbkFv6/O3p
 a3bOd7oeugNIaRJJaR/lQ/EVAoBSeNUUj5G1ivoXDWsNXE+ZKNNdYyrCociVgI6S
 j31k3XtbGVMk8M/7Lu5slPOGL4xqDtJAddF0eE7buQhRZNX7pPA=
 =OtQc
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:

 - Fix an Oops in xs_tcp_tls_setup_socket

 - Fix an Oops due to missing error handling in nfs_net_init()

* tag 'nfs-for-6.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: Handle error of rpc_proc_register() in nfs_net_init().
  SUNRPC: add a missing rpc_stat for TCP TLS
2024-04-29 12:07:37 -07:00
Linus Torvalds
0a2e230514 bcachefs fixes for 6.9-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmYvyT4ACgkQE6szbY3K
 bnYbOA/8Cot9WqlyXy+OTm6kVx4S2atMruZ/cj+3jQR+dZbMv375YDBuMiqtg/vW
 Cd2zjnRwoDNGxueqpjK+A66eEHm6tcVFQ7zp+Nb5FeEHAMQcxi/3PpqxkM7zjX9q
 YQ/dUiWae/oCuM/u50/Hw3xn9VkmXtahmfTjFHe6xqfbrZf0HGyNOGBgMInI8Wa6
 tCABISS2FAe/+8D5UzL0sxLbqdvfjeyFHbOfZiGhdTPCDsUDKTwZHsOnV9Z37y3W
 6Ne1sduiEPRJl6U6faTuhgXqKg6oBBcjjtWvXPAMYxLn4/zt79sY5mwTaCAAI733
 pq/WBwD0EESFCIUeZAA5vEQB2hvIOy9fdaO9tIxWrnXlbEfoZdbQNzL0Kr9//jpf
 qXz52nLp17rdJ5bsFqR+zITSckziud5p9Yqsp+zJxLda/VjteXG3efMG2giK9GJW
 yoVVgQOY/P3T9t1bgJtitu21kjCCv7vIxTjFjmIKIspUk+hpKk2W0Xv4ZSVymXP6
 AgxFGgGulMm67+V+YTaSo9AspobDYwrW5B645e5wNNpBG/igikwUgKh339lah6zt
 J27IUTTRIAof53uVT2O1AOMu6iWIw3DfMd3kvvLpuoG1032HcDEUfQForUD+6AwS
 3b3BuMeXNrZ60Oj8/tdobxN2vnQTF8+kOdma8/IJ3KYXtT3ccTA=
 =oGU4
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-04-29' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Tiny set of fixes this time"

* tag 'bcachefs-2024-04-29' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: fix integer conversion bug
  bcachefs: btree node scan now fills in sectors_written
  bcachefs: Remove accidental debug assert
2024-04-29 11:04:40 -07:00