Currently, users who want to be woken up while waiting for events have to
submit a timeout command first. That is not safe for applications that
split SQ and CQ handling between two threads, such as mysql: the two
threads have to be synchronized explicitly to protect the SQ, and that
impacts performance.
This patch adds support for timeout to existing io_uring_enter(). To
avoid overloading arguments, it introduces a new parameter structure
which contains sigmask and timeout.
I have tested a workload with one thread submitting nop requests while
the other reaps the CQEs with a timeout. It shows a 1.8~2x speedup
when the iodepth is 16.
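As a usage illustration (not part of this patch), liburing wraps this
mechanism in io_uring_wait_cqe_timeout(); a minimal sketch of a reaper
thread waiting with a timeout could look like this:

  #include <liburing.h>
  #include <stdio.h>

  int main(void)
  {
          struct io_uring ring;
          struct io_uring_cqe *cqe;
          /* wait at most one second for a completion */
          struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };

          if (io_uring_queue_init(8, &ring, 0) < 0)
                  return 1;
          /* another thread would submit requests to the same ring */
          if (io_uring_wait_cqe_timeout(&ring, &cqe, &ts) == 0) {
                  printf("cqe res %d\n", cqe->res);
                  io_uring_cqe_seen(&ring, cqe);
          }
          io_uring_queue_exit(&ring);
          return 0;
  }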
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
[axboe: various cleanups/fixes, and name change to SIG_IS_DATA]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We unconditionally call blk_start_plug() when starting the IO
submission, but we should only really do that if we have more than one
request to submit AND we're potentially dealing with block based storage
underneath. For any other type of request, it's just a waste of time to
do so.
Add a ->plug bit to io_op_def and set it for read/write requests. We
could make this more precise and check the file itself as well, but it
doesn't matter that much and would quickly become more expensive.
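As an aside, a hypothetical userspace sketch of the decision (the real
io_uring table layout and submission state differ; this only illustrates
the check):

  #include <stdbool.h>
  #include <stdio.h>

  /* hypothetical mirror of the per-opcode table, not the real layout */
  struct io_op_def {
          bool plug;      /* set only for block-backed read/write opcodes */
  };

  static const struct io_op_def io_op_defs[] = {
          { .plug = false },      /* e.g. a nop-like request */
          { .plug = true },       /* e.g. a read/write request */
  };

  static bool should_plug(unsigned int opcode, unsigned int ios_left)
  {
          /* plug only for >1 request and potentially block based IO */
          return ios_left > 1 && io_op_defs[opcode].plug;
  }

  int main(void)
  {
          printf("%d %d\n", should_plug(1, 16), should_plug(0, 16));
          return 0;
  }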
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We've got 8 spare bytes in the 2nd cacheline; put ->fixed_file_refs
there, so the inline execution path mostly doesn't touch the 3rd
cacheline for fixed_file requests as well.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A singly linked list is enough for keeping linked requests, because we
almost always operate on the head and traverse forward, with the
exception of linked timeouts going 1 hop backwards.
Replace ->link_list with a handmade singly linked list. Also kill
REQ_F_LINK_HEAD in favour of checking a newly added ->list for NULL
directly.
That saves 8B in io_kiocb, is not as heavy as list fixup, makes better
use of the cache by not touching a previous request (i.e. the last
request of the link) on each list modification, and optimises cache use
further in the following patch. It also makes traversal easier, removing
some lines in the end. Also, keeping the invariant in ->list instead of
having REQ_F_LINK_HEAD is less error-prone.
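For illustration only (field and type names here are made up, not the
io_uring ones), a singly linked chain where submission appends via a
saved tail and completion walks forward from the head could look like:

  #include <stdio.h>

  struct req {
          int id;
          struct req *link;       /* next in chain, NULL terminates */
  };

  /* append via a saved tail, so no earlier request is touched */
  static void link_req(struct req **head, struct req **tail, struct req *r)
  {
          r->link = NULL;
          if (*head)
                  (*tail)->link = r;
          else
                  *head = r;
          *tail = r;
  }

  int main(void)
  {
          struct req a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };
          struct req *head = NULL, *tail = NULL;

          link_req(&head, &tail, &a);
          link_req(&head, &tail, &b);
          link_req(&head, &tail, &c);
          /* completion path: operate on the head, traverse forward */
          for (struct req *r = head; r; r = r->link)
                  printf("req %d\n", r->id);
          return 0;
  }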
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for converting to singly linked lists for chaining
requests, make linked timeouts save the requests that they're
responsible for, instead of counting on the doubly linked list for back
referencing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Explicitly save not only a link's head in io_submit_sqe[s]() but the
tail as well. That's in preparation for keeping linked requests in a
singly linked list.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't use a single struct for polls and poll remove requests, they have
totally different layouts.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass in the struct filename pointers instead of the user string, and
update the three callers to do the same.
This behaves like do_unlinkat(), which also takes a filename struct and
puts it when it is done. Converting callers is then trivial.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that SQPOLL supports non-registered files and grabs the file table,
we can relax the restriction on open/close/accept/connect and allow
them on a ring that is setup with IORING_SETUP_SQPOLL.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The restriction of needing fixed files for SQPOLL is problematic, and
prevents/inhibits several valid use cases. With the referenced
files_struct that we have now, it's trivially supportable.
Treat ->files like we do the mm for the SQPOLL thread - grab a reference
to it (and assign it), and drop it when we're done.
This feature is exposed as IORING_FEAT_SQPOLL_NONFIXED.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since btrfs scrub uses its own infrastructure to submit reads and
writes, scrub is independent from all other routines.
This brings one very neat feature: it allows us to read 4K of data into
offset 0 of a 64K page, and the same applies to the writeback routine.
This makes scrub on subpage sector sizes much easier to implement, and
thanks to previous commits which changed the implementation to always
scrub based on sector size, scrub can now handle subpage filesystems
without any problem.
This patch just removes the restriction on
(sectorsize != PAGE_SIZE), to make scrub finally work on subpage
filesystems.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs scrub is more flexible than the buffered data write path, as we
can read unaligned subpage data into page offset 0.
This ability makes subpage support much easier: we just need to check
each scrub_page::page_len and ensure we only calculate the hash for [0,
page_len) of a page.
There is a small thing to notice: for the subpage case, we still scrub
sector by sector. This means we will submit a read bio for each sector
to scrub, resulting in the same amount of read bios as on 4K page
systems.
This behavior can be considered a good thing, if we want everything
to be the same as on 4K page systems. But it also means we're passing up
the chance to submit larger bios using the 64K page size. This is
another problem to consider in the future.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To support subpage tree block scrub, scrub_checksum_tree_block() only
needs to learn 2 new tricks:
- Follow sector size
Now that scrub_page only represents one sector, we need to follow it
properly.
- Run checksum on all sectors
Since scrub_page only represents one sector, we need to run the checksum
on all sectors of the tree block, not only on (nodesize >> PAGE_SHIFT)
pages (see the sketch below).
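A minimal userspace sketch of that per-sector checksum walk (a toy
additive checksum stands in for the real algorithm; sizes assume a 64K
nodesize and 4K sectors):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define NODESIZE        (64 * 1024)
  #define SECTORSIZE      (4 * 1024)

  /* toy stand-in for the real checksum update */
  static uint32_t csum_update(uint32_t csum, const uint8_t *buf, size_t len)
  {
          for (size_t i = 0; i < len; i++)
                  csum += buf[i];
          return csum;
  }

  int main(void)
  {
          static uint8_t tree_block[NODESIZE];
          uint32_t csum = 0;

          memset(tree_block, 0xab, sizeof(tree_block));
          /* walk every sector of the tree block, not just the first page */
          for (size_t off = 0; off < NODESIZE; off += SECTORSIZE)
                  csum = csum_update(csum, tree_block + off, SECTORSIZE);
          printf("csum %u over %d sectors\n", csum, NODESIZE / SECTORSIZE);
          return 0;
  }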
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For scrub_pages() and scrub_pages_for_parity(), we currently allocate
one scrub_page structure for one page.
This is fine if we only read/write one sector at a time. But for cases
like scrubbing RAID56, we need to read/write the full stripe, which is
64K in size for now.
For subpage sizes, we will submit the read in just one page, which is
normally a good thing, but for the RAID56 case, its endio function only
expects to see one sector, not the full stripe.
This could lead to wrong parity checksums for RAID56 on subpage.
To make the existing code work well for the subpage case, here we take a
shortcut by always allocating a full page for one sector.
This should provide the base to make RAID56 work for the subpage case.
The cost is pretty obvious now: for one RAID56 stripe we always need 16
pages. For the subpage situation (64K page size, 4K sector size), this
means we need a full megabyte to scrub just one RAID56 stripe.
And for data scrub, each 4K sector will also need one 64K page.
This is mostly just a workaround; the proper fix is a much larger
project, replacing scrub_page with scrub_block and allowing scrub_block
to handle multiple pages, csums, and a csum_bitmap, to avoid allocating
one page for each sector.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs on-disk format chose to use u64 for almost everything, but
there are other restrictions that won't let us use more than u32 for
things like extent length (the maximum length is 128MiB for non-hole
extents), or stripe length (we have a device number limit).
This means if we don't have extra handling to convert u64 to u32, we
will always have some questionable operations like
"u32 = u64 >> sectorsize_bits" in the code.
This patch will try to address the problem by reducing the width for the
following members/parameters:
- scrub_parity::stripe_len
- @len of scrub_pages()
- @extent_len of scrub_remap_extent()
- @len of scrub_parity_mark_sectors_error()
- @len of scrub_parity_mark_sectors_data()
- @len of scrub_extent()
- @len of scrub_pages_for_parity()
- @len of scrub_extent_for_parity()
For members extracted from on-disk structures, like map->stripe_len,
they will be kept as is, since that modification would require an
on-disk format change.
There will be cases like "u32 = u64 - u64" or "u32 = u64"; for such call
sites, an extra ASSERT() is added to be extra safe for debug builds.
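A minimal sketch of the pattern, using a plain assert() in place of
btrfs' ASSERT() macro (values here are made up):

  #include <assert.h>
  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint64_t stripe_end = 3ULL << 16;    /* values from on-disk u64s */
          uint64_t stripe_start = 1ULL << 16;

          /* guard the narrowing: the difference must fit in u32 */
          assert(stripe_end - stripe_start <= UINT32_MAX);
          uint32_t stripe_len = stripe_end - stripe_start;

          printf("stripe_len %u\n", stripe_len);
          return 0;
  }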
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is the data reloc inode; we will use a new
function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
  for (cur_disk_bytenr = orig_disk_bytenr;
       cur_disk_bytenr < orig_disk_bytenr + orig_len;
       cur_disk_bytenr += count * sectorsize) {
          /* Lookup csum tree */
          count = search_csum_tree(fs_info, path, cur_disk_bytenr,
                                   search_len, csum_dst);
          if (!count) {
                  /* Csum hole handling */
          }
  }
- Use a single variable as the source to calculate all other offsets
Instead of several variables of different types, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and all those
variables will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially with regard to the file offset lookup. Now the
file_offset lookup is only related to the data reloc inode; otherwise we
don't need to bother with file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_lookup_bio_sums() is only called for read bios, while
btrfs_find_ordered_sum() searches ordered extent sums, which is only for
the write path.
This means that to read a page we either:
- Submit a read bio if it's not uptodate
This means we only need to search the csum tree for checksums.
- The page is already uptodate
It can be marked uptodate from a previous read, or by being marked
dirty, as we always mark a page uptodate when it's dirty.
In that case, we don't need to submit a read bio at all, thus there is
no need to search for any checksums.
Remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums().
And since btrfs_lookup_bio_sums() is the only caller for
btrfs_find_ordered_sum(), also remove the implementation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To support sectorsize < PAGE_SIZE case, we need to take extra care of
extent buffer accessors.
Since sectorsize is smaller than PAGE_SIZE, one page can contain
multiple tree blocks, we must use eb->start to determine the real offset
to read/write for extent buffer accessors.
This patch introduces two helpers to do this:
- get_eb_page_index()
This is to calculate the index to access extent_buffer::pages.
It's just a simple wrapper around "start >> PAGE_SHIFT".
For sectorsize == PAGE_SIZE case, nothing is changed.
For sectorsize < PAGE_SIZE case, we always get index as 0, and
the existing page shift also works.
- get_eb_offset_in_page()
This is to calculate the offset to access extent_buffer::pages.
This needs to take extent_buffer::start into consideration.
For sectorsize == PAGE_SIZE case, extent_buffer::start is always
aligned to PAGE_SIZE, thus adding extent_buffer::start to
offset_in_page() won't change the result.
For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
us the correct offset to access.
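As a standalone sketch of the two helpers on a hypothetical 64K page
system (types simplified; the real get_eb_offset_in_page() takes the
extent_buffer itself):

  #include <stdio.h>

  #define PAGE_SIZE  65536UL      /* assume a 64K page system */
  #define PAGE_SHIFT 16

  static unsigned long get_eb_page_index(unsigned long offset)
  {
          /* subpage ebs fit in one page, so this is always 0 there */
          return offset >> PAGE_SHIFT;
  }

  static unsigned long get_eb_offset_in_page(unsigned long eb_start,
                                             unsigned long offset)
  {
          /* for sectorsize == PAGE_SIZE, eb_start is page aligned and
           * adding it changes nothing; for subpage it shifts us to the
           * tree block's real position inside the page */
          return (eb_start + offset) & (PAGE_SIZE - 1);
  }

  int main(void)
  {
          unsigned long eb_start = 0x3C000;  /* 16K block at 48K in page */

          printf("index=%lu off=0x%lx\n", get_eb_page_index(8),
                 get_eb_offset_in_page(eb_start, 8));   /* 0, 0xc008 */
          return 0;
  }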
This patch will touch the following parts to cover all extent buffer
accessors:
- BTRFS_SETGET_HEADER_FUNCS()
- read_extent_buffer()
- read_extent_buffer_to_user()
- memcmp_extent_buffer()
- write_extent_buffer_chunk_tree_uuid()
- write_extent_buffer_fsid()
- write_extent_buffer()
- memzero_extent_buffer()
- copy_extent_buffer_full()
- copy_extent_buffer()
- memcpy_extent_buffer()
- memmove_extent_buffer()
- btrfs_get_token_##bits()
- btrfs_get_##bits()
- btrfs_set_token_##bits()
- btrfs_set_##bits()
- generic_bin_search()
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For subpage sized extent buffers, we have ensured no extent buffer will
cross a page boundary, thus we only need one page for any extent buffer.
Update num_extent_pages() to handle this case. Now num_extent_pages()
returns 1 for subpage sized extent buffers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As a preparation for subpage sector size support (allowing filesystems
with a sector size smaller than the page size to be mounted), if the
sector size is smaller than the page size, we don't allow a tree block
to be read if it crosses a 64K(*) boundary.
The 64K is selected because:
- we are only going to support 64K page size for subpage for now
- 64K is also the maximum supported node size
This ensures that tree blocks are always contained in one page for a
system with 64K page size, which can greatly simplify the handling.
Otherwise we would have to do complex multi-page handling of tree
blocks. Currently there is no way to create such tree blocks.
In the kernel we have avoided allocating such tree blocks even on 4K
page systems, as it can cause problems for RAID56 stripe scrubbing.
btrfs-progs has had its chunk allocator fixed for convert since 2016,
and has extra checks to enforce the same behavior as the kernel.
Just add such graceful checks in case of an ancient filesystem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs only supports 64K as the maximum node size, thus for a 4K page
system, we would have at most 16 pages for one extent buffer.
For a system using 64K page size, we would really have just one page.
But since we always use 16 pages for extent_buffer::pages, this means
that for systems using 64K pages, we are wasting memory on 15 page
pointers which will never be used.
Calculate the array size based on page size and the node size maximum.
- for systems using 4K page size, it will stay 16 pages
- for systems using 64K page size, it will be 1 page
Move the definition of BTRFS_MAX_METADATA_BLOCKSIZE to btrfs_tree.h, to
avoid circular inclusion of ctree.h.
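A sketch of the calculation (the maximum blocksize macro is named in
this patch; the helper here is illustrative, not taken from the code):

  #include <stdio.h>

  #define BTRFS_MAX_METADATA_BLOCKSIZE 65536

  /* the array length is derived from the page size */
  static unsigned int eb_pages(unsigned long page_size)
  {
          return BTRFS_MAX_METADATA_BLOCKSIZE / page_size;
  }

  int main(void)
  {
          printf("4K pages: %u entries\n", eb_pages(4096));    /* 16 */
          printf("64K pages: %u entries\n", eb_pages(65536));  /* 1 */
          return 0;
  }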
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btree_write_cache_pages() we have a btree page submission routine
buried deeply in a nested loop.
This patch will extract that part of code into a helper function,
submit_eb_page(), to do the same work.
Since submit_eb_page() now can return >0 for successful extent
buffer submission, remove the "ASSERT(ret <= 0);" line.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_verify_data_csum() just passes the whole page to
check_data_csum(), which is fine since we only support sectorsize ==
PAGE_SIZE.
To support subpage, we need to properly honor per-sector
checksum verification, just like what we did in the dio read path.
This patch does the csum verification in a for loop, starting at
pg_off == start - page_offset(page) and advancing by sectorsize on each
iteration.
For the sectorsize == PAGE_SIZE case, pg_off will always be 0, and we
will only loop once.
For the subpage case, we iterate over each sector, and if we find any
error, we return an error.
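A simplified sketch of the loop shape (the checksum call is a stand-in;
names follow the description above, not the exact btrfs code):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define PAGE_SIZE  65536UL
  #define SECTORSIZE 4096UL

  /* stand-in for check_data_csum(); 0 means the sector is good */
  static int check_sector(const uint8_t *page, unsigned long pg_off)
  {
          return page[pg_off] == 0xab ? 0 : -1;
  }

  static int verify_page_range(const uint8_t *page, unsigned long start,
                               unsigned long page_offset, unsigned long len)
  {
          for (unsigned long pg_off = start - page_offset;
               pg_off < start - page_offset + len; pg_off += SECTORSIZE) {
                  if (check_sector(page, pg_off))
                          return -1;   /* fail on the first bad sector */
          }
          return 0;
  }

  int main(void)
  {
          static uint8_t page[PAGE_SIZE];

          memset(page, 0xab, sizeof(page));
          printf("%d\n", verify_page_range(page, 0x20000 + 0x1000,
                                           0x20000, 3 * SECTORSIZE));
          return 0;
  }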
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter icsum of check_data_csum() is a little hard to understand,
and so is the phy_offset of btrfs_verify_data_csum().
Both parameters are calculated values for the csum lookup.
Instead of some calculated value, just pass bio_offset and let the
final and only user, check_data_csum(), calculate whatever it needs.
While we are here, also make the bio_offset parameter and some related
variables u32 (unsigned int). Bio size is limited by bi_size, which is
an unsigned int and has extra size limit checks during various bio
operations, so we are ensured that bio_offset won't overflow u32.
Thus for all involved functions, not only rename the parameter from
@phy_offset to @bio_offset, but also reduce its width to u32, so we
won't have suspicious "u32 = u64 >> sector_bits;" lines anymore.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter bio_offset of extent_submit_bio_start_t is very confusing.
If it were really a bio_offset (offset into the bio), it should be u32.
But in fact, it's only utilized by dio read, and that member is used as
a file offset, which must be u64.
Rename it to dio_file_offset since the only user uses it as file offset,
and add comment for who is using it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A lock dependency loop exists between the root tree lock, the extent tree
lock, and the free space tree lock.
The root tree lock depends on the free space tree lock because
btrfs_create_tree holds the new tree's lock while adding it to the root
tree.
The extent tree lock depends on the root tree lock because during
umount, we write out space cache v1, which writes inodes in the root
tree, which results in holding the root tree lock while doing a lookup
in the extent tree.
Finally, the free space tree depends on the extent tree because
populate_free_space_tree holds a locked path in the extent tree and then
does a lookup in the free space tree to add the new item.
The simplest of the three to break is the one during tree creation: we
unlock the leaf before inserting the tree node into the root tree, which
fixes the lockdep warning.
[30.480136] ======================================================
[30.480830] WARNING: possible circular locking dependency detected
[30.481457] 5.9.0-rc8+ #76 Not tainted
[30.481897] ------------------------------------------------------
[30.482500] mount/520 is trying to acquire lock:
[30.483064] ffff9babebe03908 (btrfs-free-space-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[30.484054]
but task is already holding lock:
[30.484637] ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[30.485581]
which lock already depends on the new lock.
[30.486397]
the existing dependency chain (in reverse order) is:
[30.487205]
-> #2 (btrfs-extent-01#2){++++}-{3:3}:
[30.487825] down_read_nested+0x43/0x150
[30.488306] __btrfs_tree_read_lock+0x39/0x180
[30.488868] __btrfs_read_lock_root_node+0x3a/0x50
[30.489477] btrfs_search_slot+0x464/0x9b0
[30.490009] check_committed_ref+0x59/0x1d0
[30.490603] btrfs_cross_ref_exist+0x65/0xb0
[30.491108] run_delalloc_nocow+0x405/0x930
[30.491651] btrfs_run_delalloc_range+0x60/0x6b0
[30.492203] writepage_delalloc+0xd4/0x150
[30.492688] __extent_writepage+0x18d/0x3a0
[30.493199] extent_write_cache_pages+0x2af/0x450
[30.493743] extent_writepages+0x34/0x70
[30.494231] do_writepages+0x31/0xd0
[30.494642] __filemap_fdatawrite_range+0xad/0xe0
[30.495194] btrfs_fdatawrite_range+0x1b/0x50
[30.495677] __btrfs_write_out_cache+0x40d/0x460
[30.496227] btrfs_write_out_cache+0x8b/0x110
[30.496716] btrfs_start_dirty_block_groups+0x211/0x4e0
[30.497317] btrfs_commit_transaction+0xc0/0xba0
[30.497861] sync_filesystem+0x71/0x90
[30.498303] btrfs_remount+0x81/0x433
[30.498767] reconfigure_super+0x9f/0x210
[30.499261] path_mount+0x9d1/0xa30
[30.499722] do_mount+0x55/0x70
[30.500158] __x64_sys_mount+0xc4/0xe0
[30.500616] do_syscall_64+0x33/0x40
[30.501091] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.501629]
-> #1 (btrfs-root-00){++++}-{3:3}:
[30.502241] down_read_nested+0x43/0x150
[30.502727] __btrfs_tree_read_lock+0x39/0x180
[30.503291] __btrfs_read_lock_root_node+0x3a/0x50
[30.503903] btrfs_search_slot+0x464/0x9b0
[30.504405] btrfs_insert_empty_items+0x60/0xa0
[30.504973] btrfs_insert_item+0x60/0xd0
[30.505412] btrfs_create_tree+0x1b6/0x210
[30.505913] btrfs_create_free_space_tree+0x54/0x110
[30.506460] btrfs_mount_rw+0x15d/0x20f
[30.506937] btrfs_remount+0x356/0x433
[30.507369] reconfigure_super+0x9f/0x210
[30.507868] path_mount+0x9d1/0xa30
[30.508264] do_mount+0x55/0x70
[30.508668] __x64_sys_mount+0xc4/0xe0
[30.509186] do_syscall_64+0x33/0x40
[30.509652] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.510271]
-> #0 (btrfs-free-space-00){++++}-{3:3}:
[30.510972] __lock_acquire+0x11ad/0x1b60
[30.511432] lock_acquire+0xa2/0x360
[30.511917] down_read_nested+0x43/0x150
[30.512383] __btrfs_tree_read_lock+0x39/0x180
[30.512947] __btrfs_read_lock_root_node+0x3a/0x50
[30.513455] btrfs_search_slot+0x464/0x9b0
[30.513947] search_free_space_info+0x45/0x90
[30.514465] __add_to_free_space_tree+0x92/0x39d
[30.515010] btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
[30.515639] btrfs_mount_rw+0x15d/0x20f
[30.516142] btrfs_remount+0x356/0x433
[30.516538] reconfigure_super+0x9f/0x210
[30.517065] path_mount+0x9d1/0xa30
[30.517438] do_mount+0x55/0x70
[30.517824] __x64_sys_mount+0xc4/0xe0
[30.518293] do_syscall_64+0x33/0x40
[30.518776] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.519335]
other info that might help us debug this:
[30.520210] Chain exists of:
btrfs-free-space-00 --> btrfs-root-00 --> btrfs-extent-01#2
[30.521407] Possible unsafe locking scenario:
[30.522037] CPU0 CPU1
[30.522456] ---- ----
[30.522941] lock(btrfs-extent-01#2);
[30.523311] lock(btrfs-root-00);
[30.523952] lock(btrfs-extent-01#2);
[30.524620] lock(btrfs-free-space-00);
[30.525068]
*** DEADLOCK ***
[30.525669] 5 locks held by mount/520:
[30.526116] #0: ffff9babebc520e0 (&type->s_umount_key#37){+.+.}-{3:3}, at: path_mount+0x7ef/0xa30
[30.527056] #1: ffff9babebc52640 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x3d5/0x5c0
[30.527960] #2: ffff9babeae8f2e8 (&cache->free_space_lock#2){+.+.}-{3:3}, at: btrfs_create_free_space_tree.cold.22+0x101/0x45d
[30.529118] #3: ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[30.530113] #4: ffff9babebd52eb8 (btrfs-extent-00){++++}-{3:3}, at: btrfs_try_tree_read_lock+0x16/0x100
[30.531124]
stack backtrace:
[30.531528] CPU: 0 PID: 520 Comm: mount Not tainted 5.9.0-rc8+ #76
[30.532166] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module_el8.1.0+248+298dec18 04/01/2014
[30.533215] Call Trace:
[30.533452] dump_stack+0x8d/0xc0
[30.533797] check_noncircular+0x13c/0x150
[30.534233] __lock_acquire+0x11ad/0x1b60
[30.534667] lock_acquire+0xa2/0x360
[30.535063] ? __btrfs_tree_read_lock+0x39/0x180
[30.535525] down_read_nested+0x43/0x150
[30.535939] ? __btrfs_tree_read_lock+0x39/0x180
[30.536400] __btrfs_tree_read_lock+0x39/0x180
[30.536862] __btrfs_read_lock_root_node+0x3a/0x50
[30.537304] btrfs_search_slot+0x464/0x9b0
[30.537713] ? trace_hardirqs_on+0x1c/0xf0
[30.538148] search_free_space_info+0x45/0x90
[30.538572] __add_to_free_space_tree+0x92/0x39d
[30.539071] ? printk+0x48/0x4a
[30.539367] btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
[30.539972] btrfs_mount_rw+0x15d/0x20f
[30.540350] btrfs_remount+0x356/0x433
[30.540773] ? shrink_dcache_sb+0xd9/0x100
[30.541203] reconfigure_super+0x9f/0x210
[30.541642] path_mount+0x9d1/0xa30
[30.542040] do_mount+0x55/0x70
[30.542366] __x64_sys_mount+0xc4/0xe0
[30.542822] do_syscall_64+0x33/0x40
[30.543197] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.543691] RIP: 0033:0x7f109f7ab93a
[30.546042] RSP: 002b:00007ffc47c4f858 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[30.546770] RAX: ffffffffffffffda RBX: 00007f109f8cf264 RCX: 00007f109f7ab93a
[30.547485] RDX: 0000557e6fc10770 RSI: 0000557e6fc19cf0 RDI: 0000557e6fc19cd0
[30.548185] RBP: 0000557e6fc10520 R08: 0000557e6fc18e30 R09: 0000557e6fc18cb0
[30.548911] R10: 0000000000200020 R11: 0000000000000246 R12: 0000000000000000
[30.549606] R13: 0000557e6fc19cd0 R14: 0000557e6fc10770 R15: 0000557e6fc10520
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
If we are not using space cache v1, we should not create the free space
object or free space inodes. This comes up when we delete the existing
free space objects/inodes when migrating to v2, only to see them get
recreated for every dirtied block group.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the filesystem transitions from space cache v1 to v2 or to
nospace_cache, it removes the old cached data, but does not remove
the FREE_SPACE items nor the free space inodes they point to. This
doesn't cause any issues besides being a bit inefficient, since these
items no longer do anything useful.
To fix it, when we are mounting, and plan to disable the space cache,
destroy each block group's free space item and free space inode.
The code to remove the items is lifted from the existing use case of
removing the block group, with a light adaptation to handle whether or
not we have already looked up the free space inode.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
If the remount is ro->ro, rw->ro, or rw->rw, we will not create or
clear the free space tree. This can be surprising, so print a warning
to dmesg to make the failure more visible. It is also important to
ensure that the space cache options (SPACE_CACHE, FREE_SPACE_TREE) are
consistent, so ensure those are set to properly match the current on
disk state (which won't be changing).
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To make the contents of /proc/mounts better match the actual state of
the filesystem, base the display of the space cache mount options off
the contents of the super block rather than the last mount options
passed in. Since there are many scenarios where the mount will ignore a
space cache option, simply showing the passed in option is misleading.
For example, if we mount with -o remount,space_cache=v2 on a read-write
file system without an existing free space tree, we won't build a free
space tree, but /proc/mounts will read space_cache=v2 (until we mount
again and it goes away).
cache_generation is set iff space_cache=v1, FREE_SPACE_TREE is set iff
space_cache=v2, and if neither is the case, we print nospace_cache.
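A minimal sketch of that decision (simplified; not the exact
show_options code):

  #include <stdbool.h>
  #include <stdio.h>

  static const char *space_cache_opt(unsigned long long cache_generation,
                                     bool free_space_tree)
  {
          /* FREE_SPACE_TREE wins, then v1's cache_generation */
          if (free_space_tree)
                  return "space_cache=v2";
          if (cache_generation > 0)
                  return "space_cache";
          return "nospace_cache";
  }

  int main(void)
  {
          printf("%s\n", space_cache_opt(0, true));    /* space_cache=v2 */
          printf("%s\n", space_cache_opt(123, false)); /* space_cache */
          printf("%s\n", space_cache_opt(0, false));   /* nospace_cache */
          return 0;
  }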
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When mounting, btrfs uses the cache_generation in the super block to
determine if space cache v1 is in use. However, by mounting with
nospace_cache or space_cache=v2, it is possible to disable space cache
v1, which does not result in un-setting cache_generation back to 0.
In order to base some logic, like mount option printing in /proc/mounts,
on the current state of the space cache rather than just the values of
the mount option, keep the value of cache_generation consistent with the
status of space cache v1.
We ensure that cache_generation > 0 iff the file system is using
space_cache v1. This requires committing a transaction on any mount
which changes whether we are using v1 (v1->nospace_cache, v1->v2,
nospace_cache->v1, v2->v1).
Since the mechanism for writing out the cache generation is transaction
commit, but we want some finer grained control over when we un-set it,
we can't just rely on the SPACE_CACHE mount option, and introduce an
fs_info flag that mount can use when it wants to unset the generation.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
A user might want to revert to v1 or nospace_cache on a root filesystem,
and much like turning on the free space tree, that can only be done
remounting from ro->rw. Support clearing the free space tree on such
mounts by moving it into the shared remount logic.
Since the CLEAR_CACHE option sticks around across remounts, this change
would result in clearing the tree on every remount, forever, which is
not desirable. To fix that, add CLEAR_CACHE to the oneshot options we
clear at the end of mount, which has the other bonus of not cluttering
the /proc/mounts output with clear_cache.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some options only apply during mount time and are cleared at the end
of mount. For now, the example is USEBACKUPROOT, but CLEAR_CACHE also
fits the bill, and this is a preparation patch for also clearing that
option.
One subtlety is that the current code only resets USEBACKUPROOT on rw
mounts, but the option is meaningfully "consumed" by a ro mount, so it
feels appropriate to clear in that case as well. A subsequent read-write
remount would not go through open_ctree, which is the only place that
checks the option, so the change should be benign.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a user attempts to remount a btrfs filesystem with
'mount -o remount,space_cache=v2', that operation silently succeeds.
Unfortunately, this is misleading, because the remount does not create
the free space tree. /proc/mounts will incorrectly show space_cache=v2,
but on the next mount, the file system will revert to the old
space_cache.
For now, we handle only the easier case, where the existing mount is
read-only and the new mount is read-write. In that case, we can create
the free space tree without contending with the block groups changing
as we go.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we attempt to create a free space tree while any block groups have
needs_free_space set, we will double add the new free space item
and hit EEXIST. Previously, we only created the free space tree on a new
mount, so we never hit the case, but if we try to create it on a
remount, such block groups could exist and trip us up.
We don't do anything with this field unless the free space tree is
enabled, so there is no harm in not setting it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
When we mount a rw filesystem, we start the orphan cleanup process in
tree root and filesystem tree. However, when we remount a ro file system
rw, we only clean the former. Move the calls to btrfs_orphan_cleanup()
on tree_root and fs_root to the shared rw mount routine to effectively
add them on ro->rw remount.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mounting rw and remounting from ro to rw naturally share invariants and
functionality which result in a correctly setup rw filesystem. Luckily,
there is even a strong unity in the code which implements them. In
mount's open_ctree, these operations mostly happen after an early return
for ro file systems, and in remount, they happen in a section devoted to
remounting ro->rw, after some remount specific validation passes.
However, there are unfortunately a few differences. There are small
deviations in the order of some of the operations, remount does not
start orphan cleanup in root_tree or fs_tree, remount does not create
the free space tree, and remount does not handle "one-shot" mount
options like clear_cache and uuid tree rescan.
Since we want to add building the free space tree to remount, and also
to start the same orphan cleanup process on a filesystem mounted as ro
then remounted rw, we would benefit from unifying the logic between the
two code paths.
This patch only lifts the existing common functionality, and leaves a
natural path for fixing the discrepancies.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Early on during a transaction commit we acquire the tree_log_mutex and
hold it until after we write the super blocks. But before writing the
extent buffers dirtied by the transaction and the super blocks we unblock
the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
fs_info->running_transaction to NULL.
This means that after that and before writing the super blocks, new
transactions can start. However if any transaction wants to log an inode,
it will block waiting for the transaction commit to write its dirty
extent buffers and the super blocks because the tree_log_mutex is only
released after those operations are complete, and starting a new log
transaction blocks on that mutex (at start_log_trans()).
Writing the dirty extent buffers and the super blocks can take a very
significant amount of time to complete, but we could allow the tasks
wanting to log an inode to proceed with most of their steps:
1) create the log trees
2) log metadata in the trees
3) write their dirty extent buffers
They only need to wait for the previous transaction commit to complete
(write its super blocks) before they attempt to write their super blocks,
otherwise we could end up with a corrupt filesystem after a crash.
So change start_log_trans() to use the root tree's log_mutex to serialize
for the creation of the log root tree instead of using the tree_log_mutex,
and make btrfs_sync_log() acquire the tree_log_mutex before writing the
super blocks. This allows for inode logging to wait much less time when
there is a previous transaction that is still committing, often not having
to wait at all, as by the time when we try to sync the log the previous
transaction already wrote its super blocks.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
The following script that uses dbench was used to measure the impact of
the whole patchset:
$ cat test-dbench.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 300 64
umount $MNT
The test was run on a machine with 12 cores, 64G of ram, using a NVMe
device and a non-debug kernel configuration (Debian's default).
Before patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 11277211 0.250 85.340
Close 8283172 0.002 6.479
Rename 477515 1.935 86.026
Unlink 2277936 0.770 87.071
Deltree 256 15.732 81.379
Mkdir 128 0.003 0.009
Qpathinfo 10221180 0.056 44.404
Qfileinfo 1789967 0.002 4.066
Qfsinfo 1874399 0.003 9.176
Sfileinfo 918589 0.061 10.247
Find 3951758 0.341 54.040
WriteX 5616547 0.047 85.079
ReadX 17676028 0.005 9.704
LockX 36704 0.003 1.800
UnlockX 36704 0.002 0.687
Flush 790541 14.115 676.236
Throughput 1179.19 MB/sec 64 clients 64 procs max_latency=676.240 ms
After patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 12687926 0.171 86.526
Close 9320780 0.002 8.063
Rename 537253 1.444 78.576
Unlink 2561827 0.559 87.228
Deltree 374 11.499 73.549
Mkdir 187 0.003 0.005
Qpathinfo 11500300 0.061 36.801
Qfileinfo 2017118 0.002 7.189
Qfsinfo 2108641 0.003 4.825
Sfileinfo 1033574 0.008 8.065
Find 4446553 0.408 47.835
WriteX 6335667 0.045 84.388
ReadX 19887312 0.003 9.215
LockX 41312 0.003 1.394
UnlockX 41312 0.002 1.425
Flush 889233 13.014 623.259
Throughput 1339.32 MB/sec 64 clients 64 procs max_latency=623.265 ms
+12.7% throughput, -8.2% max latency
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode we may often have to fallback to a full transaction
commit, either because a new block group was allocated, there is some case
we can not deal with without a transaction commit or some error like an
ENOMEM happened. However, after we fallback to a transaction commit, we
have a time window where the next attempt to log any inode can
unnecessarily end up committing the next transaction too, adding
additional overhead and increasing latency.
A sequence of steps that leads to this issue is the following:
1) The current open transaction has a generation of 1000;
2) A new block group is allocated, and as a consequence we must make sure
any attempts to commit a log fallback to a transaction commit, so
btrfs_set_log_full_commit() is called from btrfs_make_block_group().
This sets fs_info->last_trans_log_full_commit to 1000;
3) Task A is holding a handle on transaction 1000 and tries to log inode X.
Once it gets to start_log_trans(), it calls btrfs_need_log_full_commit()
which returns true, since fs_info->last_trans_log_full_commit has a
value of 1000. So we end up returning EAGAIN and propagating it up to
btrfs_sync_file(), where we commit transaction 1000;
4) The transaction commit task (task A) sets the transaction state to
unblocked (TRANS_STATE_UNBLOCKED);
5) Some other task, task B, starts a new transaction with a generation of
1001;
6) Some stuff is done with transaction 1001, some btree blocks COWed, etc;
7) Transaction 1000 has not fully committed yet, we are still writing all
the extent buffers it created;
8) Some new task, task C, starts an fsync of inode Y, gets a handle for
transaction 1001, and it gets to btrfs_log_inode_parent() which does
the following check:
  if (fs_info->last_trans_log_full_commit > last_committed) {
          ret = 1;
          goto end_no_trans;
  }
At that point last_trans_log_full_commit has a value of 1000 and
last_committed (value of fs_info->last_trans_committed) has a value of
999, since transaction 1000 has not yet committed - it is either still
writing out dirty extent buffers, its super blocks or unpinning
extents.
As a consequence we return 1, which gets propagated up to
btrfs_sync_file(), which will then call btrfs_commit_transaction()
for transaction 1001.
As a consequence we have an unnecessary second transaction commit, we
previously committed transaction 1000 and now commit transaction 1001
as well, resulting in more overhead and increased latency.
So fix this double transaction commit issue simply by removing that check,
because all we need to do is wait for the previous transaction to finish
its commit, which we already do later when starting the log transaction at
start_log_trans(), because there we acquire the tree_log_mutex lock, which
is held by a transaction commit and only released after the transaction
commits its super blocks.
Another issue that check has is that it reads last_trans_log_full_commit
without using READ_ONCE(), which is incorrect since that member of
struct btrfs_fs_info is always updated with WRITE_ONCE() through the
helper btrfs_set_log_full_commit().
This double transaction commit issue can actually be triggered quite often
in long runs of dbench, since besides the creation of new block groups
that force inode logging to fallback to a transaction commit, there are
cases where dbench asks to fsync a directory which had files in it that
were previously renamed or subdirectories that were removed, resulting in
the inode logging to fallback to a full transaction commit.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode and the previous transaction is still committing,
we have a time window where we can end up incorrectly thinking an inode
has its last_unlink_trans field with a value greater than the last
transaction committed, which results in the logging falling back to a
full transaction commit, which is usually much more expensive than doing
a log commit.
The race is described by the following steps:
1) We are at transaction 1000;
2) We modify an inode X (a directory) using transaction 1000 and set its
last_unlink_trans field to 1000, because for example we removed one
of its subdirectories;
3) We create a new inode Y with a dentry in inode X using transaction 1000,
so its generation field is set to 1000;
4) The commit for transaction 1000 is started by task A;
5) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
6) Some task starts a new transaction with a generation of 1001;
7) We do some modification to inode Y (using transaction 1001);
8) The transaction 1000 commit starts unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
9) Task B starts an fsync on inode Y, and gets a handle for transaction
1001. When it gets to check_parent_dirs_for_sync() it does the checking
of the ancestor dentries because the following check does not evaluate
to true:
  if (S_ISREG(inode->vfs_inode.i_mode) &&
      inode->generation <= last_committed &&
      inode->last_unlink_trans <= last_committed)
          goto out;
The generation value for inode Y is 1000 and last_committed, which has
the value read from fs_info->last_trans_committed, has a value of 999,
so that check evaluates to false and we proceed to check the ancestor
inodes.
Once we get to the first ancestor, inode X, we call
btrfs_must_commit_transaction() on it, which evaluates to true:
  static bool btrfs_must_commit_transaction(...)
  {
          struct btrfs_fs_info *fs_info = inode->root->fs_info;
          bool ret = false;

          mutex_lock(&inode->log_mutex);
          if (inode->last_unlink_trans > fs_info->last_trans_committed) {
                  /*
                   * Make sure any commits to the log are forced to be
                   * full commits.
                   */
                  btrfs_set_log_full_commit(trans);
                  ret = true;
          }
          (...)
because inode X's last_unlink_trans has a value of 1000 and
fs_info->last_trans_committed still has a value of 999, it returns
true to check_parent_dirs_for_sync(), making it return 1, which is
propagated up to btrfs_sync_file(), causing it to fallback to a full
transaction commit of transaction 1001.
We should have not fallen back to commit transaction 1001, since inode
X had last_unlink_trans set to 1000 and the super blocks for
transaction 1000 were already written. So while not resulting in a
functional problem, it leads to a lot more work and higher latencies
for a fsync since committing a transaction is usually more expensive
than committing a log (if other filesystem changes happened under that
transaction).
Similar problem happens when logging directories, for the same reason as
btrfs_must_commit_transaction() returns true on an inode with its
last_unlink_trans having the generation of the previous transaction and
that transaction is still committing, unpinning its freed extents.
So fix this by comparing last_unlink_trans with the id of the current
transaction instead of fs_info->last_trans_committed.
This case is often hit when running dbench for a long enough duration, as
it does lots of rename and rmdir operations (both update the field
last_unlink_trans of an inode) and fsyncs of files and directories.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode and we are checking if we need to log ancestors that
are new, if the previous transaction is still committing we have a time
window where we can unnecessarily log ancestor inodes that were created in
the previous transaction.
The race is described by the following steps:
1) We are at transaction 1000;
2) Directory inode X is created, its generation is set to 1000;
3) The commit for transaction 1000 is started by task A;
4) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
5) Inode Y, a regular file, is created under directory inode X, this
results in starting a new transaction with a generation of 1001;
6) The transaction 1000 commit is unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
7) Task B calls fsync on inode Y and gets a handle for transaction 1001;
8) Task B ends up at log_all_new_ancestors() and then because inode Y has
only one hard link, ends up at log_new_ancestors_fast(). There it reads
a value of 999 from fs_info->last_trans_committed, and sees that the
parent inode X has a generation of 1000, so we end up logging inode X:
  if (inode->generation > fs_info->last_trans_committed) {
          ret = btrfs_log_inode(trans, root, inode,
                                LOG_INODE_EXISTS, ctx);
          (...)
which is not necessary since it was created in the past transaction,
with a generation of 1000, and that transaction has already committed
its super blocks - it's still unpinning extents so it has not yet
updated fs_info->last_trans_committed from 999 to 1000.
So this just causes us to spend more time logging and allocating and
writing more tree blocks for the log tree.
So fix this by comparing an inode's generation with the generation of
the transaction our transaction handle refers to - if the inode's
generation matches the generation of the current transaction, then we
know it is a new inode we need to log; otherwise don't log it.
This case is often hit when running dbench for a long enough duration.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging the extents of an inode during a fast fsync, we have a time
window where we can log extents that are from the previous transaction and
already persisted. This only makes us waste time unnecessarily.
The following sequence of steps shows how this can happen:
1) We are at transaction 1000;
2) An ordered extent E from inode I completes, that is, it has gone
through btrfs_finish_ordered_io(), and it set the extent map's
generation to 1000 when we unpin the extent, which is the generation
of the current transaction;
3) The commit for transaction 1000 is started by task A;
4) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
5) Some change is made to inode I, resulting in creation of a new
transaction with a generation of 1001;
6) The transaction 1000 commit starts unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
7) Task B starts an fsync on inode I, and when it gets to
btrfs_log_changed_extents() sees the extent map for extent E in the
list of modified extents. It sees the extent map has a generation of
1000 and fs_info->last_trans_committed has a value of 999, so it
proceeds to logging the respective file extent item and all the
checksums covering its range.
So we end up wasting time since the extent was already persisted and
is reachable through the trees pointed to by the super block committed
by transaction 1000.
So just fix this by comparing the extent map's generation against the
generation of the transaction handle - if it is smaller than the id in
the handle, we know the extent was already persisted and we do not need
to log it.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we are doing a rename or a link operation for an inode that was logged
in the previous transaction and that transaction is still committing, we
have a time window where we incorrectly consider that the inode was logged
previously in the current transaction and therefore decide to log it to
update it in the log. The following steps give an example on how this
happens during a link operation:
1) Inode X is logged in transaction 1000, so its logged_trans field is set
to 1000;
2) Task A starts to commit transaction 1000;
3) The state of transaction 1000 is changed to TRANS_STATE_UNBLOCKED;
4) Task B starts a link operation for inode X, and as a consequence it
starts transaction 1001;
5) Task A is still committing transaction 1000, therefore the value stored
at fs_info->last_trans_committed is still 999;
6) Task B calls btrfs_log_new_name(), it reads a value of 999 from
fs_info->last_trans_committed and because the logged_trans field of
inode X has a value of 1000, the function does not return immediately,
instead it proceeds to logging the inode, which should not happen
because the inode was logged in the previous transaction (1000) and
not in the current one (1001).
This is not a functional problem, just wasted time and space logging an
inode that does not need to be logged, contributing to higher latency
for link and rename operations.
So fix this by comparing the inode's logged_trans field with the
generation of the current transaction instead of comparing with the
value stored in fs_info->last_trans_committed.
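As an illustrative sketch (simplified, not the exact btrfs_log_new_name()
code), the new comparison looks roughly like this:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* skip logging when the inode was last logged in an older
   * transaction than the current one (transids only ever grow) */
  static bool needs_new_name_logging(uint64_t logged_trans,
                                     uint64_t current_transid)
  {
          return logged_trans >= current_transid;
  }

  int main(void)
  {
          /* inode logged in transaction 1000, link happens in 1001 */
          printf("%d\n", needs_new_name_logging(1000, 1001)); /* 0: skip */
          printf("%d\n", needs_new_name_logging(1001, 1001)); /* 1: log */
          return 0;
  }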
This case is often hit when running dbench for a long enough duration, as
it does lots of rename operations.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After removing the inode number cache that was using the free space
cache code, we can remove at least the recalc_thresholds callback from
the ops. Both code and tests use the same callback function. It's moved
before its first use.
The use_bitmaps callback is still needed by tests to create some
extents/bitmap setup.
Signed-off-by: David Sterba <dsterba@suse.com>
Since it's being used solely for the freespace cache, unconditionally
set the flags required for it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Following removal of the ino cache, io_ctl_init will be called only on
behalf of the freespace inode. In this case we always want to check
CRCs, so the conditional code that depended on io_ctl::check_crc can be
removed.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's been deprecated since commit b547a88ea5 ("btrfs: start
deprecation of mount option inode_cache"), which enumerates the reasons.
A filesystem that uses the feature (mount -o inode_cache) tracks the
inode numbers in bitmaps; that data stays on the filesystem after this
patch. The size is roughly 5MiB for 1M inodes [1], which is considered
small enough to be left there. Removal of the leftover data can be
implemented in btrfs-progs if needed.
[1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The former is going away as part of the inode map removal so switch
callers to btrfs_find_free_objectid. No functional changes since with
INODE_MAP disabled (default) find_free_objectid was called anyway.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Those functions are going to be used even after the inode cache is
removed, so move them to a more appropriate place.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 72deb455b5 ("block: remove CONFIG_LBDAF") (5.2) the
sector_t type is u64 on all arches and configs, so we don't need to
typecast it. It used to be unsigned long, and the result of sector size
shifts was not guaranteed to fit in the type.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting the superblock and its copies to be placed
only in conventional zones. However, this method has two downsides. One
is a reduced number of superblock copies: the location of the second
copy is 256GB, which is in a sequential write required zone on typical
devices on the market today, so the number of superblock copies would be
limited to two. The second downside is that we could not support devices
which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zone 1024 or the zone at 256GB, whichever is smaller,
and the zone next to it
If these reserved zones are conventional, the superblock is written at a
fixed location at the start of the zone, without logging.
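A rough sketch of locating the latest superblock in the two-zone log
(zone_is_full, zone_is_empty, zone_wp and the last_sb_* helpers are
hypothetical names for the steps described above):
static u64 sb_log_location(struct blk_zone *z0, struct blk_zone *z1)
{
	/* z0 was just reset, z1 still holds the newest superblock */
	if (zone_is_empty(z0) && zone_is_full(z1))
		return last_sb_offset(z1);
	/* A partially filled zone: the newest sb sits just below its wp */
	if (!zone_is_full(z0))
		return zone_wp(z0) - BTRFS_SUPER_INFO_SIZE;
	if (!zone_is_full(z1))
		return zone_wp(z1) - BTRFS_SUPER_INFO_SIZE;
	/* Both full: compare the last sb of each zone, newest wins */
	return last_sb_generation(z0) > last_sb_generation(z1) ?
		last_sb_offset(z0) : last_sb_offset(z1);
}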
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Placing both data and metadata in a block group is impossible in ZONED
mode. For data, we can allocate a space for it and write it immediately
after the allocation. For metadata, however, we cannot do that, because
the logical addresses are recorded in other metadata buffers to build up
the trees. As a result, a data buffer can be placed after a metadata
buffer that is not yet written, and writing out the data buffer would
break the sequential write rule.
Check and disallow MIXED_BG with ZONED mode.
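The check itself amounts to something like this sketch
(btrfs_fs_incompat() is the real feature test; the message text is
illustrative):
	if (btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
		btrfs_err(fs_info,
			  "zoned: mixed block groups are not supported");
		return -EINVAL;
	}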
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fallocate() is implemented by reserving an actual extent instead of a
reservation. This can expose the sequential write constraint of
host-managed zoned block devices to the application, which would break
the POSIX semantics for the fallocated file. To avoid this, report
fallocate() as not supported when in ZONED mode for now.
In the future, we may be able to implement an "in-memory" fallocate() in
ZONED mode by utilizing space_info->bytes_may_use or similar; until
then this returns EOPNOTSUPP.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
NODATACOW implies overwriting the file data on a device, which is
impossible in sequential required zones. Disable NODATACOW globally for
the mount, and disable the per-file NODATACOW attribute by masking out
FS_NOCOW_FL.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As updates to the space cache v1 are done in place, the space cache
cannot be located over sequential zones, and there is no guarantee that
the device will have enough conventional zones to store this cache.
Resolve this problem by completely disabling space cache v1. This does
not introduce any problems with sequential block groups: all the free
space is located after the allocation pointer and there is no free
space before the pointer, so there is no need for such a cache.
Note: we could technically use the free-space-tree (space cache v2) in
ZONED mode. But since ZONED mode now always allocates extents in a
block group sequentially, regardless of the underlying device zone
type, there is no benefit in enabling and maintaining the tree.
For the same reason, NODATACOW is also disabled.
In summary, ZONED will disable:
| Disabled features | Reason |
|-------------------+-----------------------------------------------------|
| RAID/DUP | Cannot handle two zone append writes to different |
| | zones |
|-------------------+-----------------------------------------------------|
| space_cache (v1) | In-place updating |
| NODATACOW | In-place updating |
|-------------------+-----------------------------------------------------|
| fallocate | Reserved extent will be a write hole |
|-------------------+-----------------------------------------------------|
| MIXED_BG | Allocated metadata region will be write holes for |
| | data writes |
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The zone append write command has a maximum IO size restriction it
accepts. This is because a zone append write command cannot be split, as
we ask the device to place the data into a specific target zone and the
device responds with the actual written location of the data.
Introduce max_zone_append_size to zone_info and fs_info to track the
value, so we can limit all I/O to a zoned block device that we want to
write using the zone append command to the device's limits.
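As a sketch, the per-device limit can be taken from the block layer's
queue limit (queue_max_zone_append_sectors() is the real block layer
helper; the zone_info field is from this series):
	zone_info->max_zone_append_size =
		(u64)queue_max_zone_append_sectors(
			bdev_get_queue(device->bdev)) << SECTOR_SHIFT;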
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function btrfs_check_zoned_mode() to check if ZONED flag is
enabled on the file system and if the file system consists of zoned
devices with equal zone size.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a zoned block device is found, get its zone information (number of
zones and zone size). To avoid costly run-time zone report
commands to test the device zones type during block allocation, attach
the seq_zones bitmap to the device structure to indicate if a zone is
sequential or accepts random writes. Also attach the empty_zones bitmap
to indicate whether a zone is empty.
This patch also introduces the helper function btrfs_dev_is_sequential()
to test if the zone storing a block is a sequential write required zone,
and btrfs_dev_is_empty_zone() to test if the zone is an empty zone.
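A sketch of how such a helper can look, following the description above
(treat the exact zone_info layout as an assumption):
static inline bool btrfs_dev_is_sequential(struct btrfs_device *device,
					   u64 pos)
{
	struct btrfs_zoned_device_info *zinfo = device->zone_info;

	if (!zinfo)
		return false;	/* not zoned: all zones accept random writes */
	return test_bit(pos >> ilog2(zinfo->zone_size), zinfo->seq_zones);
}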
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both kernfs_node_from_dentry() and kernfs_dentry_node() check whether
the dentry's inode is NULL, which is superfluous.
So remove the check from kernfs_node_from_dentry().
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hui Su <sh_def@163.com>
Link: https://lore.kernel.org/r/20201113132143.GA119541@rlk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We don't yet support dax on reflinked files, but that is in the works.
Further, having the flag set does not automatically mean that the inode
is actually "in the CPU direct access state," which depends on several
other conditions in addition to the flag being set.
As such, we should not catch this as corruption in the verifier - simply
not actually enabling S_DAX on reflinked files is enough for now.
Fixes: 4f435ebe7d ("xfs: don't mix reflink and DAX mode for now")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix the scrubber too]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In commit 27c14b5daa we started tracking the last inode seen during an
inode walk to avoid infinite loops if a corrupt inobt record happens to
have a lower ir_startino than the record preceding it. Unfortunately,
the assertion trips over the case where there are completely empty inobt
records (which can happen quite easily on 64k page filesystems) because
we advance the tracking cursor without actually putting the empty record
into the processing buffer. Fix the assert to allow for this case.
Reported-by: zlang@redhat.com
Fixes: 27c14b5daa ("xfs: ensure inobt record walks always make forward progress")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Since *init_cursor() can always return a valid cursor, the NULL checks
in the callers are unneeded, so clean them up.
This also keeps the behavior consistent with other callers.
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Introduce a common helper to consolidate stripe validation process.
Also make kernel code xfs_validate_sb_common() use it first.
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The quota option 'usrquota' should be shown if both the XFS_UQUOTA_ACCT
and XFS_UQUOTA_ENFD flags are set. The option 'uqnoenforce' should be
shown when only the XFS_UQUOTA_ACCT flag is set. The current code logic
is wrong; fix it and show the proper options.
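The intended logic, as a sketch (the quota flags are real; the seq_file
plumbing here is illustrative):
	if (mp->m_qflags & XFS_UQUOTA_ACCT) {
		if (mp->m_qflags & XFS_UQUOTA_ENFD)
			seq_puts(m, ",usrquota");
		else
			seq_puts(m, ",uqnoenforce");
	}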
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There are no callers of the XFS_B_FSB_OFFSET macro, so remove it.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The function posix_acl_release() tests the passed-in argument and
moves on only when it is non-NULL, so the NULL check in
xfs_generic_create() is unnecessary.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The xfs_trans_mod_dquot() function will allocate a new tp->t_dqinfo if
it is NULL and make the changes in
tp->t_dqinfo->dqs[XFS_QM_TRANS_{USR,GRP,PRJ}]. Nowadays it seems none
of the callers want to join the dquots to the transaction and push them
to the device when the delta is zero. In fact, most of the time the
callers check the delta and proceed only when it is non-zero, so we
should bail out when it is zero.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Nowadays the only things the XFS_TRANS_DQ_DIRTY flag does are to
indicate that the tp->t_dqinfo->dqs[XFS_QM_TRANS_{USR,GRP,PRJ}] values
changed, and to serve as the check in xfs_trans_apply_dquot_deltas()
and its unreserve variant xfs_trans_unreserve_and_mod_dquots(). We can
use the tp->t_dqinfo value instead: since we allocate a new tp->t_dqinfo
only when the qtrx values change, tp->t_dqinfo being non-NULL is
equivalent to the XFS_TRANS_DQ_DIRTY flag being set. Therefore we only
need to check whether tp->t_dqinfo == NULL in
xfs_trans_apply_dquot_deltas() and its unreserve variant to determine
whether to lock all of the dquots and join them to the transaction.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The function xfs_trans_mod_dquot_byino() wraps around
xfs_trans_mod_dquot() to account for quotas, and there is also the call
chain xfs_trans_reserve_quota_bydquots -> xfs_trans_dqresv ->
xfs_trans_mod_dquot. Both of them duplicate the NULL check and
allocation, so delete the duplicated operations from them.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Get rid of this one-off namespace since we're done converting things to
fscontext now.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Refactor all the open-coded validation of file block ranges into a
single helper, and teach the bmap scrubber to check the ranges.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Refactor all the open-coded validation of realtime device extents into a
single helper.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Refactor all the open-coded validation of non-static data device extents
into a single helper.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
It's possible that xfs_iget can return EINVAL for inodes that the inobt
thinks are free, or ENOENT for inodes that look free. If this is the
case, mark the directory corrupt immediately when we check ftype. Note
that we already check the ftype of the '.' and '..' entries, so we
can skip the iget part since we already know the inode type for '.' and
we have a separate parent pointer scrubber for '..'.
Fixes: a5c46e5e89 ("xfs: scrub directory metadata")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_iget can return -ENOENT for a file that the inobt thinks is
allocated but has zeroed mode. This currently causes scrub to exit
with an operational error instead of flagging this as a corruption. The
end result is that scrub mistakenly reports the ENOENT to the user
instead of "directory parent pointer corrupt" like we do for EINVAL.
Fixes: 5927268f5a ("xfs: flag inode corruption if parent ptr doesn't get us a real inode")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Detect file block mappings with a blockcount that's either so large that
integer overflows occur or is zero, because neither is valid in the
filesystem. Worse yet, attempting directory modifications causes the
iext code to trip over the bmbt key handling and takes the filesystem
down. We can fix most of this by preventing the bad metadata from
entering the incore structures in the first place.
Found by setting blockcount=0 in a directory data fork mapping and
watching the fireworks.
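The kind of check this adds, sketched (br_startoff and br_blockcount
are the real xfs_bmbt_irec fields; the fault-address return style
mirrors the xfs verifiers, placement is illustrative):
	if (irec->br_blockcount == 0 ||
	    irec->br_startoff + irec->br_blockcount <= irec->br_startoff)
		return __this_address;	/* zero-length or overflowing mapping */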
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Add a trace point so that we can capture when a recovered log intent
item fails to recover.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The rmap and refcount log intent items were added to support the rmap
and reflink features. Because these features come with changes to the
ondisk format, the log items aren't tied to a log incompat flag.
However, the log recovery routines don't actually check for those
feature flags. The kernel has no business replaying an intent item for a
feature that isn't enabled, so check that as part of recovered log item
validation. (Note that kernels pre-dating rmap and reflink already fail
log recovery on the unknown log item type code.)
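For instance, rmap intent recovery can bail out early with a check like
this sketch (the feature predicate is real; placement and error choice
are illustrative):
	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
		return -EFSCORRUPTED;	/* refuse to replay the intent */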
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The code that validates recovered extent-free intent items is kind of a
mess -- it doesn't use the standard xfs type validators, and it doesn't
check for things that it should. Fix the validator function to use the
standard validation helpers and look for more types of obvious errors.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When we recover an extent-free intent from the log, we need to validate
its contents before we try to replay them. Hoist the checking code into
a separate function in preparation to refactor this code to use
validation helpers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The code that validates recovered refcount intent items is kind of a
mess -- it doesn't use the standard xfs type validators, and it doesn't
check for things that it should. Fix the validator function to use the
standard validation helpers and look for more types of obvious errors.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When we recover a refcount intent from the log, we need to validate its
contents before we try to replay them. Hoist the checking code into a
separate function in preparation to refactor this code to use validation
helpers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The code that validates recovered rmap intent items is kind of a mess --
it doesn't use the standard xfs type validators, and it doesn't check
for things that it should. Fix the validator function to use the
standard validation helpers and look for more types of obvious errors.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When we recover an rmap intent from the log, we need to validate its
contents before we try to replay them. Hoist the checking code into a
separate function in preparation to refactor this code to use validation
helpers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The code that validates recovered bmap intent items is kind of a mess --
it doesn't use the standard xfs type validators, and it doesn't check
for things that it should. Fix the validator function to use the
standard validation helpers and look for more types of obvious errors.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When we recover a bmap intent from the log, we need to validate its
contents before we try to replay them. Hoist the checking code into a
separate function in preparation to refactor this code to use validation
helpers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Make it so that libxfs recognizes the needsrepair feature. Note that
the kernel will still refuse to mount these.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Define an incompat feature flag to indicate that the filesystem needs to
be repaired. While libxfs will recognize this feature, the kernel will
refuse to mount if the feature flag is set, and only xfs_repair will be
able to clear the flag. The goal here is to force the admin to run
xfs_repair to completion after upgrading the filesystem, or if we
otherwise detect anomalies.
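The mount-time gate then looks roughly like this (a sketch; the
predicate name follows this series, message text and error code are
illustrative):
	if (xfs_sb_version_needsrepair(&mp->m_sb)) {
		xfs_alert(mp, "Filesystem needs repair, run xfs_repair");
		return -EFSCORRUPTED;
	}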
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
For the case of NFSv4, specify to the client that the pre/post-op
attributes were not recorded atomically with the main operation.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Don't set PF_LOCAL_THROTTLE on remote filesystems like NFS, since they
aren't expected to ever be subject to double buffering.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If the underlying filesystem times out, then we want knfsd to return
NFSERR_JUKEBOX/DELAY rather than NFSERR_STALE.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
In order to allow nfsd to accept return values that are not
acceptable to overlayfs and others, add a new function.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
It's not uncommon for some workloads to do a bunch of I/O to a file and
delete it just afterward. If knfsd has a cached open file however, then
the file may still be open when the dentry is unlinked. If the
underlying filesystem is nfs, then that could trigger it to do a
sillyrename.
On a REMOVE or RENAME, scan the nfsd_file cache for open files that
correspond to the inode, and proactively unhash and put their
references. This should prevent any delete-on-last-close activity from
occurring, solely due to knfsd's open file cache.
This must be done synchronously though so we use the variants that call
flush_delayed_fput. There are deadlock possibilities if you call
flush_delayed_fput while holding locks, however. In the case of
nfsd_rename, we don't even do the lookups of the dentries to be renamed
until we've locked for rename.
Once we've figured out what the target dentry is for a rename, check to
see whether there are cached open files associated with it. If there
are, then unwind all of the locking, close them all, and then reattempt
the rename.
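Sketched, the rename path becomes a retry loop along these lines
(simplified, error handling omitted; the nfsd_* helper names follow the
nfsd file cache code but treat the details as assumptions):
retry:
	trap = lock_rename(tdentry, fdentry);
	odentry = lookup_one_len(fname, fdentry, flen);
	ndentry = lookup_one_len(tname, tdentry, tlen);
	if (nfsd_has_cached_files(ndentry)) {
		/* cannot flush delayed fput while holding rename locks */
		dput(odentry);
		unlock_rename(tdentry, fdentry);
		nfsd_close_cached_files(ndentry);	/* synchronous close */
		dput(ndentry);
		goto retry;
	}
	/* otherwise proceed with vfs_rename() as before */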
None of this is really necessary for "typical" filesystems though. It's
mostly of use for NFS, so declare a new export op flag and use that to
determine whether to close the files beforehand.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When we start allowing NFS to be reexported, then we have some problems
when it comes to subtree checking. In principle, we could allow it, but
it would mean encoding parent info in the filehandles and there may not
be enough space for that in an NFSv3 filehandle.
To enforce this at export upcall time, we add a new export_ops flag
that declares the filesystem ineligible for subtree checking.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
With NFSv3 nfsd will always attempt to send along WCC data to the
client. This generally involves saving off the in-core inode information
prior to doing the operation on the given filehandle, and then issuing a
vfs_getattr to it after the op.
Some filesystems (particularly clustered or networked ones) have an
expensive ->getattr inode operation. Atomicity is also often difficult
or impossible to guarantee on such filesystems. For those, we're best
off not trying to provide WCC information to the client at all, and to
simply allow it to poll for that information as needed with a GETATTR
RPC.
This patch adds a new flags field to struct export_operations, and
defines a new EXPORT_OP_NOWCC flag that filesystems can use to indicate
that nfsd should not attempt to provide WCC info in NFSv3 replies. It
also adds a blurb about the new flags field and flag to the exporting
documentation.
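Opting in is then a one-liner in a filesystem's export_operations; a
sketch (the flag is the one this patch defines, the surrounding ops are
hypothetical):
	static const struct export_operations example_export_ops = {
		.fh_to_dentry	= example_fh_to_dentry,	/* hypothetical */
		.flags		= EXPORT_OP_NOWCC,	/* skip WCC collection */
	};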
The server will now also skip collecting this information for NFSv2,
since that info is never used there anyway.
Note that this patch does not add this flag to any filesystem
export_operations structures. This was originally developed to allow
reexporting nfs via nfsd.
Other filesystems may want to consider enabling this flag too. It's hard
to tell however which ones have export operations to enable export via
knfsd and which ones mostly rely on them for open-by-filehandle support,
so I'm leaving that up to the individual maintainers to decide. I am
cc'ing the relevant lists for those filesystems that I think may want to
consider adding this though.
Cc: HPDD-discuss@lists.01.org
Cc: ceph-devel@vger.kernel.org
Cc: cluster-devel@redhat.com
Cc: fuse-devel@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This reverts commit a85857633b.
We're still factoring ctime into our change attribute even in the
IS_I_VERSION case. If someone sets the system time backwards, a client
could see the change attribute go backwards. Maybe we can just say
"well, don't do that", but there's some question whether that's good
enough, or whether we need a better guarantee.
Also, the client still isn't actually using the attribute.
While we're still figuring this out, let's just stop returning this
attribute.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
inode_query_iversion() has side effects, and there's no point calling it
when we're not even going to use it.
We check whether we're currently processing a v4 request by checking
fh_maxsize, which is arguably a little hacky; we could add a flag to
svc_fh instead.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Minor cleanup, no change in behavior.
Also pull out a common helper that'll be useful elsewhere.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
It doesn't make sense to carry all these extra fields around. Just
make everything into change attribute from the start.
This is just cleanup, there should be no change in behavior.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
inode_query_iversion() can modify i_version. Depending on the exported
filesystem, that may not be safe. For example, if you're re-exporting
NFS, NFS stores the server's change attribute in i_version and does not
expect it to be modified locally. This has been observed causing
unnecessary cache invalidations.
The way a filesystem indicates that it's OK to call
inode_query_iversion() is by setting SB_I_VERSION.
So, move the I_VERSION check out of encode_change(), where it's used
only in GETATTR responses, to nfsd4_change_attribute(), which is
also called for pre- and post- operation attributes.
(Note we could also pull the NFSEXP_V4ROOT case into
nfsd4_change_attribute() as well. That would actually be a no-op,
since pre/post attrs are only used for metadata-modifying operations,
and V4ROOT exports are read-only. But we might make the change in
the future just for simplicity.)
Reported-by: Daire Byrne <daire@dneg.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since commit b4868b44c5 ("NFSv4: Wait for stateid updates after
CLOSE/OPEN_DOWNGRADE"), every inter-server copy operation suffers a 5
second delay regardless of the size of the copy. The delay comes from
nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential
fails because the seqid in both nfs4_state and nfs4_stateid is 0.
Fix by modifying nfs4_init_cp_state to return the stateid with seqid 1
instead of 0. This is also to conform with section 4.8 of RFC 7862.
Here is the relevant paragraph from section 4.8 of RFC 7862:
A copy offload stateid's seqid MUST NOT be zero. In the context of a
copy offload operation, it is inappropriate to indicate "the most
recent copy offload operation" using a stateid with a seqid of zero
(see Section 8.2.2 of [RFC5661]). It is inappropriate because the
stateid refers to internal state in the server and there may be
several asynchronous COPY operations being performed in parallel on
the same file by the server. Therefore, a copy offload stateid with
a seqid of zero MUST be considered invalid.
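The fix is essentially a one-line initialization; a sketch (the exact
structure and field names in the nfsd copy code are assumptions):
	/* in nfs4_init_cp_state(): RFC 7862, section 4.8 forbids seqid 0 */
	stid->sc_stateid.si_generation = 1;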
Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
linux/fs/nfsd/nfs4proc.c:1542:24: warning: incorrect type in assignment (different base types)
linux/fs/nfsd/nfs4proc.c:1542:24: expected restricted __be32 [assigned] [usertype] status
linux/fs/nfsd/nfs4proc.c:1542:24: got int
Clean-up: The dup_copy_fields() function returns only zero, so make
it return void for now, and get rid of the return code check.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The warning message from nfsd terminating normally can confuse system
administrators or monitoring software. Though it's not exactly fair to
pinpoint a commit where it originated, the current form in the current
place started to appear in:
Fixes: e096bbc648 ("knfsd: remove special handling for SIGHUP")
Signed-off-by: kazuo ito <kzpn200@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Try to forcibly switch to in-place I/O under low-memory scenarios in
order to avoid direct memory reclaim due to cached page allocation.
Link: https://lore.kernel.org/r/20201209123717.12430-1-hsiangkao@aol.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
A couple of the superblock validation checks apply only to the kernel,
so move them to xfs_fc_fill_super before we add the needsrepair "feature",
which will prevent the kernel (but not xfsprogs) from mounting the
filesystem. This also reduces the diff between kernel and userspace
libxfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
There's a memory leak in afs_parse_source() whereby multiple source=
parameters overwrite fc->source in the fs_context struct without freeing
the previously recorded source.
Fix this by only permitting a single source parameter and rejecting with
an error all subsequent ones.
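The fix reduces to rejecting any second source= parameter early; a
sketch (invalf() is the real fs_context helper):
	/* at the top of afs_parse_source() */
	if (fc->source)
		return invalf(fc, "kAFS: Multiple sources not supported");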
This was caught by syzbot with the kernel memory leak detector, showing
something like the following trace:
unreferenced object 0xffff888114375440 (size 32):
comm "repro", pid 5168, jiffies 4294923723 (age 569.948s)
backtrace:
slab_post_alloc_hook+0x42/0x79
__kmalloc_track_caller+0x125/0x16a
kmemdup_nul+0x24/0x3c
vfs_parse_fs_string+0x5a/0xa1
generic_parse_monolithic+0x9d/0xc5
do_new_mount+0x10d/0x15a
do_mount+0x5f/0x8e
__do_sys_mount+0xff/0x127
do_syscall_64+0x2d/0x3a
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 13fcc68370 ("afs: Add fs_context support")
Reported-by: syzbot+86dc6632faaca40133ab@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During recovery, we may fail to update the inline xattr count
correctly; fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Account the inline inode in the stats in f2fs_recover_inline_data();
this was previously missed.
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the third case, it should remove data blocks instead of inline_data.
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Many flash devices read and write a single IO based on a multiple of
4KB, and we only support a 4KB page cache size now.
Since we already check the page size in init_f2fs_fs(), remove the page
size check in sanity_check_raw_super().
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Shaohua Liu <liush@allwinnertech.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use F2FS_ROOT_INO, F2FS_NODE_INO and F2FS_META_INO macro
for better code readability.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Shaohua Liu <liush@allwinnertech.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull seq_file fix from Al Viro:
"This fixes a regression introduced in this cycle wrt iov_iter based
variant for reading a seq_file"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix return values of seq_read_iter()
This patch introduces the ZONED incompat flag. The flag indicates that
the volume management will satisfy the constraints imposed by
host-managed zoned block devices (aligned chunk allocation, append-only
updates, reset zone after filled).
As the zoned support will happen incrementally due to enhancing some
core infrastructure like super block writes, tree-log, raid support, the
feature will appear in sysfs only on debug builds. It will be enabled
once the support is feature complete and applications can reliably check
whether zoned support is present or not.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It simply gets assigned to 'ret' in case of errors. The flow of the
while loop is not changed by this commit since the few call sites
that 'goto next' will simply break from the loop.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In most cases when an error is returned from a function 'ret' is simply
assigned to 'err'. There is only one case where walk_up_reloc_tree can
return a positive value - in this case the code breaks from the loop and
ret is going to get its return value from btrfs_cow_block - either 0 or
negative. This retains the old logic of how 'err' used to be set at
this call site.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use only a single 'ret' to control whether we should abort the
transaction or not. That's fine, because if we abort a transaction then
btrfs_end_transaction will return the same value as passed to
btrfs_abort_transaction. No semantic changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we are attempting to start writeback for an existing extent in NOCOW
mode, at run_delalloc_nocow(), we must check if the extent is shared, and
if it is, fallback to a COW write. However we do such check while still
holding a read lock on the leaf that contains the file extent item, and
that check, the call to btrfs_cross_ref_exist(), can take some time
because:
1) It needs to do a search on the extent tree, which obviously takes some
time, specially if delayed references are being run at the moment, as
we can block when trying to lock currently write locked btree nodes;
2) It needs to check the delayed references for any existing reference
for our data extent, this requires acquiring the delayed references'
spinlock and maybe block on the mutex of a delayed reference head in the
case where there is a delayed reference for our data extent, in the
worst case it makes us release the path on the extent tree and retry
the whole process again (going back to step 1).
There are other operations we do while holding the leaf locked that can
take some significant time as well (specially all together):
* btrfs_extent_readonly() - to check if the block group containing the
extent is currently in RO mode. This requires taking a spinlock and
searching for the block group in a rbtree that can be big on large
filesystems;
* csum_exist_in_range() - to search if there are any checksums in the
csum tree for the extent. Like before, this can take some time if we are
in a filesystem that has both COW and NOCOW files, in which case the
csum tree is not empty;
* btrfs_inc_nocow_writers() - increment the number of nocow writers in the
block group that contains the data extent. Needs to acquire a spinlock
and search for the block group in a rbtree that can be big on large
filesystems.
So just unlock the leaf (release the path) before doing all those checks,
since we do not need it anymore. In case we cannot do a NOCOW write for
the extent, due to any of those checks failing, and the writeback range
goes beyond that extent's length, we will do another btree search for
the next file extent item.
The following script that calls dbench was used to measure the impact of
this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
directly (no intermediary filesystem on the host) and using a non-debug
kernel (default configuration on Debian):
$ cat test-dbench.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-m single -d single"
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 300 64
umount $MNT
Before this change:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9326331 0.317 399.957
Close 6851198 0.002 6.402
Rename 394894 2.621 402.819
Unlink 1883131 0.931 398.082
Deltree 256 19.160 303.580
Mkdir 128 0.003 0.016
Qpathinfo 8452314 0.068 116.133
Qfileinfo 1481921 0.001 5.081
Qfsinfo 1549963 0.002 4.444
Sfileinfo 759679 0.084 17.079
Find 3268168 0.396 118.196
WriteX 4653310 0.056 110.993
ReadX 14618818 0.005 23.314
LockX 30364 0.003 0.497
UnlockX 30364 0.002 1.720
Flush 653619 16.954 569.299
Throughput 966.651 MB/sec 64 clients 64 procs max_latency=569.377 ms
After this change:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9710433 0.302 232.449
Close 7132948 0.002 11.496
Rename 411144 2.452 131.805
Unlink 1960961 0.893 230.383
Deltree 256 14.858 198.646
Mkdir 128 0.002 0.005
Qpathinfo 8800890 0.066 111.588
Qfileinfo 1542556 0.001 3.852
Qfsinfo 1613835 0.002 5.483
Sfileinfo 790871 0.081 19.492
Find 3402743 0.386 120.185
WriteX 4842918 0.054 179.312
ReadX 15220407 0.005 32.435
LockX 31612 0.003 1.533
UnlockX 31612 0.002 1.047
Flush 680567 16.320 463.323
Throughput 1016.59 MB/sec 64 clients 64 procs max_latency=463.327 ms
+5.0% throughput, -20.5% max latency
Also, the following test using fio was run:
$ cat test-fio.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 4 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
BLOCK_SIZE=$4
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=randwrite
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=$BLOCK_SIZE
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF
echo
echo "Using fio config:"
echo
cat /tmp/fio-job.ini
echo
echo "mount options: $MOUNT_OPTIONS"
echo
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo "Creating nodatacow files before fio runs..."
for ((i = 0; i < $NUM_JOBS; i++)); do
xfs_io -f -c "pwrite -b 128M 0 $FILE_SIZE" "$MNT/writers.$i.0"
done
sync
fio /tmp/fio-job.ini
umount $MNT
Before this change:
$ ./test-fio.sh 16 512M 2 4K
(...)
WRITE: bw=28.3MiB/s (29.6MB/s), 28.3MiB/s-28.3MiB/s (29.6MB/s-29.6MB/s), io=8192MiB (8590MB), run=289800-289800msec
After this change:
$ ./test-fio.sh 16 512M 2 4K
(...)
WRITE: bw=31.2MiB/s (32.7MB/s), 31.2MiB/s-31.2MiB/s (32.7MB/s-32.7MB/s), io=8192MiB (8590MB), run=262845-262845msec
+9.7% throughput, -9.8% runtime
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree checker is called many times as it verifies metadata at
read/write time. The checks follow a simple pattern:
if (error_condition) {
report_error();
return -EUCLEAN;
}
All the error reporting functions are annotated as __cold, which is
supposed to hint the compiler to move the statement block out of the hot
path. This does not seem to happen that often.
As the error condition is expected to be false almost always, we can
annotate it with 'unlikely' as this satisfies one of the few use cases
for the annotation. The expected outcome is a stronger hint to the
compiler to reorder the checks
test
jump to exit
test
jump to exit
...
which can be observed in asm of eg. check_dir_item,
btrfs_check_chunk_valid, check_root_item or check_leaf.
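Concretely, the annotated form of the pattern quoted above is:
	if (unlikely(error_condition)) {
		report_error();
		return -EUCLEAN;
	}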
There's a measurable run time improvement reported by Josef, the testing
workload went from 655 MiB/s to 677 MiB/s, which is about +3%.
There should be no functional changes, but some of the conditions have
been rewritten to produce a more readable result; some lines are longer
than 80 characters for the sake of readability.
Signed-off-by: David Sterba <dsterba@suse.com>
Without a NULL fs_info the helpers will print something like
BTRFS error (device <unknown>): ...
This can happen in contexts where fs_info is not available at all or
it's potentially unsafe due to object lifetime. The <unknown> stub does
not bring much information and with the prefix makes the message
unnecessarily longer.
Remove it for the NULL fs_info case.
BTRFS error: ...
Callers can add the device information to the message itself if needed.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In alloc_extent_buffer(), after we got a page from btree inode, we check
if that page has private pointer attached.
If attached, we check if the existing extent buffer has proper refs.
If not (the eb is being freed), we will detach that private eb pointer.
The point here is, we are detaching that eb pointer by calling:
- ClearPagePrivate()
- put_page()
The put_page() here is especially confusing, as it's decreasing the ref
from attach_page_private(). Without knowing that, it looks like the
put_page() is for the find_or_create_page() call, confusing the reader.
Since we're always modifying page private with attach_page_private() and
detach_page_private(), the only open-coded detach_page_private() here is
really confusing.
Fix it by calling detach_page_private().
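The change boils down to the following (both forms drop the page's
private reference; detach_page_private() additionally clears
page->private, making it a self-documenting equivalent here):
	/* before: open-coded detach */
	ClearPagePrivate(page);
	put_page(page);
	/* after */
	detach_page_private(page);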
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_lookup_bio_sums(), if the bio is pretty large, we want to
start readahead in the csum tree.
However the threshold is a magic number, (PAGE_SIZE * 8), from the
initial btrfs merge.
The meaning of the value is pretty hard to guess, especially when the
number dates from the times when 4K sectorsize was the default and only
CRC32C was supported.
For the most common btrfs setup, CRC32C csum and 4K sectorsize,
it means just a 32K read would kick readahead, while the csum itself is
only 32 bytes in size.
Now let's be more reasonable by taking both csum size and node size into
consideration.
If the csum size for the bio is larger than one leaf, then we kick the
readahead. This means for current default btrfs, the threshold will be
16M.
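The arithmetic behind the 16M figure, sketched for the current defaults
(16K nodesize, 4K sectorsize, 4-byte CRC32C checksums; leaf header and
item overhead ignored):
	/* csums per leaf: 16K / 4 bytes        ~= 4096 */
	/* data covered:   4096 csums * 4K each ~= 16M  */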
This change should not change performance observably, thus this is
mostly a readability enhancement.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_invalidatepage() will try to clear all possible bits since it's
calling clear_extent_bit() with delete == 1.
This is currently fine, since for the btree io tree, it only utilizes
the EXTENT_LOCKED bit. But this could be a problem for later subpage
support,
which will utilize extra io tree bit to represent additional info.
This patch will just convert that clear_extent_bit() to
unlock_extent_cached().
For current code since only EXTENT_LOCKED bit is utilized, this doesn't
change the behavior, but provides a much cleaner basis for incoming
subpage support.
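The conversion, sketched (both calls are the real extent io tree API of
this era; the flag set is simplified to the one bit that matters here):
	/* before: delete-style clear, clears every bit incl. EXTENT_LOCKED */
	clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 1, &cached_state);
	/* after: only unlock the range, leaving future subpage bits alone */
	unlock_extent_cached(tree, start, end, &cached_state);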
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Parameter @phy_offset is the offset against the bio->bi_iter.bi_sector.
@phy_offset is mostly for data io to lookup the csum in btrfs_io_bio.
But for metadata, it's completely useless, as metadata stores its own
csum in its header, so we can remove it.
Note: parameters @start and @end are not utilized at all for the
current sectorsize == PAGE_SIZE case, as we can grab the eb directly
from the page.
But those two parameters are very important for later subpage support,
thus @start/@end are not touched here.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
That anonymous structure serves no special purpose, so just replace it
with regular members.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the type is unsigned int which could change its width
depending on the architecture. We need up to 32 bits so make it
explicit.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new helper to handle the page status update in
end_bio_extent_readpage(). This will later be used for subpage support,
where the page status update can be more complex than now.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In end_bio_extent_readpage() we had a strange dance around
extent_start/extent_len.
Hidden behind the strange dance is the fact that it's just calling
endio_readpage_release_extent() on each bvec range.
Here is an example to explain the original work flow:
Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
end_bio_extent_readpage() entered
|- extent_start = 0;
|- extent_end = 0;
|- bio_for_each_segment_all() {
| |- /* Got the 1st bvec */
| |- start = SZ_1M;
| |- end = SZ_1M + SZ_4K - 1;
| |- update = 1;
| |- if (extent_len == 0) {
| | |- extent_start = start; /* SZ_1M */
| | |- extent_len = end + 1 - start; /* SZ_4K */
| | }
| |
| |- /* Got the 2nd bvec */
| |- start = SZ_1M + SZ_4K;
| |- end = SZ_1M + SZ_8K - 1;
| |- update = 1;
| |- if (extent_start + extent_len == start) {
| | |- extent_len += end + 1 - start; /* SZ_8K */
| | }
| } /* All bio vec iterated */
|
|- if (extent_len) {
|- endio_readpage_release_extent(tree, extent_start, extent_len,
update);
/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
As the above flow shows, the existing code in end_bio_extent_readpage()
accumulates extent_start/extent_len, and when the contiguous range
stops, calls endio_readpage_release_extent() for the range.
However the current behavior has a few things not really considered:
- The inode can change
  For a bio, its pages don't need to have contiguous page_offset.
  This means even pages from different inodes can be packed into one
  bio.
- A bvec can cross page boundaries
  There is a feature called multi-page bvec, where bvec->bv_len can go
  beyond the bvec->bv_page boundary.
- Poor readability
This patch will address the problem:
- Introduce a proper structure, processed_extent, to record processed
extent range
- Integrate inode/start/end/uptodate check into
endio_readpage_release_extent()
- Add more comment on each step.
This should greatly improve the readability, now in
end_bio_extent_readpage() there are only two
endio_readpage_release_extent() calls.
- Add inode check for contiguity
Now we also ensure the inode is the same one before checking if the
range is contiguous.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In extent-io-test, there are two invalid tests:
- Invalid nodesize for test_eb_bitmaps()
Instead of the sectorsize and nodesize combination passed in, we're
always using a hand-crafted nodesize, e.g.:
len = (sectorsize < BTRFS_MAX_METADATA_BLOCKSIZE)
? sectorsize * 4 : sectorsize;
In the above case, if we have a 32K page size, then we will get a
length of 128K, which is beyond the max node size, and obviously
invalid. Common page sizes only go up to 64K, so we haven't hit that in
practice.
- Invalid extent buffer bytenr
For 64K page size, the only combination we're going to test is
sectorsize = nodesize = 64K.
However, in that case we will try to test an eb whose bytenr is not
sectorsize aligned:
/* Do it over again with an extent buffer which isn't page-aligned. */
eb = __alloc_dummy_extent_buffer(fs_info, nodesize / 2, len);
Sector alignment is a hard requirement for any sector size.
The only exception is the superblock, but anything else should follow
sector size alignment.
This is definitely an invalid test case.
This patch will fix both problems by:
- Honor the sectorsize/nodesize combination
Now we won't bother to hand-craft the length and use it as nodesize.
- Use sectorsize as the 2nd run extent buffer start
This would test the case where extent buffer is aligned to sectorsize
but not always aligned to nodesize.
Please note that later subpage related cleanups will reduce
extent_buffer::pages[] to exactly what we need, making sector-unaligned
extent buffer operations cause problems.
Since only extent_io self tests utilize this, this patch is required for
all later cleanup/refactoring.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A semicolon is not needed after a switch statement.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is needlessly convoluted. Fix that by:
* removing redundant sret variable definition in both if arms
* replace the again/done labels with direct return statements, the
function is short enough and doesn't do anything special upon exit
* remove BUG_ON on split_node returning a positive number - it can't
happen as split_node returns either 0 or a negative error code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At the point when we set 'ret = 0' it's guaranteed that the function is
going to return 0 so directly return 0. No functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At inode.c:cow_file_range_inline(), after we insert the inline extent
in the fs/subvolume btree, we call btrfs_drop_extent_cache() to drop
all extent maps in the file range, however that is not necessary because
we have already done it in the call to btrfs_drop_extents(), which calls
btrfs_drop_extent_cache() for us, and since at this point we have the file
range locked in the inode's iotree (we are in the writeback path), we know
no other task can come in and read stale file extent items or find none
and therefore create either stale extent maps or an extent map that
represents a hole.
So just remove that unnecessary call to btrfs_drop_extent_cache(), as it's
doing nothing and only wasting time. This call has been around since 2008,
introduced in commit c8b978188c ("Btrfs: Add zlib compression support"),
but even back then it seems it was not necessary, since we had the range
locked in the inode's iotree and the call to btrfs_drop_extents() already
used to always call btrfs_drop_extent_cache().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When joining a log transaction we acquire the root's log mutex, then
increment the root's log batch and log writers counters while holding
the mutex. However we don't need to increment the log batch there,
because we are holding the mutex and incremented the log writers counter
as well, so any other task trying to sync log will wait for the current
task to finish its logging and still achieve the desired log batching.
Since the log batch counter is an atomic counter and is incremented twice
at the very beginning of the fsync callback (btrfs_sync_file()), once
before flushing delalloc and once again after waiting for writeback to
complete, eliminating its increment when joining the log transaction
may provide some performance gains in case we have multiple concurrent
tasks doing fsyncs against different files in the same subvolume, as it
reduces contention on the atomic (locking the cacheline and bouncing it).
When testing fio with 32 jobs, on a 8 cores VM, doing fsyncs against
different files of the same subvolume, on top of a zram device, I could
consistently see gains (higher throughput) between 1% to 2%, which is a
very low value and possibly hard to be observed with a real device (I
couldn't observe consistent gains with my low/mid end NVMe device).
So this change is mostly motivated to just simplify the logic, as updating
the log batch counter is only relevant when an fsync starts and while not
holding the root's log mutex.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Every time we log an inode we lookup in the fs/subvol tree for xattrs and
if we have any, log them into the log tree. However it is very common to
have inodes without any xattrs, so doing the search wastes time, but
more importantly it adds contention on the fs/subvol tree locks, either
making the logging code block and wait for tree locks, or making other
concurrent operations block and wait.
The most typical use cases where xattrs are used are when capabilities or
ACLs are defined for an inode, or when SELinux is enabled.
This change makes the logging code detect when an inode does not have
xattrs and skip the xattrs search the next time the inode is logged,
unless the inode is evicted and loaded again, or an xattr is added to
the inode. This way we skip the search for xattrs on inodes that never
have xattrs and are fsynced with some frequency.
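A sketch of the short-circuit (the runtime flag name follows this
patch; treat the exact placement as an assumption):
	/* in btrfs_log_all_xattrs(), before searching the fs/subvol tree */
	if (test_bit(BTRFS_INODE_NO_XATTRS, &inode->runtime_flags))
		return 0;	/* known to have no xattrs, skip the search */
	/* and after a search that found no xattrs */
	if (!found_xattrs)
		set_bit(BTRFS_INODE_NO_XATTRS, &inode->runtime_flags);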
The following script that calls dbench was used to measure the impact of
this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
directly (no intermediary filesystem on the host) and using a non-debug
kernel (default configuration on Debian distributions):
$ cat test.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 200 40
umount $MNT
The results before this change:
Operation      Count    AvgLat    MaxLat
----------------------------------------
NTCreateX    5761605     0.172   312.057
Close        4232452     0.002    10.927
Rename        243937     1.406   277.344
Unlink       1163456     0.631   298.402
Deltree          160    11.581   221.107
Mkdir             80     0.003     0.005
Qpathinfo    5221410     0.065   122.309
Qfileinfo     915432     0.001     3.333
Qfsinfo       957555     0.003     3.992
Sfileinfo     469244     0.023    20.494
Find         2018865     0.448   123.659
WriteX       2874851     0.049   118.529
ReadX        9030579     0.004    21.654
LockX          18754     0.003     4.423
UnlockX        18754     0.002     0.331
Flush         403792    10.944   359.494

Throughput 908.444 MB/sec  40 clients  40 procs  max_latency=359.500 ms
The results after this change:
Operation      Count    AvgLat    MaxLat
----------------------------------------
NTCreateX    6442521     0.159   230.693
Close        4732357     0.002    10.972
Rename        272809     1.293   227.398
Unlink       1301059     0.563   218.500
Deltree          160     7.796    54.887
Mkdir             80     0.008     0.478
Qpathinfo    5839452     0.047   124.330
Qfileinfo    1023199     0.001     4.996
Qfsinfo      1070760     0.003     5.709
Sfileinfo     524790     0.033    21.765
Find         2257658     0.314   125.611
WriteX       3211520     0.040   232.135
ReadX       10098969     0.004    25.340
LockX          20974     0.003     1.569
UnlockX        20974     0.002     3.475
Flush         451553    10.287   331.037

Throughput 1011.77 MB/sec  40 clients  40 procs  max_latency=331.045 ms
+10.8% throughput, -8.2% max latency
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are only 2 direct calls to set_extent_bit outside of extent-io,
in btrfs_find_new_delalloc_bytes and btrfs_truncate_block; the rest are
thin wrappers around __set_extent_bit. This adds unnecessary indirection
and just makes it more annoying when looking at the various extent bit
manipulation functions. This patch renames __set_extent_bit to
set_extent_bit, effectively removing a level of indirection. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat and remove __must_check ]
Signed-off-by: David Sterba <dsterba@suse.com>
It is unused everywhere now, it can be removed.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is completely unused now, remove it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We no longer use recursion, so
__btrfs_tree_read_lock(BTRFS_NESTING_NORMAL) == btrfs_tree_read_lock.
Replace this call with the simple helper.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We no longer have recursive locking and there's no need for separate
helpers that allowed the transition to rwsem with minimal code changes.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're no longer using recursion, rip out all of the supporting
code. Follow up patches will clean up the callers of these functions.
The extent_buffer::lock_owner is still retained as it allows safety
checks in btrfs_init_new_buffer for the case that the free space cache
is corrupted and we try to allocate a block that we are currently using
and have locked in the path.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With my async free space cache loading patches ("btrfs: load free space
cache asynchronously") we no longer have a user of path->recurse and can
remove it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe reported the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.10.0-rc2-btrfs-next-71 #1 Not tainted
------------------------------------------------------
find/324157 is trying to acquire lock:
ffff8ebc48d293a0 (btrfs-tree-01#2/3){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
but task is already holding lock:
ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (btrfs-tree-00){++++}-{3:3}:
lock_acquire+0xd8/0x490
down_write_nested+0x44/0x120
__btrfs_tree_lock+0x27/0x120 [btrfs]
btrfs_search_slot+0x2a3/0xc50 [btrfs]
btrfs_insert_empty_items+0x58/0xa0 [btrfs]
insert_with_overflow+0x44/0x110 [btrfs]
btrfs_insert_xattr_item+0xb8/0x1d0 [btrfs]
btrfs_setxattr+0xd6/0x4c0 [btrfs]
btrfs_setxattr_trans+0x68/0x100 [btrfs]
__vfs_setxattr+0x66/0x80
__vfs_setxattr_noperm+0x70/0x200
vfs_setxattr+0x6b/0x120
setxattr+0x125/0x240
path_setxattr+0xba/0xd0
__x64_sys_setxattr+0x27/0x30
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (btrfs-tree-01#2/3){++++}-{3:3}:
check_prev_add+0x91/0xc60
__lock_acquire+0x1689/0x3130
lock_acquire+0xd8/0x490
down_read_nested+0x45/0x220
__btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
btrfs_next_old_leaf+0x27d/0x580 [btrfs]
btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
iterate_dir+0x170/0x1c0
__x64_sys_getdents64+0x83/0x140
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                         CPU1
       ----                         ----
  lock(btrfs-tree-00);
                                    lock(btrfs-tree-01#2/3);
                                    lock(btrfs-tree-00);
  lock(btrfs-tree-01#2/3);
*** DEADLOCK ***
5 locks held by find/324157:
#0: ffff8ebc502c6e00 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
#1: ffff8eb97f689980 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: iterate_dir+0x52/0x1c0
#2: ffff8ebaec00ca58 (btrfs-tree-02#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
#3: ffff8eb98f986f78 (btrfs-tree-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
#4: ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
stack backtrace:
CPU: 2 PID: 324157 Comm: find Not tainted 5.10.0-rc2-btrfs-next-71 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
check_noncircular+0xff/0x110
? mark_lock.part.0+0x468/0xe90
check_prev_add+0x91/0xc60
__lock_acquire+0x1689/0x3130
? kvm_clock_read+0x14/0x30
? kvm_sched_clock_read+0x5/0x10
lock_acquire+0xd8/0x490
? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
down_read_nested+0x45/0x220
? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
__btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
btrfs_next_old_leaf+0x27d/0x580 [btrfs]
btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
iterate_dir+0x170/0x1c0
__x64_sys_getdents64+0x83/0x140
? filldir+0x1d0/0x1d0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because btrfs_next_old_leaf searches down to our current
key, and then walks up the path until we can move to the next slot, and
then reads back down the path so we get the next leaf.
However it doesn't unlock any lower levels until it replaces them with
the new extent buffer. This is technically fine, but of course causes
lockdep to complain, because we could be holding locks on lower levels
while locking upper levels.
Fix this by dropping all nodes below the level that we use as our new
starting point before we start reading back down the path. This also
allows us to drop the nested/recursive locking magic, because we're no
longer locking two nodes at the same level anymore.
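A hedged sketch of the shape of the fix, using the usual btrfs_path
conventions (not the literal patch):

  /*
   * Drop everything below the new starting level before re-reading
   * down the path, so we never hold a lower node's lock while
   * acquiring an upper one.
   */
  for (i = 0; i < level; i++) {
          if (!path->nodes[i])
                  continue;
          if (path->locks[i]) {
                  btrfs_tree_unlock_rw(path->nodes[i], path->locks[i]);
                  path->locks[i] = 0;
          }
          free_extent_buffer(path->nodes[i]);
          path->nodes[i] = NULL;
  }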
Reported-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are carrying around this next_rw_lock from when we would do spinning
vs blocking read locks. Now that we have the rwsem locking we can
simply use the read lock flag unconditionally and the read lock helpers.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 343694eee8d8 ("btrfs: switch seed device to list api") failed to
check whether the parameter seed is true in the function
btrfs_find_device(), which tells it whether to traverse the seed device
list or not.
After this commit, the argument is unused and can be removed.
In device_list_add() it's not necessary because fs_devices always points
to the device's fs_devices. So with the devid+uuid matching, it will
find the right device and return, thus not needing to traverse seed
devices.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Drop the condition in verify_one_dev_extent,
btrfs_device::disk_total_bytes is set even for a seed device. The
comment is wrong, the size is properly set when cloning the device.
Commit 1b3922a8bc ("btrfs: Use real device structure to verify
dev extent") introduced it but it's unclear why the total_disk_bytes
was 0.
Theoretically, all devices (including missing and seed) marked with the
BTRFS_DEV_STATE_IN_FS_METADATA flag get their total_disk_bytes updated at
fill_device_from_item():

  open_ctree()
    btrfs_read_chunk_tree()
      read_one_dev()
        open_seed_device()
          fill_device_from_item()

Even if verify_one_dev_extent() reports total_disk_bytes == 0, it's a
bug to be fixed somewhere else and not in verify_one_dev_extent(), as
it's just a messenger. A total_disk_bytes of zero is never expected.
The function fill_device_from_item() does the job of reading it from the
item and updating btrfs_device::disk_total_bytes. So both the missing
device and the seed devices do have their disk_total_bytes updated.
btrfs_find_device can also return a device from fs_info->seed_list
because it searches it as well.
Furthermore, while removing the device if there is a power loss, we
could have a device with its total_bytes = 0, that's still valid.
Instead, introduce a check against maximum block device size in
read_one_dev().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit cf89af146b ("btrfs: dev-replace: fail mount if we don't have
replace item with target device") dropped the multi stage operation of
btrfs_free_extra_devids() that does not need to check replace target
anymore and we can remove the 'step' argument.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several occasions where we do not update the inode's number of
used bytes atomically, resulting in a concurrent stat(2) syscall reporting
a value of used blocks that is not valid, that is, a value that matches
neither what we had before the operation nor what we get after the
operation completes.
In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools that may consider a file
with a non-zero size and zero used blocks as completely sparse and skip
reading data, as reported/discussed a long time ago in some threads like
the following:
https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
The cases where this can happen are the following:
-> Case 1
If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).
This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.
If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).
Example reproducer:
$ cat reproducer-1.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
    trap "wait; exit" SIGTERM
    local filepath=$1
    local expected=$2
    local got

    while :; do
        got=$(stat -c %b $filepath)
        if [ $got -ne $expected ]; then
            echo -n "ERROR: unexpected used blocks"
            echo " (got: $got expected: $expected)"
        fi
    done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
expected=$(stat -c %b $MNT/foobar)
# Create a process to keep calling stat(2) on the file and see if the
# reported number of blocks used (disk space used) changes, it should
# not because we are not increasing the file size nor punching holes.
stat_loop $MNT/foobar $expected &
loop_pid=$!
for ((i = 0; i < 50000; i++)); do
    xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
done
kill $loop_pid &> /dev/null
wait
umount $DEV
$ ./reproducer-1.sh
ERROR: unexpected used blocks (got: 0 expected: 128)
ERROR: unexpected used blocks (got: 0 expected: 128)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 2
If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond EOF, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.
This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However at ordered extent completion, after we
inserted the new extent, we increment the inode's number of bytes used
with the size of the new extent, and only later, when clearing the range
in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
counter with the size of the extent. So this results in a short time
window where a concurrent stat(2) syscall can report a number of used
blocks that accounts for the new extent twice.
Example reproducer:
$ cat reproducer-2.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
    trap "wait; exit" SIGTERM
    local filepath=$1
    local expected=$2
    local got

    while :; do
        got=$(stat -c %b $filepath)
        if [ $got -ne $expected ]; then
            echo -n "ERROR: unexpected used blocks"
            echo " (got: $got expected: $expected)"
        fi
    done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foobar
write_size=$((64 * 1024))
for ((i = 0; i < 16384; i++)); do
    offset=$(($i * $write_size))
    xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
    blocks_used=$(stat -c %b $MNT/foobar)

    # Fsync the file to trigger writeback and keep calling stat(2) on it
    # to see if the number of blocks used changes.
    stat_loop $MNT/foobar $blocks_used &
    loop_pid=$!
    xfs_io -c "fsync" $MNT/foobar

    kill $loop_pid &> /dev/null
    wait $loop_pid
done
umount $DEV
$ ./reproducer-2.sh
ERROR: unexpected used blocks (got: 265472 expected: 265344)
ERROR: unexpected used blocks (got: 284032 expected: 283904)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 3
Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.
The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be, does not match the number
of used blocks before or after the clone/deduplication/zero operation.
Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.
Example reproducer:
$ cat reproducer-3.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f -m reflink=1 $DEV > /dev/null
mount $DEV $MNT
extent_size=$((64 * 1024))
num_extents=16384
file_size=$(($extent_size * $num_extents))
# File foo has many small extents.
xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
> /dev/null
# File bar has far fewer extents and has exactly the same data as foo.
xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
expected=$(stat -c %b $MNT/foo)
# Now deduplicate bar into foo. While the deduplication is in progress,
# the number of used blocks/file size reported by stat should not change.
xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
dedupe_pid=$!
while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
used=$(stat -c %b $MNT/foo)
if [ $used -ne $expected ]; then
echo "Unexpected blocks used: $used (expected: $expected)"
fi
done
umount $DEV
$ ./reproducer-3.sh
Unexpected blocks used: 2076800 (expected: 2097152)
Unexpected blocks used: 2097024 (expected: 2097152)
Unexpected blocks used: 2079872 (expected: 2097152)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
So fix this by:
1) Making btrfs_drop_extents() not decrement the VFS inode's number of
   bytes, and instead return the number of dropped bytes;
2) Making any code that drops extents and adds new extents update the
   inode's number of bytes atomically, while holding the btrfs inode's
   spinlock, which is also used by the stat(2) callback to get the
   inode's number of bytes;
3) For ranges in the inode's iotree that are marked as 'delalloc new',
   corresponding to previously unallocated ranges, increment the inode's
   number of bytes when clearing the 'delalloc new' bit from the range,
   in the same critical section that decrements the inode's
   'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and lock the whole range (0 to (u64)-1) while it
computes the number of blocks used. But that would mean blocking
stat(2), which is a heavily used syscall and expected to be fast, waiting
for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
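A minimal sketch of the update pattern from point 2, assuming a small
helper along these lines (names illustrative):

  /*
   * Apply the extent swap to the inode's byte count in one critical
   * section, under the same spinlock the stat(2) path uses.
   */
  spin_lock(&inode->lock);
  inode_sub_bytes(&inode->vfs_inode, dropped_bytes);  /* extents removed */
  inode_add_bytes(&inode->vfs_inode, added_bytes);    /* extents inserted */
  spin_unlock(&inode->lock);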
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When defragmenting we skip ranges that have holes or inline extents, so that
we don't do unnecessary IO and waste space. We do this check when calling
should_defrag_range() at btrfs_defrag_file(). However we do it without
holding the inode's lock. The reason we do it like this is to avoid
blocking other tasks for too long, tasks that possibly want to operate on
other file ranges, since after the call to should_defrag_range() and before
locking the inode, we trigger a synchronous page cache readahead. However
before we were able to lock the inode, some other task might have punched
a hole in our range, or we may now have an inline extent there, in which
case we should not set the range for defrag anymore, since that would cause
unnecessary IO and make us waste space (i.e. allocating extents to contain
zeros for a hole).
So after we locked the inode and the range in the iotree, check again if
we have holes or an inline extent, and if we do, just skip the range.
I hit this while testing my next patch that fixes races when updating an
inode's number of bytes (subject "btrfs: update the number of bytes used
by an inode atomically"), which depends on this change in order to work
correctly. Alternatively I could rework that other patch to detect holes
and flag their range with the 'new delalloc' bit, but this itself fixes
an efficiency problem due to a race that, from a functional point of view,
is not harmful (it could be triggered with btrfs/062 from fstests).
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many arguments for __btrfs_drop_extents() and its wrapper
btrfs_drop_extents(), which makes it hard to add more arguments to it and
requires changing every caller. I have added a couple myself back in 2014
commit 1acae57b16 ("Btrfs: faster file extent item replace operations")
and therefore know firsthand that it is a bit cumbersome to add additional
arguments to these functions.
Since I will need to add more arguments in a subsequent bug fix, this
change is preparatory work and adds a data structure that holds all the
arguments, for both input and output, that are passed to this function,
with some comments in the structure's definition mentioning what each
field is and how it relates to other fields.
Callers of this function need only to zero out the contents of the
structure and set up the fields they need. This also removes the
need to have both __btrfs_drop_extents() and btrfs_drop_extents(), so
now we have a single function named btrfs_drop_extents() that takes a
pointer to this new data structure (struct btrfs_drop_extents_args).
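Callers then look roughly like this (the field names are a plausible
guess based on the description above, not a verbatim copy of the patch):

  struct btrfs_drop_extents_args drop_args = { 0 };

  drop_args.start = file_offset;            /* range to drop */
  drop_args.end = file_offset + len;
  drop_args.drop_cache = true;              /* hypothetical input field */
  ret = btrfs_drop_extents(trans, root, inode, &drop_args);
  if (ret)
          goto out;
  /* Output fields (e.g. bytes actually dropped) come back in drop_args. */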
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both Filipe and Fedora QA recently hit the following lockdep splat:
WARNING: possible recursive locking detected
5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1 Not tainted
--------------------------------------------
rsync/2610 is trying to acquire lock:
ffff89617ed48f20 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
but task is already holding lock:
ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0
       ----
  lock(&eb->lock);
  lock(&eb->lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
2 locks held by rsync/2610:
#0: ffff896107212b90 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: walk_component+0x10c/0x190
#1: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
stack backtrace:
CPU: 1 PID: 2610 Comm: rsync Not tainted 5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x8b/0xb0
__lock_acquire.cold+0x12d/0x2a4
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
lock_acquire+0xc8/0x400
? btrfs_tree_read_lock_atomic+0x34/0x140
? read_block_for_search.isra.0+0xdd/0x320
_raw_read_lock+0x3d/0xa0
? btrfs_tree_read_lock_atomic+0x34/0x140
btrfs_tree_read_lock_atomic+0x34/0x140
btrfs_search_slot+0x616/0x9a0
btrfs_lookup_dir_item+0x6c/0xb0
btrfs_lookup_dentry+0xa8/0x520
? lockdep_init_map_waits+0x4c/0x210
btrfs_lookup+0xe/0x30
__lookup_slow+0x10f/0x1e0
walk_component+0x11b/0x190
path_lookupat+0x72/0x1c0
filename_lookup+0x97/0x180
? strncpy_from_user+0x96/0x1e0
? getname_flags.part.0+0x45/0x1a0
vfs_statx+0x64/0x100
? lockdep_hardirqs_on_prepare+0xff/0x180
? _raw_spin_unlock_irqrestore+0x41/0x50
__do_sys_newlstat+0x26/0x40
? lockdep_hardirqs_on_prepare+0xff/0x180
? syscall_enter_from_user_mode+0x27/0x80
? syscall_enter_from_user_mode+0x27/0x80
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
I have also seen a report of lockdep complaining about the lock class
that was looked up being the same as the lock class on the lock we were
using, but I can't find the report.
These are problems that occur because we do not have the lockdep class
set on the extent buffer until _after_ we read the eb in properly. This
is problematic for concurrent readers, because we will create the extent
buffer, lock it, and then attempt to read the extent buffer.
If a second thread comes in and tries to do a search down the same path
they'll get the above lockdep splat because the class isn't set properly
on the extent buffer.
There was a good reason for this, we generally didn't know the real
owner of the eb until we read it, specifically in refcounted roots.
However now all refcounted roots have the same class name, so we no
longer need to worry about this. For non-refcounted trees we know
which root we're on based on the parent.
Fix this by setting the lockdep class on the eb at creation time instead
of read time. This will fix the splat and the weirdness where the class
changes in the middle of locking the block.
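Conceptually the ordering change looks like this (a sketch; the class
may well be set inside the allocation helper itself):

  /*
   * Before: lockdep class set only after read_extent_buffer_pages()
   * completed. After: set it as soon as the eb exists, before any
   * concurrent reader can lock it.
   */
  eb = alloc_extent_buffer(fs_info, bytenr, owner_root, level);
  if (IS_ERR(eb))
          return eb;
  btrfs_set_buffer_lockdep_class(owner_root, eb, level);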
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we've plumbed all of the callers to have the owner root and the
level, plumb it down into alloc_extent_buffer().
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The readahead infrastructure does raw reads of extent buffers, but we're
going to need to know their owner and level in order to set the lockdep
key properly, so plumb in the infrastructure that we'll need to have
this information when we start allocating extent buffers.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to properly set the lockdep class of a newly allocated block we
need to know the owner of the block. For non-refcounted trees this is
straightforward, we always know in advance what tree we're reading from.
For refcounted trees we don't necessarily know, however all refcounted
trees share the same lockdep class name, tree-<level>.
Fix all the callers of read_tree_block() to pass in the root objectid
we're using. In places like relocation and backref we could probably
unconditionally use 0, but just in case use the root when we have it,
otherwise use 0 in the cases we don't have the root as it's going to be
a refcounted tree anyway.
This is a preparation patch for further changes.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're open-coding btrfs_read_node_slot() here, replace with the helper.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're open-coding btrfs_read_node_slot() here, replace with the helper.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're open-coding btrfs_read_node_slot() here, replace with the helper.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're open-coding btrfs_read_node_slot() here, replace with the helper.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're open-coding btrfs_read_node_slot() here, replace with the helper.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're open coding btrfs_read_node_slot in do_relocation, replace this
with the proper helper.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We do not need to call read_tree_block() here, simply use the
btrfs_read_node_slot helper.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this open-coded nightmare in btrfs_realloc_node that does
the same thing that the normal read path does, which is to see if we
have the eb in memory already, and if not read it, and verify the eb is
uptodate. Delete this open coding and simply use btrfs_read_node_slot.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to pass around more information when we allocate extent
buffers, in order to make that cleaner how we do readahead. Most of the
callers have the parent node that we're getting our blockptr from, with
the sole exception of relocation which simply has the bytenr it wants to
read.
Add a helper that takes the current arguments that we need (bytenr and
gen), and add another helper for simply reading the slot out of a node.
In followup patches the helper that takes all the extra arguments will
be expanded, and the simpler helper won't need to have its arguments
adjusted.
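Sketched usage of the two helpers; the names are my reading of the shape
described above and may not match the patch exactly:

  /* Full-argument form for relocation, which only has bytenr + gen. */
  btrfs_readahead_tree_block(fs_info, bytenr, gen);

  /* Slot form: bytenr and gen are derived from the parent node. */
  btrfs_readahead_node_child(node, slot);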
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this weird problem where our lockdep class is set after we
read a tree block, which can race with concurrent readers and result in
erroneous lockdep errors. We want to set the lockdep class at
allocation time if possible, but in certain cases we may not have the
actual root owner, such as with relocation or any backref lookups. This
is only really a problem for reference counted trees, because all other
trees have their root reference set in their extent reference. Remove
the fs tree specific lock class. We need to still keep the reloc tree
one, it's still reference counted, because replace_path will lock the
reloc tree and the destination tree, and if they're both set to
tree-<level> we'll have issues.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After sysfs updates discard's iops_limit or kbps_limit, it also needs to
adjust the current timer through rescheduling, otherwise the discard work
may wait a long time for the previous timer to expire or be bumped by
someone else.
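The sysfs store path then ends with an explicit reschedule, roughly like
this (a sketch; the override semantics are refined by the next patch
below):

  WRITE_ONCE(discard_ctl->iops_limit, val);
  /*
   * Re-arm the timer so the new limit takes effect now instead of
   * only after the previously computed delay expires.
   */
  btrfs_discard_calc_delay(discard_ctl);
  btrfs_discard_schedule_work(discard_ctl, true);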
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If btrfs_discard_schedule_work() is called with override=true, it sets
the delay anew regardless of how much time is left until the timer should
have fired. If delays are long (that can happen, for example, with low
kbps_limit), they might get constantly overridden without having a
chance to run the discard work.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Most delay calculations are done in ns or ms, so store
discard_ctl->delay in ms and convert the final delay to jiffies only at
the end.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using iops_limit only for cutting off extremes, calculate the
discard delay directly from it, so it closely follows iops_limit and
doesn't under-discard even though quotas are not saturated.
The iops limit could be hit more often in some cases and could increase
the discard rate.
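Schematically, combined with keeping the delay in milliseconds until the
end (see the patch above), the calculation becomes something like this
(bounds and field names are illustrative):

  u32 delay;

  /* Derive the delay straight from iops_limit instead of only
   * using the limit to clamp extremes. */
  delay = MSEC_PER_SEC / iops_limit;
  delay = clamp(delay, min_delay_ms, max_delay_ms);

  /* Convert to jiffies only when arming the timer. */
  mod_delayed_work(discard_ctl->discard_workers, &discard_ctl->work,
                   msecs_to_jiffies(delay));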
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function scrub_find_csum() locates the csum for bytenr @logical
from sctx->csum_list.
However it lacks comments to explain things like how the
csum_list is organized and why we need to drop csum ranges which are
before us.
Refactor the function by:
- Add more comments explaining the behavior
- Add comment explaining why we need to drop the csum range
- Put the csum copy in the main loop
This is mostly for the incoming patches to make scrub_find_csum() able
to find multiple checksums.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The @force parameter for scrub_pages() indicates whether we want to
force bio submission. Currently it's only used for the super block,
and since that can be easily determined from @flags, we can remove the
parameter.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several call sites where we declare something like
"struct scrub_page *page".
This is confusing, as we also use regular pages in this code;
rename it to 'spage' where applicable.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently csum_dirty_buffer() uses the page to grab the extent buffer,
but that only works for the sector size == PAGE_SIZE case.
For subpage we need page + page_offset to grab the extent buffer.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_validate_metadata_buffer() only needs to handle one
extent buffer as currently one page maps to at most one extent buffer.
For incoming subpage support, we need to extend the support where one
page could contain multiple extent buffers.
Split the function so we can call validate_extent_buffer on extent
buffers independently.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For subpage size support, metadata blocks of nodesize are smaller than
one page and this needs to be handled when calculating the checksum.
The checksummed start and length need to be adjusted but only for the
first page:
- start is simply offset in the page
- length is nodesize (subpage) or PAGE_SIZE for all other cases
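In code, the first-page adjustment is roughly the following (a sketch
under the assumptions above; csum_partial_page() is a hypothetical
stand-in for the hashing call):

  u32 start = offset_in_page(buf->start);
  u32 len;

  if (fs_info->nodesize < PAGE_SIZE)      /* subpage metadata block */
          len = fs_info->nodesize;
  else                                    /* block begins page-aligned */
          len = PAGE_SIZE;
  csum_partial_page(page_address(buf->pages[0]) + start, len);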
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit f28491e0a6 ("Btrfs: move the extent buffer radix tree into
the fs_info"), fs_info can be grabbed from extent_buffer directly.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For subpage sector size support, one page can contain multiple tree
blocks. The entries cannot be based on page size, and the index must be
derived from the sectorsize. There is no change for the page size ==
sector size case.
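The indexing change amounts to this (sketch):

  /*
   * Index extent buffers by sector number instead of page number,
   * so several tree blocks can share one page.
   */
  index = eb->start >> fs_info->sectorsize_bits;  /* was: >> PAGE_SHIFT */
  radix_tree_lookup(&fs_info->buffer_radix, index);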
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When calling attach_extent_buffer_page(), either we're attaching
anonymous pages, called from btrfs_clone_extent_buffer(),
or we're attaching btree inode pages, called from alloc_extent_buffer().
For the latter case, we should hold page->mapping->private_lock to avoid
parallel changes to page->private.
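A sketch of the distinction inside attach_extent_buffer_page() (the
anonymous-page case has no mapping, so it cannot take that lock):

  if (page->mapping)
          lockdep_assert_held(&page->mapping->private_lock);

  if (!PagePrivate(page))
          attach_page_private(page, eb);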
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While documenting the usage of the commit_root_sem, I noticed that we do
not actually take the commit_root_sem in the case of the free space
cache. This is problematic because we're supposed to hold that sem
while we're reading the commit roots, which is what we do for the free
space cache.
The reason I did it inline when I originally wrote the code was because
there's the case of unpinning, where we need to make sure that the free
space cache is loaded if we're going to use it. But
we can accomplish the same thing by simply waiting for the cache to be
loaded.
Rework this code to load the free space cache asynchronously. This
allows us to greatly clean up the caching code because now it's all
shared by the various caching methods. We are also now in a position to
have the commit_root semaphore held while we're loading the free space
cache. And finally our modification of ->last_byte_to_unpin is removed
because it can be handled in the proper way on commit.
Some care must be taken when replaying the log, when we expect that the
free space cache will be read entirely before we start excluding space
to replay. This could lead to overwriting space during replay.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Historically we've allowed recursive locking specifically for the free
space inode. This is because we are only doing reads and know that it's
safe. However we don't actually need this feature, we can get away with
reading the commit root for the extents. In fact if we want to allow
asynchronous loading of the free space cache we have to use the commit
root, otherwise we will deadlock.
Switch to using the commit root for the file extents. These are only
read at load time, and are replaced as soon as we start writing the
cache out to disk. The cache is never read again, so this is
legitimate. This matches what we do for the inode itself, as we read
that from the commit root as well.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The free space cache has been special in that we would load it right
away instead of farming the work off to a worker thread. This resulted
in some weirdness that had to be taken into account, namely that if we
ever found a block group being cached the fast way we
had to wait for it to finish, because we could get the cache before it
had been validated and we may throw the cache away.
To handle this particular case instead create a temporary
btrfs_free_space_ctl to load the free space cache into. Then once we've
validated that it makes sense, copy its contents into the actual
block_group->free_space_ctl. This allows us to avoid the problems of
needing to wait for the caching to complete, we can clean up the discard
extent handling stuff in __load_free_space_cache, and we no longer need
to do the merge_space_tree() because the space is added one by one into
the real free_space_ctl. This will allow further reworks of how we
handle loading the free space cache.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This passes in the block_group and the free_space_ctl, but we can get
this from the block group itself. Part of this is because we call it
from __load_free_space_cache, which can be called for the inode cache as
well.
Move that call into the block group specific load section, wrap it in
the right lock that we need for the assertion (but otherwise this is
safe without the lock because this happens in single-thread context).
Fix up the arguments to only take the block group. Add a lockdep_assert
as well for good measure to make sure we don't mess up the locking
again.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently unpin_extent_range happens in the transaction commit context,
so we are protected from ->last_byte_to_unpin changing while we're
unpinning, because any new transactions would have to wait for us to
complete before modifying ->last_byte_to_unpin.
However in the future we may want to change how this works, for instance
with async unpinning or other such TODO items. To prepare for that
future explicitly protect ->last_byte_to_unpin with the commit_root_sem
so we are sure it won't change while we're doing our work.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While writing an explanation for the need of the commit_root_sem for
btrfs_prepare_extent_commit, I realized we have a slight hole that could
result in leaked space if we have to do the old style caching. Consider
the following scenario
commit root
+----+----+----+----+----+----+----+
|\\\\|    |\\\\|\\\\|    |\\\\|\\\\|
+----+----+----+----+----+----+----+
0    1    2    3    4    5    6    7

new commit root
+----+----+----+----+----+----+----+
|    |    |    |\\\\|    |    |\\\\|
+----+----+----+----+----+----+----+
0    1    2    3    4    5    6    7
Prior to this patch, we run btrfs_prepare_extent_commit, which updates
the last_byte_to_unpin, and then we subsequently run
switch_commit_roots. In this example lets assume that
caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
means that cache->last_byte_to_unpin == 1. Then we go and do the
switch_commit_roots(), but in the meantime the caching thread has made
some more progress, because we drop the commit_root_sem and re-acquired
it. Now caching_ctl->progress == 3. We swap out the commit root and
carry on to unpin.
The race can happen like:
1) The caching thread was running using the old commit root when it
found the extent for [2, 3);
2) Then it released the commit_root_sem because it was in the last
item of a leaf and the semaphore was contended, and set ->progress
to 3 (value of 'last'), as the last extent item in the current leaf
was for the extent for range [2, 3);
3) Next time it gets the commit_root_sem, will start using the new
commit root and search for a key with offset 3, so it never finds
the hole for [2, 3).
So the caching thread never saw [2, 3) as free space in any of the
commit roots, and by the time finish_extent_commit() was called for
the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
subrange [0, 1) to the free space cache, skipping [2, 3).
In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
but do not unpin [2,3). However because caching_ctl->progress == 3 we
do not see the newly freed section of [2,3), and thus do not add it to
our free space cache. This results in us missing a chunk of free space
in memory (on disk too, unless we have a power failure before writing
the free space cache to disk).
Fix this by making sure the ->last_byte_to_unpin is set at the same time
that we swap the commit roots; this ensures that we will always be
consistent.
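Schematically, the fix moves the update into the same critical section
that swaps the roots (a sketch based on the fields discussed above):

  down_write(&fs_info->commit_root_sem);
  list_for_each_entry(caching_ctl, &fs_info->caching_block_groups, list)
          caching_ctl->block_group->last_byte_to_unpin =
                  caching_ctl->progress;
  /* ... swap the commit roots here, before dropping the semaphore ... */
  up_write(&fs_info->commit_root_sem);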
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ update changelog with Filipe's review comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
While fixing up our ->last_byte_to_unpin locking I noticed that we will
shorten len based on ->last_byte_to_unpin if we're caching when we're
adding back the free space. This is correct for the free space, as we
cannot unpin more than ->last_byte_to_unpin. However we use len to
adjust the ->bytes_pinned counters and such, which need to track the
actual pinned usage. This could result in
WARN_ON(space_info->bytes_pinned) triggering at unmount time.
Fix this by using a local variable for the amount to add to the free
space cache, and leave len untouched in this case.
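The shape of the fix, as a sketch (the local variable name is
illustrative):

  u64 to_add = len;

  /* Only return the part the caching thread won't discover itself. */
  if (start + len > cache->last_byte_to_unpin)
          to_add = cache->last_byte_to_unpin - start;
  btrfs_add_free_space(cache, start, to_add);
  /* 'len' stays untouched for the ->bytes_pinned accounting below. */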
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We no longer distinguish between blocking and spinning, so rip out all
this code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're using a rw_semaphore we no longer need to indicate if a
lock is blocking or not, nor do we need to flip the entire path from
blocking to spinning. Remove these helpers and all the places they are
called.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The context structure unnecessarily stores a copy of the checksum size,
which can now be easily obtained from fs_info.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The state structure unnecessarily stores a copy of the checksum size,
which can now be easily obtained from fs_info.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove local variable that is then used just once and replace it with
fs_info::csum_size.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info value is 32bit; switch the local u16 variables as well. This
leads to better generated assembly (movzwl vs movl).
This simple change will shave some bytes on x86_64 and release config:

   text    data     bss     dec     hex filename
1090000   17980   14912 1122892  11224c pre/btrfs.ko
1089794   17980   14912 1122686  11217e post/btrfs.ko

DELTA: -206
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_get_16 shows up in the system performance profiles (it's the helper
to read 16bit values from on-disk structures). This is partially because
of the checksum size that's frequently read along with data reads/writes;
other u16 uses are for item sizes or directory entries.
Replace all calls to btrfs_super_csum_size by the cached value from
fs_info.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_csum_bytes_to_leaves shows up in system profiles, which makes it a
candidate for optimizations. After the 64bit division has been replaced
by shift, there's still a calculation done each time the function is
called: checksums per leaf.
As this is a constant value for the entire filesystem lifetime, we
can calculate it once at mount time and reuse it. This also allows
reducing the division to 64bit/32bit, as we know the constant will always
fit the 32bit type.
Replace the open-coded rounding up with a macro that internally handles
the 64bit division and, as it's now a short function, make it static
inline (slight code increase, slight stack usage reduction).
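Roughly, the precomputation and the resulting helper look like this
(a sketch; the exact macro and field names may differ from the patch):

  /* Once, at mount time: */
  fs_info->csums_per_leaf = BTRFS_MAX_ITEM_SIZE(fs_info) / fs_info->csum_size;

  /* Per call, now a cheap static inline: */
  static inline u64 btrfs_csum_bytes_to_leaves(
                  const struct btrfs_fs_info *fs_info, u64 csum_bytes)
  {
          const u64 num_csums = csum_bytes >> fs_info->sectorsize_bits;

          return DIV_ROUND_UP_ULL(num_csums, fs_info->csums_per_leaf);
  }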
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In many places we need the checksum size and it is inefficient to read
it from the raw superblock. Store the value in fs_info; actual use
will be in followup patches. The size is u32 as it allows generating
better assembly than with u16.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of super_block::s_blocksize_bits is the same as
fs_info::sectorsize_bits, but we don't need to do the extra dereferences
in many functions and storing the bits as u32 (in fs_info) generates
shorter assembly.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change free_space_bitmap_size to take btrfs_fs_info so we can get the
sectorsize_bits to do calculations.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We do a lot of calculations where we divide or multiply by sectorsize.
We also know and make sure that sectorsize is a power of two, so all
divisions can be turned into shifts, avoiding e.g. expensive u64/u32
divisions.
The type is u32 as it's more register friendly on x86_64 compared to u8
and the resulting assembly is smaller (movzbl vs movl).
There's also superblock s_blocksize_bits but it's usually one more
pointer dereference farther than fs_info.
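Typical conversions this enables (sketch):

  /* was: nr = size / fs_info->sectorsize;    (u64/u32 division) */
  nr = size >> fs_info->sectorsize_bits;

  /* was: off = start % fs_info->sectorsize; */
  off = start & (fs_info->sectorsize - 1);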
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable @page_size in submit_extent_page() is not related to page
size.
It can already be smaller than PAGE_SIZE, so rename it to io_size to
reduce confusion; this is especially important for later subpage
support.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're reading a partial page, btrfs will warn about this, as reads and
writes are always done in sector size, which currently equals page size.
But for the upcoming subpage read-only support, our data read is only
aligned to sectorsize, which can be smaller than page size.
Thus here we change the warning condition to check it against
sectorsize; the behavior is not changed for the regular sectorsize ==
PAGE_SIZE case, and no error is reported for subpage reads.
Also, pass the proper start/end with bv_offset for check_data_csum() to
handle.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function process_pages_contig() does not only handle page locking but
also other operations. Rename the local variable pages_locked to
pages_processed to reduce confusion.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For check_data_csum(), the page we're using comes directly from the inode
mapping, thus it has a valid page_offset().
We can use (page_offset() + pg_off) to replace the @start parameter
completely, while @len should always be sectorsize.
Since we're here, also add some comments, as there is quite some
confusion around words like start/offset without explaining whether they
mean a file offset or a logical bytenr.
This should not affect the existing behavior, as for the current
sectorsize == PAGE_SIZE case, @pgoff should always be 0, and len is
always PAGE_SIZE (or sectorsize from the dio read path).
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of btrfs_wq_submit_bio() pass struct inode as @private_data,
so there is no need for it to be (void *); replace it with "struct inode
*inode".
Since we can extract fs_info from struct inode, also remove the @fs_info
parameter.
While we're here, also replace all the (void *private_data) occurrences
with (struct inode *inode).
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The @failed_start parameter is only paired with @exclusive_bits, and
those parameters are only used for the EXTENT_LOCKED bit, which has its
own wrapper, lock_extent_bits().
Thus for regular set_extent_bit() calls, the failed_start makes no
sense; just sink the parameter.
Also, since @failed_start and @exclusive_bits are used in pairs, add
an assert to make it obvious.
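The pairing assertion then reads, more or less:

  /* Either both are supplied (the EXTENT_LOCKED path) or neither is. */
  if (exclusive_bits)
          ASSERT(failed_start);
  else
          ASSERT(failed_start == NULL);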
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The pitfall here is that if the parameter @bits has multiple bits set, we
will return the first range that has just one of the specified bits
set.
This is a little tricky if we want an exact match. Anyway, update the
comment to make that clear.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The documented return value of that function is completely wrong.
That function only returns 0 if the extent buffer doesn't need to be
submitted. The "ret = 1" and "ret = 0" are determined by the return
value of "test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)".
And if we get ret == 1, it's because the extent buffer is dirty, and we
set its status to EXTENT_BUFFER_WRITE_BACK, and continue to page
locking.
While if we get ret == 0, it means the extent is not dirty from the
beginning, so we don't need to write it back.
The caller also follows this: in btree_write_cache_pages(), if
lock_extent_buffer_for_io() returns 0, we just skip the extent buffer
completely.
So the comment is completely wrong.
Since we're here, also change the description a little. The write bio
flushing won't be visible to the caller, thus it's not a major feature.
In the main description, only describe the locking part to make the
point more clear.
For reference, added in commit 2e3c25136a ("btrfs: extent_io: add
proper error handling to lock_extent_buffer_for_io()")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Long ago the explicit casts were necessary for u64, but we don't need
them anymore. Remove casts where the type matches, leaving only cases that
cast sector_t or loff_t.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike all the other integer members of root_item, the drop_level member
is used directly. Add the accessor definition and use it everywhere. The
type is u8 so there's no conversion necessary and the helpers are
properly inlined; this is for consistency.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For consistency use the available helpers to set flags and limit.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's one raw use of le->cpu conversion but we have a helper to do
that for us, so use it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have helpers to access the on-disk item members, use that for
root_item::ctransid instead of raw le64_to_cpu.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The names in btrfs_lockdep_keysets are generated from a simple pattern
using snprintf but we can generate them directly with some macro magic
and remove the helpers.
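A minimal sketch of the idea (illustrative; the in-tree macro names may
differ): build the per-level name strings with the preprocessor instead
of snprintf() at init time.

  #define DEFINE_LEVEL(stem, level)  #stem "-" #level,
  #define DEFINE_NAME(stem) \
          DEFINE_LEVEL(stem, 0) DEFINE_LEVEL(stem, 1) \
          DEFINE_LEVEL(stem, 2) DEFINE_LEVEL(stem, 3) \
          DEFINE_LEVEL(stem, 4) DEFINE_LEVEL(stem, 5) \
          DEFINE_LEVEL(stem, 6) DEFINE_LEVEL(stem, 7)

  /* expands to "root-0", "root-1", ..., "root-7" at compile time */
  static const char *root_names_sketch[] = { DEFINE_NAME(root) };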
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS_MAX_LEVEL is 8 and the keyset table is supposed to have a key for
each level, but we'll never have more than 8 levels. The values passed
to btrfs_set_buffer_lockdep_class are always derived from a valid extent
buffer. Set the array sizes to the right value.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This effectively reverts 09745ff88d93 ("btrfs: dio iomap DSYNC
workaround") now that the iomap API has been updated to allow
iomap_dio_complete() not to be called under i_rwsem anymore.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If direct writes are called with O_DIRECT | O_DSYNC, it will result in a
deadlock because iomap_dio_rw() is called under i_rwsem which calls:
  iomap_dio_complete()
    generic_write_sync()
      btrfs_sync_file()
btrfs_sync_file() requires i_rwsem, so call __iomap_dio_rw() with the
i_rwsem locked, and call iomap_dio_complete() after unlocking i_rwsem.
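A rough sketch of the resulting pattern (the ops names and the
__iomap_dio_rw() argument list are assumptions modelled on the iomap API
of this era, not the verbatim btrfs code):

  static ssize_t btrfs_direct_write_sketch(struct kiocb *iocb,
                                           struct iov_iter *from)
  {
          struct inode *inode = file_inode(iocb->ki_filp);
          struct iomap_dio *dio;

          inode_lock(inode);
          /* submit the DIO, but do not complete it under i_rwsem */
          dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops,
                               &btrfs_dio_ops, is_sync_kiocb(iocb));
          inode_unlock(inode);

          if (IS_ERR_OR_NULL(dio))
                  return PTR_ERR_OR_ZERO(dio);

          /*
           * Completion may run generic_write_sync() -> btrfs_sync_file(),
           * which takes i_rwsem itself, so it must happen unlocked.
           */
          return iomap_dio_complete(dio);
  }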
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The inode dio_sem can be eliminated because all DIO synchronization is
now performed through inode->i_rwsem that provides the same guarantees.
This reduces btrfs_inode size by 40 bytes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Direct writes within EOF can safely be performed with the inode shared
lock, improving parallelization with other direct writes or reads,
because EOF is not changed and there is no race with truncate().
Direct reads are already performed under shared inode lock.
This patch is precursor to removing btrfs_inode->dio_sem.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Push inode locking and unlocking closer to where we perform the I/O. For
this we need to move the write checks inside the respective functions as
well.
pos is evaluated after generic_write_checks because O_APPEND can change
iocb->ki_pos.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_inode_lock/unlock() are wrappers around inode locks, separating
the type of lock and actual locking.
- 0 - default, exclusive lock
- BTRFS_ILOCK_SHARED - for shared locks, for possible parallel DIO
- BTRFS_ILOCK_TRY - for the RWF_NOWAIT sequence
The bits SHARED and TRY can be combined together.
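As a rough illustration, the wrapper could map those flags onto the VFS
inode lock API like this (a sketch assuming the semantics above; the
flag values and body are not necessarily the exact in-tree code):

  #define BTRFS_ILOCK_SHARED  (1U << 0)
  #define BTRFS_ILOCK_TRY     (1U << 1)

  static int btrfs_inode_lock_sketch(struct inode *inode,
                                     unsigned int ilock_flags)
  {
          if (ilock_flags & BTRFS_ILOCK_SHARED) {
                  if (ilock_flags & BTRFS_ILOCK_TRY) {
                          if (!inode_trylock_shared(inode))
                                  return -EAGAIN;  /* RWF_NOWAIT path */
                          return 0;
                  }
                  inode_lock_shared(inode);
          } else {
                  if (ilock_flags & BTRFS_ILOCK_TRY) {
                          if (!inode_trylock(inode))
                                  return -EAGAIN;
                          return 0;
                  }
                  inode_lock(inode);
          }
          return 0;
  }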
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_write_check() checks the write parameters in one place before
beginning a write. This does away with an inode_unlock() after every
check. In later patches, it will help push inode_lock/unlock() into the
buffered and direct write functions respectively.
generic_write_checks needs to be called first as it could truncate the
iov_iter, and its return value is used as the count.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info::fs_state is a filesystem-wide bit check, as opposed to a
per-inode one, and can be performed before we begin with the write
checks. This eliminates the inode lock/unlock in case the error bit is
set.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While we do this, correct the call to pagecache_isize_extended:
- pagecache_isize_extended needs to be called with the start of the write
as opposed to i_size
- we don't need to check the range before the call, this is done inside
the function
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The read and write DIO don't have anything in common except for the
call to iomap_dio_rw. Extract the write call into a new function to get
rid of conditional statements for direct write.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Add
/sys/fs/btrfs/UUID/read_policy
attribute so that the read policy for raid1, raid1c34 and raid10 can
be tuned.
When this attribute is read, it will show all available policies, with
the active policy in [ ]. The read_policy attribute can be written using
one of the items listed there.
For example:
$ cat /sys/fs/btrfs/UUID/read_policy
[pid]
$ echo pid > /sys/fs/btrfs/UUID/read_policy
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As of now, we use the pid method to read striped mirrored data, which
means the process id determines the stripe id to read. This type of
routing typically helps in a system with many small independent
processes trying to read random data. On the other hand, the pid based
read IO policy is inefficient when a single process is trying to read a
large file, as the overall disk bandwidth remains underutilized.
So this patch introduces a read policy framework so that we can add more
read policies, such as IO routing based on the device's wait-queue, a
manual policy when we have a read-preferred device, or a policy based on
the target storage's caching.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a generic helper to match the string in a given buffer, and ignore
the leading and trailing whitespace.
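A minimal sketch of what such a helper could look like (the in-tree name
and exact behavior may differ):

  /* true if @buf equals @string, modulo surrounding whitespace */
  static bool strmatch_sketch(const char *buf, const char *string)
  {
          size_t len = strlen(string);

          buf = skip_spaces(buf);          /* drop leading whitespace */
          if (strncmp(buf, string, len))
                  return false;
          buf = skip_spaces(buf + len);    /* allow trailing whitespace */
          return *buf == '\0';
  }

This is handy for sysfs store methods, where the incoming buffer usually
carries a trailing newline.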
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename variables, add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
We do not need anymore to start writeback for delalloc of roots that are
being snapshotted and wait for it to complete. This was done in commit
609e804d77 ("Btrfs: fix file corruption after snapshotting due to mix
of buffered/DIO writes") to fix a type of file corruption where files in a
snapshot end up having their i_size updated in a non-ordered way, leaving
implicit file holes, when buffered IO writes that increase a file's size
are followed by direct IO writes that also increase the file's size.
This is not needed anymore because we now have a more generic mechanism
to prevent a non-ordered i_size update since commit 9ddc959e80
("btrfs: use the file extent tree infrastructure"), which addresses this
scenario involving snapshots as well.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Historically we've implemented our own locking because we wanted to be
able to selectively spin or sleep based on what we were doing in the
tree. For instance, if all of our nodes were in cache then there's
rarely a reason to need to sleep waiting for node locks, as they'll
likely become available soon. At the time this code was written the
rw_semaphore didn't do adaptive spinning, and thus was orders of
magnitude slower than our home grown locking.
However now the opposite is the case. There are a few problems with how
we implement blocking locks, namely that we use a normal waitqueue and
simply wake everybody up in reverse sleep order. This leads to some
suboptimal performance behavior, and a lot of context switches in highly
contended cases. The rw_semaphores actually do this properly, and also
have adaptive spinning that works relatively well.
The locking code is also a bit of a bear to understand, and we lose the
benefit of lockdep for the most part because the blocking states of the
lock are simply ad-hoc and not mapped into lockdep.
So rework the locking code to drop all of this custom locking stuff, and
simply use a rw_semaphore for everything. This makes the locking much
simpler for everything, as we can now drop a lot of cruft and blocking
transitions. The performance numbers vary depending on the workload,
because generally speaking there doesn't tend to be a lot of contention
on the btree. However, on my test system which is an 80 core single
socket system with 256GiB of RAM and a 2TiB NVMe drive I get the
following results (with all debug options off):
dbench 200 baseline
Throughput 216.056 MB/sec 200 clients 200 procs max_latency=1471.197 ms
dbench 200 with patch
Throughput 737.188 MB/sec 200 clients 200 procs max_latency=714.346 ms
Previously we also used fs_mark to test this sort of contention, and
those results are far less impressive, mostly because there aren't
enough tasks to really stress the locking:
fs_mark -d /d[0-15] -S 0 -L 20 -n 100000 -s 0 -t 16
baseline
Average Files/sec: 160166.7
p50 Files/sec: 165832
p90 Files/sec: 123886
p99 Files/sec: 123495
real 3m26.527s
user 2m19.223s
sys 48m21.856s
patched
Average Files/sec: 164135.7
p50 Files/sec: 171095
p90 Files/sec: 122889
p99 Files/sec: 113819
real 3m29.660s
user 2m19.990s
sys 44m12.259s
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just open code it in its sole caller and remove a level of indirection.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have the building blocks for some better recovery options
with corrupted file systems, add a rescue=all option to enable all of
the relevant rescue options. This will allow distros to simply default
to rescue=all for the "oh dear lord the world's on fire" recovery
without needing to know all the different options that we have and may
add in the future.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are cases where you can end up with bad data csums because of
misbehaving applications. This happens when an application modifies a
buffer in-flight when doing an O_DIRECT write. In order to recover the
file we need a way to turn off data checksums so you can copy the file
off, and then you can delete the file and restore it properly later.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the face of extent root corruption, or any other core fs wide root
corruption we will fail to mount the file system. This makes recovery
kind of a pain, because you need to fall back to userspace tools to
scrape off data. Instead provide a mechanism to gracefully handle bad
roots, so we can at least mount read-only and possibly recover data from
the file system.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The standalone option usebackuproot was intended as one-time use and it
was not necessary to keep it in the option list. Now that we're going to
have more rescue options, it's desirable to keep them intact as it could
be confusing why the option disappears.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ remove the btrfs_clear_opt part from open_ctree ]
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to have a lot of rescue options, add a helper to collapse
the /proc/mounts output to rescue=option1:option2:option3 format.
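A sketch of the collapsing logic (option names taken from the
surrounding patches; the helper body is illustrative, not the verbatim
patch):

  static void show_rescue_options_sketch(struct seq_file *seq,
                                         struct btrfs_fs_info *info)
  {
          bool first = true;

          if (btrfs_test_opt(info, USEBACKUPROOT)) {
                  seq_printf(seq, "%susebackuproot", first ? ",rescue=" : ":");
                  first = false;
          }
          if (btrfs_test_opt(info, IGNOREBADROOTS)) {
                  seq_printf(seq, "%signorebadroots", first ? ",rescue=" : ":");
                  first = false;
          }
          /* further rescue options follow the same pattern */
  }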
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to be adding a variety of different rescue options, so we
should advertise which ones we support to make user space's life easier
in the future.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we move to being able to handle NULL csum_roots it'll be cleaner to
just check in btrfs_lookup_bio_sums instead of at all of the caller
locations, so push the NODATASUM check into it as well so it's unified.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to be adding more options that require RDONLY, so add a
helper to do the check and error out if we don't have RDONLY set.
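A hedged sketch of such a helper (the name and message wording are
assumptions):

  static bool check_ro_option_sketch(struct btrfs_fs_info *fs_info,
                                     unsigned long opt, const char *opt_name)
  {
          if ((fs_info->mount_opt & opt) && !sb_rdonly(fs_info->sb)) {
                  btrfs_err(fs_info, "%s must be used with ro mount option",
                            opt_name);
                  return true;
          }
          return false;
  }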
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When scrubbing a stripe of a block group we always start readahead for
the checksums btree and wait for it to complete, however when the block
group is not a data block group (or a mixed block group) it is a waste
of time to do it, since there are no checksums for metadata extents in
that btree.
So skip that when the block group does not have the data flag set,
saving some time on memory allocations, on queueing a job in the
readahead work queue and waiting for it to complete, and potentially
avoiding some IO as well (when csum tree extents are not in memory
already).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we drop the last reference of a zone, we end up releasing it through
the callback reada_zone_release(), which deletes the zone from a device's
reada_zones radix tree. This tree is protected by the global readahead
lock at fs_info->reada_lock. Currently all places that are sure that they
are dropping the last reference on a zone, are calling kref_put() in a
critical section delimited by this lock, while all other places that are
sure they are not dropping the last reference, do not bother calling
kref_put() while holding that lock.
When working on the previous fix for hangs and use-after-frees in the
readahead code, my initial attempts were different and I actually ended
up having reada_zone_release() called when not holding the lock, which
resulted in weird and unexpected problems. So just add an assertion
there to detect such problems more quickly and make the dependency more
obvious.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Set the extent bits EXTENT_NORESERVE inside btrfs_dirty_pages() as
opposed to calling set_extent_bits again later.
Fold check for written length within the function.
Note: EXTENT_NORESERVE is set before unlocking extents.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
round_down looks prettier than the bit mask operations.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While using compression, a submitted bio is mapped with a compressed bio
which performs the read from disk, decompresses and returns uncompressed
data to the original bio. The original bio must reflect the uncompressed
size (iosize) of the I/O to be performed, or else the page just gets the
compressed I/O length of data (disk_io_size). The compressed bio checks
the extent map and gets the correct length while performing the I/O from
disk.
This came up in the subpage work, where only the compressed length of
the original bio was filled in the page. This worked correctly for
pagesize == sectorsize because both compressed and uncompressed data are
at pagesize boundaries, and would end up filling the requested page.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
write_bytes can change in btrfs_check_nocow_lock(). Calculate variables
such as num_pages and reserve_bytes once we are sure of the value of
write_bytes so there is no need to re-calculate.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If transaction_kthread is woken up before btrfs_fs_info::commit_interval
seconds have elapsed, it will sleep for a fixed period of 5 seconds.
This is not a problem per se but it is not accurate. Instead the code
should sleep for an interval which guarantees that on the next wakeup
commit_interval will have passed. Since time tracking is not precise,
subtract 1 second from the delta to ensure the delay we end up waiting
will be longer than the wake-up period.
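A fragment sketching that computation (variable names assumed from the
surrounding code):

  delta = ktime_get_seconds() - cur->start_time;
  if (delta < fs_info->commit_interval) {
          /* remaining interval plus 1s of slack for coarse timekeeping */
          delay = msecs_to_jiffies((fs_info->commit_interval - delta + 1) *
                                   MSEC_PER_SEC);
          goto sleep;
  }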
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename 'now' to 'delta' and store there the delta between transaction
start time and current time. This is in preparation for optimising the
sleep logic in the next patch. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value obtained from ktime_get_seconds() is guaranteed to be
monotonically increasing since it's taken from CLOCK_MONOTONIC. As
transaction_kthread obtains a reference to the currently running
transaction while holding btrfs_fs_info::trans_lock, it's guaranteed to:
a) see an initialized 'cur', whose start_time is guaranteed to be
smaller than 'now',
or
b) not obtain a 'cur' and simply go to sleep.
Given this, remove the unnecessary check; if the code ever saw
now < cur->start_time, that would imply there are far greater problems
on the machine.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify try_to_claim_pcluster() by directly using cmpxchg() here
(the retry loop caused more overhead). Also, move the chain loop
detection in and rename it to z_erofs_try_to_claim_pcluster().
Link: https://lore.kernel.org/r/20201208095834.3133565-3-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Previously, there was some concern about calling add_to_page_cache_lru()
with page->mapping == Z_EROFS_MAPPING_STAGING (!= NULL).
In contrast, page->private is used instead now, so partially revert
commit 5ddcee1f3a ("erofs: get rid of __stagingpage_alloc helper")
with some adaptation for simplicity.
Link: https://lore.kernel.org/r/20201208095834.3133565-2-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Previously, we played around with magical page->mapping for short-lived
temporary pages since we need to identify different types of pages in
the same pcluster but both invalidated and short-lived temporary pages
can have page->mapping == NULL. It was considered safe because those
temporary pages are all non-LRU / non-movable pages.
This patch uses a specific page->private value to identify short-lived
pages instead, so it won't rely on page->mapping anymore. Details are
described in "compress.h" as well.
Link: https://lore.kernel.org/r/20201208095834.3133565-1-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Since commit 4f761fa253 ("erofs: rename errln/infoln/debugln to
erofs_{err, info, dbg}") the defined macro EROFS_VERSION has no affect,
therefore removing it from the Makefile is a non-functional change.
Link: https://lore.kernel.org/r/20201030122839.25431-1-vladimir@tuxera.com
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Vladimir Zapolskiy <vladimir@tuxera.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
This patch adds max_io_bytes to limit bio size when f2fs tries to merge
consecutive IOs. This can give a testing point to split out bios and
check that end_io handles those bios correctly. This was used to capture
a recent bug in the decompression and fsverity flow.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After io_identity_cow() copies a work.identity, it wants to copy creds
to the newly allocated id, not the old one. Otherwise it's
akin to req->work.identity->creds = req->work.identity->creds.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The kernel provides easy to understand helpers to convert from human
understandable units to the kernel-friendly 'jiffies'. So let's use
those to make the code easier to understand. No functional changes.
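For illustration (assuming a seconds-based interval such as
fs_info->commit_interval), both lines below compute the same number of
jiffies, but the helper makes the unit explicit:

  unsigned long delay_raw    = fs_info->commit_interval * HZ;
  unsigned long delay_helper = msecs_to_jiffies(fs_info->commit_interval *
                                                MSEC_PER_SEC);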
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Matching the information available from the FS_INFO ioctl, add
generation to the per-filesystem sysfs directory as
/sys/fs/btrfs/UUID/generation, which can be used by scripts.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
'format_corename()' will split 'core_pattern' on spaces when it is in
pipe mode, and take helper_argv[0] as the path to the usermode
executable. It works fine in most cases.
However, if there is a space between '|' and '/file/path', such as
'| /usr/lib/systemd/systemd-coredump %P %u %g', then helper_argv[0] will
be parsed as '', and users will get a 'Core dump to | disabled'.
This is not friendly to users, as the pattern above was valid previously.
Fix this by ignoring the spaces between '|' and '/file/path'.
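A sketch of the fix (variable names assumed from format_corename()):

  if (ispipe) {
          pat_ptr++;                 /* step over the '|' */
          while (*pat_ptr == ' ')    /* tolerate '| /path/to/helper' */
                  pat_ptr++;
  }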
Fixes: 315c69261d ("coredump: split pipe command whitespace before expanding template")
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Wise <pabs3@bonedaddy.net>
Cc: Jakub Wilk <jwilk@jwilk.net> [https://bugs.debian.org/924398]
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/5fb62870.1c69fb81.8ef5d.af76@mx.google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'io_uring-5.10-2020-12-05' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Just a small fix this time, for an issue with 32-bit compat apps and
buffer selection with recvmsg"
* tag 'io_uring-5.10-2020-12-05' of git://git.kernel.dk/linux-block:
io_uring: fix recvmsg setup with compat buf-select
Merge tag '5.10-rc6-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three smb3 fixes (two for stable) fixing
- a null pointer issue in a DFS error path
- a problem with excessive padding when mounted with "idsfromsid"
causing owner fields to get corrupted
- a more recent problem with compounded reparse point query found in
testing to the Linux kernel server"
* tag '5.10-rc6-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: refactor create_sd_buf() and avoid corrupting the buffer
cifs: add NULL check for ses->tcon_ipc
smb3: set COMPOUND_FID to FileID field of subsequent compound request
Currently, the sock_from_file prototype takes an "err" pointer that is
either not set or set to -ENOTSOCK IFF the returned socket is NULL. This
makes the error redundant and it is ignored by a few callers.
This patch simplifies the API by letting callers deduce the error based
on whether the returned socket is NULL or not.
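Illustrative caller fragments (a hypothetical call site, not code from
the patch itself):

  /* before: redundant error out-parameter */
  int err;
  struct socket *sock;

  sock = sock_from_file(file, &err);
  if (!sock)
          return err;               /* err was always -ENOTSOCK here */

  /* after: the error is implied by the NULL return */
  sock = sock_from_file(file);
  if (!sock)
          return -ENOTSOCK;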
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-1-revest@google.com
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-12-03
The main changes are:
1) Support BTF in kernel modules, from Andrii.
2) Introduce preferred busy-polling, from Björn.
3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.
4) Memcg-based memory accounting for bpf objects, from Roman.
5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
selftests/bpf: Fix invalid use of strncat in test_sockmap
libbpf: Use memcpy instead of strncpy to please GCC
selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
selftests/bpf: Add tp_btf CO-RE reloc test for modules
libbpf: Support attachment of BPF tracing programs to kernel modules
libbpf: Factor out low-level BPF program loading helper
bpf: Allow to specify kernel module BTFs when attaching BPF programs
bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
selftests/bpf: Add support for marking sub-tests as skipped
selftests/bpf: Add bpf_testmod kernel module for testing
libbpf: Add kernel module BTF support for CO-RE relocations
libbpf: Refactor CO-RE relocs to not assume a single BTF object
libbpf: Add internal helper to load BTF data by FD
bpf: Keep module's btf_data_size intact after load
bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
bpf: Adds support for setting window clamp
samples/bpf: Fix spelling mistake "recieving" -> "receiving"
bpf: Fix cold build of test_progs-no_alu32
...
====================
Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't
immediately close the files but it sets the close-on-exec bit.
It is useful for e.g. container runtimes that usually install a
seccomp profile "as late as possible" before execv'ing the container
process itself. The container runtime could either do:
Variant 1:
- install_seccomp_profile();
- close_range(MIN_FD, MAX_INT, 0);
- execve(...);
Variant 2:
- close_range(MIN_FD, MAX_INT, 0);
- install_seccomp_profile();
- execve(...);
Both variants have some disadvantages.
In the first variant the seccomp_profile cannot block the close_range
syscall, as well as opendir/read/close/... for the fallback on older
kernels.
In the second variant, close_range() can be used only on the fds
that are not going to be needed by the runtime anymore, and it must be
potentially called multiple times to account for the different ranges
that must be closed.
Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues.
The runtime is able to use the existing open fds, the seccomp profile
can block close_range() and the syscalls used for its fallback.
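A hedged userspace sketch (header and wrapper availability depend on the
libc; CLOSE_RANGE_CLOEXEC is the flag added by this patch):

  #define _GNU_SOURCE
  #include <linux/close_range.h>    /* CLOSE_RANGE_CLOEXEC */
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Mark every fd >= first close-on-exec in one call; nothing is
   * closed until execve(), so the runtime keeps using its open fds
   * and a seccomp profile installed afterwards can still block
   * close_range(2). */
  static int cloexec_fds_from(unsigned int first)
  {
          return (int)syscall(SYS_close_range, first, ~0U,
                              CLOSE_RANGE_CLOEXEC);
  }

The sequence then becomes: cloexec_fds_from(MIN_FD);
install_seccomp_profile(); execve(...);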
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Link: https://lore.kernel.org/r/20201118104746.873084-2-gscrivan@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When mounting with "idsfromsid" mount option, Azure
corrupted the owner SIDs due to excessive padding
caused by placing the owner fields at the end of the
security descriptor on create. Placing owners at the
front of the security descriptor (rather than the end)
is also safer, as the number of ACEs (that follow it) is variable.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Suggested-by: Rohith Surabattula <rohiths@microsoft.com>
CC: Stable <stable@vger.kernel.org> # v5.8
Signed-off-by: Steve French <stfrench@microsoft.com>
In some scenarios (DFS and BAD_NETWORK_NAME) set_root_ses() can be
called with a NULL ses->tcon_ipc.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
For an operation compounded with an SMB2 CREATE request, the client must
set COMPOUND_FID (0xFFFFFFFFFFFFFFFF) in the FileID field of the SMB2
IOCTL request.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Fixes: 2e4564b31b ("smb3: add support stat of WSL reparse points for special file types")
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Function signal_our_withdraw needs to work on file systems that have been
partially frozen. To do this, it called flush_workqueue(gfs2_freeze_wq).
This is wrong because it waits for *ALL* file systems to be unfrozen, not
just the one we're withdrawing from; in other words, it would wait until
ALL file systems are thawed before signaling the withdraw. It should only
wait for the targeted file system to be unfrozen.
This patch changes signal_our_withdraw so it calls flush_work() for the
target file system's freeze work (only) to be completed.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_statfs_sync called sb_start_write and
sb_end_write. This is completely unnecessary because, aside from grabbing
glocks, gfs2_statfs_sync does all its updates to statfs with a transaction:
gfs2_trans_begin and _end. And transactions always do sb_start_intwrite in
gfs2_trans_begin and sb_end_intwrite in gfs2_trans_end.
This patch simply removes the call to sb_start_write.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The code of mb_find_order_for_block is a bit obscure, but we can
simplify it with mb_find_buddy(), making the code more concise.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-3-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After commit 163a203, if an abnormal bitmap is detected, we mark the
group as corrupt and will not use it in the future. Therefore, it is
meaningless to regenerate the buddy bitmap of this group; it might be
better to delete that code.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-2-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Convert inotify to use the simple handle_inode_event() interface to
get rid of the code duplication between the generic helper
fsnotify_handle_event() and the inotify_handle_event() callback, which
also happen to be buggy code.
The bug will be fixed in the generic helper.
Link: https://lore.kernel.org/r/20201202120713.702387-3-amir73il@gmail.com
CC: stable@vger.kernel.org
Fixes: b9a1b97725 ("fsnotify: create method handle_inode_event() in fsnotify_operations")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
There are currently multiple forms of assertion, such as J_ASSERT().
J_ASSERT() is provided for the jbd module, which is a public module.
Maybe we should use a custom ASSERT() like other file systems, such as
xfs, which would be better.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Right now, it is hard to understand which quota journalling type is
enabled: you need to be quite familiar with the kernel code and trace it,
or really understand what the different combinations of fs flags/mount
options lead to.
This patch adds printing of the current quota journalling mode on each
mount/remount, thus making it easier to check at a glance/in autotests.
The semantics is similar to ext4 data journalling modes:
* journalled - quota configured, journalling will be enabled
* writeback - quota configured, journalling won't be enabled
* none - quota isn't configured
* disabled - kernel compiled without CONFIG_QUOTA feature
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1603336860-16153-2-git-send-email-dotdot@yandex-team.ru
Signed-off-by: Roman Anufriev <dotdot@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Right now, there are several places where we check whether the fs is
capable of enabling quota, or whether quota is journalled, using quite
long and non-self-descriptive condition statements.
This patch wraps these statements into helpers for better readability
and easier usage.
Link: https://lore.kernel.org/r/1603336860-16153-1-git-send-email-dotdot@yandex-team.ru
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Roman Anufriev <dotdot@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Variable ex is assigned a value that is not being read; the assignment
is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20201021132326.148052-1-colin.king@canonical.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
bv_page can't be NULL in a valid bio_vec, so we can remove the NULL check,
as we did in other places when calling bio_for_each_segment_all() to go
through all bio_vec of a bio.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Link: https://lore.kernel.org/r/20201020082201.34257-1-tian.xianting@h3c.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The out_fail branch path doesn't release the bh, and the second bh is
valid only within the for statement, so we don't need to set them to NULL.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/1603194069-17557-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The handle_inode_event() interface was added as (quoting comment):
"a simple variant of handle_event() for groups that only have inode
marks and don't have ignore mask".
In other words, all backends except fanotify. The inotify backend
also falls under this category, but because it required extra arguments
it was left out of the initial pass of backends conversion to the
simple interface.
This results in code duplication between the generic helper
fsnotify_handle_event() and the inotify_handle_event() callback
which also happen to be buggy code.
Generalize the handle_inode_event() arguments and add the check for
FS_EXCL_UNLINK flag to the generic helper, so inotify backend could
be converted to use the simple interface.
Link: https://lore.kernel.org/r/20201202120713.702387-2-amir73il@gmail.com
CC: stable@vger.kernel.org
Fixes: b9a1b97725 ("fsnotify: create method handle_inode_event() in fsnotify_operations")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This was an oversight in the original implementation, as it makes no
sense to specify both scoping flags to the same openat2(2) invocation
(before this patch, the result of such an invocation was equivalent to
RESOLVE_IN_ROOT being ignored).
This is a userspace-visible ABI change, but the only user of openat2(2)
at the moment is LXC which doesn't specify both flags and so no
userspace programs will break as a result.
Fixes: fddb5d430a ("open: introduce openat2(2) syscall")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <stable@vger.kernel.org> # v5.6+
Link: https://lore.kernel.org/r/20201027235044.5240-2-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Add two ioctls to explicitly decompress/compress compression-enabled
files under the "compress_mode=user" mount option.
Using these two ioctls, users can control the compression and
decompression of their files.
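A hedged userspace sketch; the fallback request numbers below are
assumptions based on the f2fs ioctl namespace and should be taken from
the kernel header when available:

  #include <fcntl.h>
  #include <sys/ioctl.h>

  #ifndef F2FS_IOC_COMPRESS_FILE
  #define F2FS_IOCTL_MAGIC          0xf5
  #define F2FS_IOC_DECOMPRESS_FILE  _IO(F2FS_IOCTL_MAGIC, 23)  /* assumed */
  #define F2FS_IOC_COMPRESS_FILE    _IO(F2FS_IOCTL_MAGIC, 24)  /* assumed */
  #endif

  /* Compress a compression-enabled file on demand (compress_mode=user). */
  static int f2fs_compress_now(int fd)
  {
          return ioctl(fd, F2FS_IOC_COMPRESS_FILE);  /* 0 on success */
  }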
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We will add a new "compress_mode" mount option to control file
compression mode. This supports "fs" and "user". In "fs" mode (default),
f2fs does automatic compression on the compression enabled files.
In "user" mode, f2fs disables the automaic compression and gives the
user discretion of choosing the target file and the timing. It means
the user can do manual compression/decompression on the compression
enabled files using ioctls.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
WARN_ON() already contains an unlikely(), so it's not necessary
to use unlikely.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Shuosheng Huang <huangshuosheng@allwinnertech.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The section is dirty, but dirty_secmap may not be set.
Reported-by: Jia Yang <jiayang5@huawei.com>
Fixes: da52f8ade4 ("f2fs: get the right gc victim section when section has several segments")
Cc: <stable@vger.kernel.org>
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes the buffer_head dependency when getting block
addresses.
Light reported a 32-bit issue in f2fs_fiemap where map_bh.b_size is
32 bits while len is a 64-bit value given by the user. This passes a
wrong length to f2fs_map_block.
Reported-by: Light Hsieh <Light.Hsieh@mediatek.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should convert cur_lblock, a block count, to bytes for len.
Fixes: af4b6b8edf ("f2fs: introduce check_swap_activate_fast()")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch renames the two functions below, which take u64 arguments.
- logical_to_blk to bytes_to_blks
- blk_to_logical to blks_to_bytes
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For the multi-device case, one f2fs image includes multiple devices, so
we need to account the bytes written to all block devices belonging to
the image rather than only the main block device. Fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch supports storing a checksum value with the compressed data,
and verifying the integrity of the compressed data while reading it.
The feature can be enabled through specifying mount option
'compress_chksum'.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Lei Li reported an issue: if foreground operations are frequent, the
background checkpoint may always be skipped due to the check below,
resulting in losing more data after a sudden power-cut.
f2fs_balance_fs_bg()
...
if (!is_idle(sbi, REQ_TIME) &&
(!excess_dirty_nats(sbi) && !excess_dirty_nodes(sbi)))
return;
E.g:
cp_interval = 5 second
idle_interval = 2 second
foreground operation interval = 1 second (append 1 byte per second into file)
In such a case, no matter when f2fs_balance_fs_bg() is called,
is_idle(, REQ_TIME) returns false, resulting in skipping the background
checkpoint.
This patch changes the logic as below to make the trigger condition more
reasonable:
- trigger sync_fs() if dirty_{nats,nodes} and prefree segs exceed the
threshold;
- skip triggering sync_fs() if there is any background inflight IO, or
if there was a recent foreground operation while cp_rwsem is being held
by someone;
Reported-by: Lei Li <noctis.akm@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use an rwsem to ensure serialization of the callers and to avoid
starvation of high priority tasks when the system is under a heavy IO
workload.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Expand f2fs's casefolding support to include encrypted directories. To
index casefolded+encrypted directories, we use the SipHash of the
casefolded name, keyed by a key derived from the directory's fscrypt
master key. This ensures that the dirhash doesn't leak information
about the plaintext filenames.
Encryption keys are unavailable during roll-forward recovery, so we
can't compute the dirhash when recovering a new dentry in an encrypted +
casefolded directory. To avoid having to force a checkpoint when a new
file is fsync'ed, store the dirhash on-disk appended to i_name.
This patch incorporates work by Eric Biggers <ebiggers@google.com>
and Jaegeuk Kim <jaegeuk@kernel.org>.
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This shifts the responsibility of setting up dentry operations from
fscrypt to the individual filesystems, allowing them to have their own
operations while still setting fscrypt's d_revalidate as appropriate.
Most filesystems can just use generic_set_encrypted_ci_d_ops, unless
they have their own specific dentry operations as well. That operation
will set the minimal d_ops required under the circumstances.
Since the fscrypt d_ops are now set at this later point, we must set all
d_ops there, as we cannot adjust them afterwards. This should not result
in any change in behavior.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This adds a function to set dentry operations at lookup time that will
work for both encrypted filenames and casefolded filenames.
A filesystem that supports both features simultaneously can use this
function during lookup preparations to set up its dentry operations once
fscrypt no longer does that itself.
Currently the casefolding dentry operations are always set if the
filesystem defines an encoding, because the feature is toggleable on
empty directories. Unlike in the encryption case, the dentry operations
used come from the parent. Since we don't know what set of functions
we'll eventually need, and cannot change them later, we enable the
casefolding operations if the filesystem supports them at all.
By splitting out the various cases, we support as few dentry operations
as we can get away with, maximizing compatibility with overlayfs, which
will not function if a filesystem supports certain dentry_operations.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are two assignments that are meaningless; remove them.
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since sync_inodes_sb has been used, there is no need to
use writeback_inodes_sb, so remove it.
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In case of retrying fill_super with skip_recovery, s_encoding for
casefold would not be loaded again, even though it has already been
freed, because it's not NULL.
Set it to NULL after freeing to prevent a double free at unmount.
Fixes: eca4873ee1 ("f2fs: Use generic casefolding support")
Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Eric reported a ioctl bug in below link:
https://lore.kernel.org/linux-f2fs-devel/20201103032234.GB2875@sol.localdomain/
That said, on some 32-bit architectures, u64 has only 32-bit alignment,
notably i386 and x86_32, so the size of struct f2fs_gc_range compiled
for x86_32 is 20 bytes while the size on x86_64 is 24 bytes. A binary
compiled for x86_32 cannot call F2FS_IOC_GARBAGE_COLLECT_RANGE
successfully due to the mismatched ioctl command value between the
binary and the f2fs module; similarly, F2FS_IOC_MOVE_RANGE will fail too.
In this patch we introduce two ioctls for compatibility of above special
32-bit binary:
- F2FS_IOC32_GARBAGE_COLLECT_RANGE
- F2FS_IOC32_MOVE_RANGE
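A kernel-side sketch of why the sizes diverge and how the compat shim
looks (the ioctl number below is an assumption): on i386, u64 is only
4-byte aligned, so the struct packs to 20 bytes instead of 24, and since
_IOW() encodes sizeof() into the command value, the 32-bit and 64-bit
ioctl numbers differ.

  #include <linux/compat.h>

  struct compat_f2fs_gc_range {
          u32 sync;
          compat_u64 start;    /* 64-bit value with 4-byte alignment */
          compat_u64 len;
  };
  #define F2FS_IOC32_GARBAGE_COLLECT_RANGE \
          _IOW(F2FS_IOCTL_MAGIC, 11, struct compat_f2fs_gc_range)  /* assumed nr */

In compat_ioctl, the 32-bit layout is copied in and repacked into the
native struct f2fs_gc_range before calling the normal handler.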
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fields in struct f2fs_move_range won't change in f2fs_ioc_move_range(),
let's avoid copying this structure's data to userspace.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Patch series "mm: allow mapping accounted kernel pages to userspace", v6.
Currently a non-slab kernel page which has been charged to a memory cgroup
can't be mapped to userspace. The underlying reason is simple: PageKmemcg
flag is defined as a page type (like buddy, offline, etc), so it takes a
bit from a page->mapped counter. Pages with a type set can't be mapped to
userspace.
But in general the kmemcg flag has nothing to do with mapping to
userspace. It only means that the page has been accounted by the page
allocator, so it has to be properly uncharged on release.
Some bpf maps are mapping the vmalloc-based memory to userspace, and their
memory can't be accounted because of this implementation detail.
This patchset removes this limitation by moving the PageKmemcg flag into
one of the free bits of the page->mem_cgroup pointer. Also it formalizes
accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
adds several checks and removes a couple of obsolete functions. As the
result the code became more robust with fewer open-coded bit tricks.
This patch (of 4):
Currently there are many open-coded reads of the page->mem_cgroup pointer,
as well as a couple of read helpers, which are barely used.
It creates an obstacle on a way to reuse some bits of the pointer for
storing additional bits of information. In fact, we already do this for
slab pages, where the last bit indicates that a pointer has an attached
vector of objcg pointers instead of a regular memcg pointer.
This commit uses 2 existing helpers and introduces a new helper,
converting all read sides to calls of these helpers:
struct mem_cgroup *page_memcg(struct page *page);
struct mem_cgroup *page_memcg_rcu(struct page *page);
struct mem_cgroup *page_memcg_check(struct page *page);
page_memcg_check() is intended to be used in cases when the page can be a
slab page and have a memcg pointer pointing at objcg vector. It does
check the lowest bit, and if set, returns NULL. page_memcg() contains a
VM_BUG_ON_PAGE() check for the page not being a slab page.
To make sure nobody uses a direct access, struct page's
mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
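An illustrative conversion at a hypothetical call site:

  struct mem_cgroup *memcg;

  memcg = page->mem_cgroup;    /* before: open-coded field read */
  memcg = page_memcg(page);    /* after: the helper hides the encoding */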
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
Currently it's impossible to delete files that use an unsupported
encryption policy, as the kernel will just return an error when
performing any operation on the top-level encrypted directory, even just
a path lookup into the directory or opening the directory for readdir.
More specifically, this occurs in any of the following cases:
- The encryption context has an unrecognized version number. Current
kernels know about v1 and v2, but there could be more versions in the
future.
- The encryption context has unrecognized encryption modes
(FSCRYPT_MODE_*) or flags (FSCRYPT_POLICY_FLAG_*), an unrecognized
combination of modes, or reserved bits set.
- The encryption key has been added and the encryption modes are
recognized but aren't available in the crypto API -- for example, a
directory is encrypted with FSCRYPT_MODE_ADIANTUM but the kernel
doesn't have CONFIG_CRYPTO_ADIANTUM enabled.
It's desirable to return errors for most operations on files that use an
unsupported encryption policy, but the current behavior is too strict.
We need to allow enough to delete files, so that people can't be stuck
with undeletable files when downgrading kernel versions. That includes
allowing directories to be listed and allowing dentries to be looked up.
Fix this by modifying the key setup logic to treat an unsupported
encryption policy in the same way as "key unavailable" in the cases that
are required for a recursive delete to work: preparing for a readdir or
a dentry lookup, revalidating a dentry, or checking whether an inode has
the same encryption policy as its parent directory.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-10-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Now that fscrypt_get_encryption_info() is only called from files in
fs/crypto/ (due to all key setup now being handled by higher-level
helper functions instead of directly by filesystems), unexport it and
move its declaration to fscrypt_private.h.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-9-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
fscrypt_require_key() is now only used by files in fs/crypto/, so
reduce its visibility to fscrypt_private.h. This is also a prerequisite
for unexporting fscrypt_get_encryption_info().
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
In preparation for reducing the visibility of fscrypt_require_key() by
moving it to fscrypt_private.h, move the call to it from
fscrypt_prepare_setattr() to an out-of-line function.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-7-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
The last remaining use of fscrypt_get_encryption_info() from filesystems
is for readdir (->iterate_shared()). Every other call is now in
fs/crypto/ as part of some other higher-level operation.
We need to add a new argument to fscrypt_get_encryption_info() to
indicate whether the encryption policy is allowed to be unrecognized or
not. Doing this is easier if we can work with high-level operations
rather than direct filesystem use of fscrypt_get_encryption_info().
So add a function fscrypt_prepare_readdir() which wraps the call to
fscrypt_get_encryption_info() for the readdir use case.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
The call to fscrypt_get_encryption_info() in dx_show_leaf() is too low
in the call tree; fscrypt_get_encryption_info() should have already been
called when starting the directory operation. And indeed, it already
is. Moreover, the encryption key is guaranteed to already be available
because dx_show_leaf() is only called when adding a new directory entry.
And even if the key wasn't available, dx_show_leaf() uses
fscrypt_fname_disk_to_usr() which knows how to create a no-key name.
So for the above reasons, and because it would be desirable to stop
exporting fscrypt_get_encryption_info() directly to filesystems, remove
the call to fscrypt_get_encryption_info() from dx_show_leaf().
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Since encrypted directories can be opened and searched without their key
being available, and each readdir and ->lookup() tries to set up the
key, trying to set up the key in ->open() too isn't really useful.
Just remove it so that directories don't need an ->open() method
anymore, and so that we eliminate a use of fscrypt_get_encryption_info()
(which I'd like to stop exporting to filesystems).
Link: https://lore.kernel.org/r/20201203022041.230976-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Since encrypted directories can be opened and searched without their key
being available, and each readdir and ->lookup() tries to set up the
key, trying to set up the key in ->open() too isn't really useful.
Just remove it so that directories don't need an ->open() method
anymore, and so that we eliminate a use of fscrypt_get_encryption_info()
(which I'd like to stop exporting to filesystems).
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20201203022041.230976-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Since encrypted directories can be opened and searched without their key
being available, and each readdir and ->lookup() tries to set up the
key, trying to set up the key in ->open() too isn't really useful.
Just remove it so that directories don't need an ->open() method
anymore, and so that we eliminate a use of fscrypt_get_encryption_info()
(which I'd like to stop exporting to filesystems).
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Merge tag 'gfs2-v5.10-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
"Various gfs2 fixes"
* tag 'gfs2-v5.10-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix deadlock between gfs2_{create_inode,inode_lookup} and delete_work_func
gfs2: Upgrade shared glocks for atime updates
gfs2: Don't freeze the file system during unmount
gfs2: check for empty rgrp tree in gfs2_ri_update
gfs2: set lockdep subclass for iopen glocks
gfs2: Fix deadlock dumping resource group glocks
nfsiod is currently a concurrency-managed workqueue (CMWQ).
This means that work items scheduled to nfsiod on a given CPU are queued
behind all other work items queued on any CMWQ on the same CPU. This
can introduce unexpected latency.
Occasionally nfsiod can even cause excessive latency. If the work item
to complete a CLOSE request calls the final iput() on an inode, the
address_space of that inode will be dismantled. This takes time
proportional to the number of in-memory pages, which on a large host
working on large files (e.g. 5TB) can be a large number of pages,
resulting in a noticeable number of seconds.
We can avoid these latency problems by switching nfsiod to WQ_UNBOUND.
This causes each concurrent work item to get a dedicated thread which
can be scheduled to an idle CPU.
There is precedent for this, as several other filesystems use WQ_UNBOUND
workqueues for handling various async events.
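The switch itself is a one-flag change where the workqueue is created;
a hedged sketch of the allocation:

    /*
     * WQ_UNBOUND: work items get their own worker threads instead of
     * queueing behind unrelated CMWQ work on the submitting CPU.
     * WQ_MEM_RECLAIM is kept so the queue can make forward progress
     * under memory pressure.
     */
    wq = alloc_workqueue("nfsiod", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
    if (!wq)
            return -ENOMEM;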
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: ada609ee2a ("workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
NLM uses an interval-based rebinding, i.e. it clears the transport's
binding under certain conditions if more than 60 seconds have elapsed
since the connection was last bound.
This rebinding is not necessary for an autobind RPC client over a
connection-oriented protocol like TCP.
It can also cause problems: it is possible for nlm_bind_host() to clear
XPRT_BOUND whilst a connection worker is in the middle of trying to
reconnect, after it had already been checked in xprt_connect().
When the connection worker notices that XPRT_BOUND has been cleared
under it, in xs_tcp_finish_connecting(), that results in:
xs_tcp_setup_socket: connect returned unhandled error -107
Worse, it's possible that the two can get into lockstep, resulting in
the same behaviour repeated indefinitely, with the above error every
300 seconds, without ever recovering, and the connection never being
established. This has been seen in practice, with a large number of NLM
client tasks, following a server restart.
The existing callers of nlm_bind_host & nlm_rebind_host should not need
to force the rebind, for TCP, so restrict the interval-based rebinding
to UDP only.
For TCP, we will still rebind when needed, e.g. on timeout, and connection
error (including closure), since connection-related errors on an existing
connection, ECONNREFUSED when trying to connect, and rpc_check_timeout(),
already unconditionally clear XPRT_BOUND.
To avoid having to add the fix, and explanation, to both nlm_bind_host()
and nlm_rebind_host(), remove the duplicate code from the former, and
have it call the latter.
Drop the dprintk, which adds no value over a trace.
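A hedged sketch of the resulting logic (treat field names as
illustrative):

    static void nlm_rebind_host(struct nlm_host *host)
    {
            /* Interval-based rebinding is only needed for UDP; for
             * TCP, XPRT_BOUND is already cleared on the error paths
             * that actually require a rebind. */
            if (host->h_proto != IPPROTO_UDP)
                    return;

            if (host->h_rpcclnt &&
                time_after_eq(jiffies, host->h_nextrebind)) {
                    rpc_force_rebind(host->h_rpcclnt);
                    host->h_nextrebind = jiffies + NLM_HOST_REBIND;
            }
    }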
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Fixes: 35f5a422ce ("SUNRPC: new interface to force an RPC rebind")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In several patches work has been done to enable NFSv4 to use user
namespaces:
58002399da: NFSv4: Convert the NFS client idmapper to use the container user namespace
3b7eb5e35d: NFS: When mounting, don't share filesystems between different user namespaces
Unfortunately, the userspace APIs only allowed the userspace-facing
side of the filesystem (the superblock's s_user_ns) to be set to a
non-init user namespace. This furthers the fs_context related
refactoring and piggybacks on top of that logic, so that the superblock
user namespace and the NFS user namespace are the same.
Users can still use rpc.idmapd if they choose to, but there are complexities
with user namespaces and request-key that have yet to be addressed.
Eventually, we will need to at least:
* Come up with an upcall mechanism that can be triggered inside of the
  container, or safely triggered outside, with the requisite context to
  do the right mapping.
* Handle whatever refactoring needs to be done in net/sunrpc.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Tested-by: Alban Crequy <alban.crequy@gmail.com>
Fixes: 62a55d088c ("NFS: Additional refactoring for fs_context conversion")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
There was refactoring done to use the fs_context for mounting done in:
62a55d088c: NFS: Additional refactoring for fs_context conversion
This made it so that the net_ns is fetched from the fs_context (the netns
that fsopen is called in). This change also makes it so that the credential
fetched during fsopen is used as well as the net_ns.
NFS has already had a number of changes to prepare it for user namespaces:
1a58e8a0e5: NFS: Store the credential of the mount process in the nfs_server
264d948ce7: NFS: Convert NFSv3 to use the container user namespace
c207db2f5d: NFS: Convert NFSv2 to use the container user namespace
Previously, different credentials could be used for creation of the
fs_context versus creation of the nfs_server, as FSCONFIG_CMD_CREATE did
the actual credential check, and that's where current_cred() was fetched.
This meant that the user namespace which fsopen was called in could be a
non-init user namespace. This still requires that the user that calls
FSCONFIG_CMD_CREATE has CAP_SYS_ADMIN in the init user ns.
This roughly allows a privileged user to mount on behalf of an unprivileged
user namespace, by forking off and calling fsopen in the unprivileged user
namespace. It can then pass back that fsfd to the privileged process which
can configure the NFS mount, and then it can call FSCONFIG_CMD_CREATE
before switching back into the mount namespace of the container, and finish
up the mounting process and call fsmount and move_mount.
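Roughly, the userspace flow looks like the following sketch (the export
path and mount point are placeholders, the SCM_RIGHTS fd hand-off is
elided, and error handling is omitted):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/mount.h>

    /* In the container's (unprivileged) user namespace: */
    int fsfd = syscall(SYS_fsopen, "nfs4", FSOPEN_CLOEXEC);
    /* ... pass fsfd to the privileged helper over SCM_RIGHTS ... */

    /* In the privileged process (CAP_SYS_ADMIN in the init userns): */
    syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "source",
            "server:/export", 0);
    syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
    int mfd = syscall(SYS_fsmount, fsfd, 0, 0);

    /* After switching into the container's mount namespace: */
    syscall(SYS_move_mount, mfd, "", AT_FDCWD, "/mnt",
            MOVE_MOUNT_F_EMPTY_PATH);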
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Tested-by: Alban Crequy <alban.crequy@gmail.com>
Fixes: 62a55d088c ("NFS: Additional refactoring for fs_context conversion")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When returning the layout in nfs4_evict_inode(), we need to ensure that
the layout is actually done being freed before we can proceed to free the
inode itself.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
rpc_prepare_reply_pages() currently expects the 'hdrsize' argument to
contain the length of the data that we expect to want placed in the head
kvec plus a count of 1 word of padding that is placed after the page data.
This is very confusing when trying to read the code, and sometimes leads
to callers adding an arbitrary value of '1' just in order to satisfy the
requirement (whether or not the page data actually needs such padding).
This patch aims to clarify the code by changing the 'hdrsize' argument
to remove that 1 word of padding. This means we need to subtract the
padding from all the existing callers.
Fixes: 02ef04e432 ("NFS: Account for XDR pad of buf->pages")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We can fit the device_addr4 opaque data padding in the pages.
Fixes: cf500bac8f ("SUNRPC: Introduce rpc_prepare_reply_pages()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Use the existing xdr_stream_decode_string_dup() to safely decode into
kmalloced strings.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ensure that we report the correct netid when using UDP or RDMA
transports to the DSes.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We want to enable RDMA and UDP as valid transport methods if a
GETDEVICEINFO call specifies it. Do so by adding a parser for the
netid that translates it to an appropriate argument for the RPC
transport layer.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the pNFS metadata server advertises multiple addresses for the same
data server, we should try to connect to just one protocol family and
transport type on the assumption that homogeneity will improve performance.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Switch the mount code to use xprt_find_transport_ident() and to check
the results before allowing the mount to proceed.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the directory is changing, causing the page cache to get invalidated
while we are listing the contents, then the NFS client is currently forced
to read in the entire directory contents from scratch, because it needs
to perform a linear search for the readdir cookie. While this is not
an issue for small directories, it does not scale to directories with
millions of entries.
In order to be able to deal with large directories that are changing,
add a heuristic to ensure that if the page cache is empty, and we are
searching for a cookie that is not the zero cookie, we just default to
performing uncached readdir.
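The heuristic boils down to something like this sketch (names are
illustrative, not the exact upstream helpers):

    static bool nfs_readdir_should_go_uncached(struct nfs_readdir_descriptor *desc)
    {
            struct address_space *mapping = desc->file->f_mapping;

            /* The cache was invalidated under us and we are resuming
             * from a non-zero cookie: a linear refill up to that
             * cookie would be O(directory size), so read uncached. */
            return desc->dir_cookie != 0 && mapping->nrpages == 0;
    }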
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
If we're doing uncached readdir, allocate multiple pages in order to
try to avoid duplicate RPC calls for the same getdents() call.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
If the server is handing out monotonically increasing readdir cookie values,
then we can optimise away searches through pages that contain cookies that
lie outside our search range.
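With monotonically increasing cookies, the per-page check can be as
simple as this sketch (field names assumed):

    static bool nfs_readdir_page_may_contain(const struct nfs_cache_array *array,
                                             u64 wanted_cookie)
    {
            /* Entries in a cached page end at array->last_cookie;
             * with monotonic cookies, a page ending before the wanted
             * cookie is skipped without scanning its entries. */
            return array->last_cookie >= wanted_cookie;
    }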
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
If the server insists on using the readdir verifiers in order to allow
cookies to expire, then we should ensure that we cache the verifier
with the cookie, so that we can return an error if the application
tries to use the expired cookie.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
If the server returns NFS4ERR_NOT_SAME or tells us that the cookie is
bad in response to a READDIR call, then we should empty the page cache
so that we can fill it from scratch again.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
If we're ever going to allow support for servers that use the readdir
verifier, then that use needs to be managed by the middle layers as
those need to be able to reject cookies from other verifiers.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
The descriptor and the struct nfs_entry are both large structures,
so don't allocate them from the stack.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Clean up nfs_do_filldir().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Remove the redundant caching of the credential in struct
nfs_open_dir_context.
Pass the buffer size as an argument to nfs_readdir_xdr_filler().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Support readdir buffers of up to 1MB in size so that we can read
large directories using few RPC calls.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
We don't need to store a hash, so replace struct qstr with a simple
const char pointer and length.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
The kmapped pointer is only used once per loop to check if we need to
exit.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
If a readdir call returns more data than we can fit into one page
cache page, then allocate a new one for that data rather than
discarding the data.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Refactor to use pagecache_get_page() so that we can fill the page
in multiple stages.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Clean up handling of the case where there are no entries in the readdir
reply.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Since the 'eof_index' is only ever used as a flag, make it so.
Also add a flag to detect if the page has been completely filled.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Ensure that the contents of struct nfs_open_dir_context are consistent
by setting them under the file->f_lock from a private copy (that is
known to be consistent).
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Currently, the client will always ask for security_labels if the server
advertises support for that feature, regardless of whether any LSM
modules (such as SELinux) are enforcing security policy. This adds a
performance penalty to the READDIR operation.
The client adjusts the superblock's support of the security_label based
on the server's support as well as the client's current configuration
of LSM modules. Thus, prior to using the default bitmask in READDIR,
this patch checks the server's capabilities and then instructs READDIR
to remove FATTR4_WORD2_SECURITY_LABEL from the bitmask.
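A hedged sketch of the check (the bitmask constants are the real NFSv4
ones; the surrounding readdir argument setup is assumed):

    /* Drop the label attribute from the READDIR bitmask when the
     * superblock does not support security labels (e.g. no LSM is
     * enforcing them on this client). */
    if (!(server->caps & NFS_CAP_SECURITY_LABEL))
            args.bitmask[2] &= ~FATTR4_WORD2_SECURITY_LABEL;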
v5: fixing silly mistakes of the rushed v4
v4: simplifying logic
v3: changing label's initialization per Ondrej's comment
v2: dropping selinux hook and using the sb cap.
Suggested-by: Ondrej Mosnacek <omosnace@redhat.com>
Suggested-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: 2b0143b5c9 ("VFS: normal filesystems (and lustre): d_inode() annotations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We need to respect the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp,
by timing out if the server is unavailable.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In order to use the open_by_filehandle() operations on NFSv3, we need
to be able to emulate lookupp() so that nfs_get_parent() can be used
to convert disconnected dentries into connected ones.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We want to reuse the lookup code in NFSv3 in order to emulate the
NFSv4 lookupp operation.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Since commit b4868b44c5 ("NFSv4: Wait for stateid updates after
CLOSE/OPEN_DOWNGRADE"), every inter server copy operation suffers 5
seconds delay regardless of the size of the copy. The delay is from
nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential
fails because the seqid in both nfs4_state and nfs4_stateid are 0.
Fix __nfs42_ssc_open to delay setting of NFS_OPEN_STATE in nfs4_state,
until after the call to update_open_stateid, to indicate this is the 1st
open. This fix is part of a 2 patches, the other patch is the fix in the
source server to return the stateid for COPY_NOTIFY request with seqid 1
instead of 0.
Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
By switching to an XFS-backed export, I am able to reproduce the
ibcomp worker crash on my client with xfstests generic/013.
For the failing LISTXATTRS operation, xdr_inline_pages() is called
with page_len=12 and buflen=128.
- When ->send_request() is called, rpcrdma_marshal_req() does not
set up a Reply chunk because buflen is smaller than the inline
threshold. Thus rpcrdma_convert_iovs() does not get invoked at
all and the transport's XDRBUF_SPARSE_PAGES logic is not invoked
on the receive buffer.
- During reply processing, rpcrdma_inline_fixup() tries to copy
received data into rq_rcv_buf->pages because page_len is positive.
But there are no receive pages because rpcrdma_marshal_req() never
allocated them.
The result is that the ibcomp worker faults and dies. Sometimes that
causes a visible crash, and sometimes it results in a transport hang
without other symptoms.
RPC/RDMA's XDRBUF_SPARSE_PAGES support is not entirely correct, and
should eventually be fixed or replaced. However, my preference is
that upper-layer operations should explicitly allocate their receive
buffers (using GFP_KERNEL) when possible, rather than relying on
XDRBUF_SPARSE_PAGES.
Reported-by: Olga kornievskaia <kolga@netapp.com>
Suggested-by: Olga kornievskaia <kolga@netapp.com>
Fixes: c10a75145f ("NFSv4.2: add the extended attribute proc functions.")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Olga kornievskaia <kolga@netapp.com>
Reviewed-by: Frank van der Linden <fllinden@amazon.com>
Tested-by: Olga kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS. This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition. This is particularly important for Windows
games running over Wine.
The proposed interface looks like this:
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
The range [<offset>,<offset>+<length>) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap nor the kernel has to check the selector. This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.
selector is an optional pointer to a char-sized userspace memory region
that acts as a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF to enable or disable
the redirection without calling the kernel.
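A usage sketch based on the interface described above ('start' and
'len' are application-specific placeholders for the always-allowed
region):

    #include <sys/prctl.h>
    #include <linux/prctl.h>

    static char selector = PR_SYS_DISPATCH_OFF;    /* start pass-through */

    /* Syscalls issued from [start, start + len) always dispatch
     * natively; everything else raises SIGSYS while the selector
     * byte holds PR_SYS_DISPATCH_ON. */
    prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
          (unsigned long)start, len, &selector);

    selector = PR_SYS_DISPATCH_ON;     /* fast: no syscall needed */
    /* ... foreign code: its syscalls now trap to SIGSYS ... */
    selector = PR_SYS_DISPATCH_OFF;    /* back to native dispatch */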
The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.
Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support. I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism. And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.
For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant. The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access. In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
We can just dereference the pointer in struct gendisk instead. Also
remove the now unused export.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of having two structures that represent each block device with
different life time rules, merge them into a single one. This also
greatly simplifies the reference counting rules, as we can use the inode
reference count as the main reference count for the new struct
block_device, with the device model reference front ending it for device
model interaction.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bd_part is never NULL for a block device in use by a file system, so
remove the checks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use struct block_device to lookup partitions on a disk. This removes
all usage of struct hd_struct from the I/O path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allocate hd_struct together with struct block_device to pre-load
the lifetime rule changes in preparation of merging the two structures.
Note that part0 was previously embedded into struct gendisk, but is
a separate allocation now, and already points to the block_device instead
of the hd_struct. The lifetime of struct gendisk is still controlled by
the struct device embedded in the part0 hd_struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the holder_dir field to struct block_device in preparation for
killing struct hd_struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the partition_meta_info to struct block_device in preparation for
killing struct hd_struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the dkstats and stamp fields to struct block_device in preparation
of killing struct hd_struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that the hd_struct always has a block device attached to it, there is
no need for having two size fields that just get out of sync.
Additionally the field in hd_struct did not use proper serialization,
possibly allowing for torn writes. By only using the block_device field
this problem also gets fixed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't play tricks with slab constructors, as bdev structures tend not to
get reused very much, and this makes the code a lot less error prone.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stop passing the whole device as a separate argument given that it
can be trivially deduced, and clean up the !holder debug check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that each hd_struct has a reference to the corresponding
block_device, there is no need for the bd_contains pointer. Add
a bdev_whole() helper to look up the whole device block_device
structure instead.
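At this point in the series the whole-device block_device hangs off the
gendisk's part0, so the helper can be a simple macro along these lines
(a sketch; the exact field layout is assumed):

    #define bdev_whole(_bdev) \
            ((_bdev)->bd_disk->part0.bdev)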
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To simplify block device lookup and a few other upcoming areas, make sure
that we always have a struct block_device available for each disk and
each partition, and only find existing block devices in bdget. The only
downside of this is that each device and partition uses a little more
memory. The upside will be that a lot of code can be simplified.
With that, all we need to look up the block device is to look up the
inode and do a few sanity checks on the gendisk, instead of the separate
lookup for the gendisk. For blk-cgroup, which wants to access a gendisk without
opening it, a new blkdev_{get,put}_no_open low-level interface is added
to replace the previous get_gendisk use.
Note that the change to look up block device directly instead of the two
step lookup using struct gendisk causes a subtle change in behavior:
accessing a non-existing partition on an existing block device can now
cause a call to request_module. That call is harmless, and in practice
no recent system will access these nodes as they aren't created by udev
and static /dev/ setups are unusual.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch the block device lookup interfaces to directly work with a dev_t
so that struct block_device references are only acquired by the
blkdev_get variants (and the blk-cgroup special case). This means that
we now don't need an extra reference in the inode and can generally
simplify handling of struct block_device to keep the lookups contained
in the core block layer code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just call devcgroup_check_permission to avoid various superfluous checks
and a double conversion of the access flags.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This will allow for a more symmetric calling convention going forward.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move more code that is only run on the outer open but not the open of
the underlying whole device when opening a partition into blkdev_get,
which leads to a much easier to follow structure.
This allows us to simplify the disk and module refcounting so that one
reference is held for each open, similar to what we do with normal
file operations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reorder the code to have one big section for the last close, and to use
bdev_is_partition.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All of the current callers already have a reference, but to prepare for
additional users, ensure bdgrab returns NULL if the block device is being
freed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Adding the minor to the major creates tons of pointless conflicts. Just
use the dev_t itself, which is 32 bits and thus guaranteed to fit
into ino_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a little helper to find the kobject for a struct block_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Store the frozen superblock in struct block_device to avoid the awkward
interface that can return an sb that is only used as a cookie, an ERR_PTR,
or NULL.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just open code the wait in the only caller of both functions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The default splice operations were removed recently; add splice_write back
to 9p with iter_file_splice_write, like many other filesystems do.
Link: http://lkml.kernel.org/r/1606837496-21717-1-git-send-email-asmadeus@codewreck.org
Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
While kmsg_bytes can be set for pstore via mount, if a crash occurs
before the mount, only a partial console log will be stored, as kmsg_bytes
defaults to a potentially different hardcoded value in the kernel
(PSTORE_DEFAULT_KMSG_BYTES). This makes it impossible to analyze valuable
post-mortem data, especially during embedded development or in the process
of bringing up new boards. Change this value to a Kconfig option whose
default is the old PSTORE_DEFAULT_KMSG_BYTES.
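The change itself is small; a sketch, assuming the new Kconfig option
is named PSTORE_DEFAULT_KMSG_BYTES:

    /* Boot-time default now comes from Kconfig instead of a hardcoded
     * constant; the "kmsg_bytes=" mount option still overrides it. */
    static ulong kmsg_bytes = CONFIG_PSTORE_DEFAULT_KMSG_BYTES;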
Signed-off-by: Vasile-Laurentiu Stanimir <vasile-laurentiu.stanimir@windriver.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
This interface is entirely unused, so remove it and various bits of
unreachable code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201016132047.3068029-4-hch@lst.de
Introduce an arbitrary 128MiB cap to avoid malloc failures when using
a larger block device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: WeiXiong Liao <gmpy.liaowx@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201016132047.3068029-2-hch@lst.de
The v9fs file operations were missing the splice_read operations, which
breaks sendfile() of files on such a filesystem. I discovered this while
trying to load an eBPF program using iproute2 inside a 'virtme' environment
which uses 9pfs for the virtual file system. iproute2 relies on sendfile()
with an AF_ALG socket to hash files, which was erroring out in the virtual
environment.
Since generic_file_splice_read() seems to just implement splice_read in
terms of the read_iter operation, I simply added the generic implementation
to the file operations, which fixed the error I was seeing. A quick grep
indicates that this is what most other file systems do as well.
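The fix is a one-line addition to the file operations; a sketch
(struct name as in fs/9p, shown here for illustration):

    static const struct file_operations v9fs_file_operations = {
            /* ... existing read_iter/write_iter/mmap ops ... */
            .splice_read    = generic_file_splice_read,
    };

With this in place, sendfile() from a 9p file into an AF_ALG socket
works again.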
Link: http://lkml.kernel.org/r/20201201135409.55510-1-toke@redhat.com
Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>