linux/fs
Filipe Manana 28270e25c6 btrfs: always reserve space for delayed refs when starting transaction
When starting a transaction (or joining an existing one with
btrfs_start_transaction()), we reserve space for the number of items we
want to insert in a btree, but we don't do it for the delayed refs we
will generate while using the transaction to modify (COW) extent buffers
in a btree or allocate new extent buffers. Basically how it works:

1) When we start a transaction we reserve space for the number of items
   the caller wants to be inserted/modified/deleted in a btree. This space
   goes to the transaction block reserve;

2) If the delayed refs block reserve is not full, its size is greater
   than the amount of its reserved space, and the flush method is
   BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for
   it corresponding to the number of items the caller wants to
   insert/modify/delete in a btree;

3) The size of the delayed refs block reserve is increased when a task
   creates delayed refs after COWing an extent buffer, allocating a new
   one or deleting (freeing) an extent buffer. This happens after the
   the task started or joined a transaction, whenever it calls
   btrfs_update_delayed_refs_rsv();

4) The delayed refs block reserve is then refilled by anyone calling
   btrfs_delayed_refs_rsv_refill(), either during unlink/truncate
   operations or when someone else calls btrfs_start_transaction() with
   a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL;

5) As a task COWs or allocates extent buffers, it consumes space from the
   transaction block reserve. When the task releases its transaction
   handle (btrfs_end_transaction()) or it attempts to commit the
   transaction, it releases any remaining space in the transaction block
   reserve that it did not use, as not all space may have been used (due
   to pessimistic space calculation) by calling btrfs_block_rsv_release()
   which will try to add that unused space to the delayed refs block
   reserve (if its current size is greater than its reserved space).
   That transferred space may not be enough to completely fulfill the
   delayed refs block reserve.

   Plus we have some tasks that will attempt do modify as many leaves
   as they can before getting -ENOSPC (and then reserving more space and
   retrying), such as hole punching and extent cloning which call
   btrfs_replace_file_extents(). Such tasks can generate therefore a
   high number of delayed refs, for both metadata and data (we can't
   know in advance how many file extent items we will find in a range
   and therefore how many delayed refs for dropping references on data
   extents we will generate);

6) If a transaction starts its commit before the delayed refs block
   reserve is refilled, for example by the transaction kthread or by
   someone who called btrfs_join_transaction() before starting the
   commit, then when running delayed references if we don't have enough
   reserved space in the delayed refs block reserve, we will consume
   space from the global block reserve.

Now this doesn't make a lot of sense because:

1) We should reserve space for delayed references when starting the
   transaction, since we have no guarantees the delayed refs block
   reserve will be refilled;

2) If no refill happens then we will consume from the global block reserve
   when running delayed refs during the transaction commit;

3) If we have a bunch of tasks calling btrfs_start_transaction() with a
   number of items greater than zero and at the time the delayed refs
   reserve is full, then we don't reserve any space at
   btrfs_start_transaction() for the delayed refs that will be generated
   by a task, and we can therefore end up using a lot of space from the
   global reserve when running the delayed refs during a transaction
   commit;

4) There are also other operations that result in bumping the size of the
   delayed refs reserve, such as creating and deleting block groups, as
   well as the need to update a block group item because we allocated or
   freed an extent from the respective block group;

5) If we have a significant gap between the delayed refs reserve's size
   and its reserved space, two very bad things may happen:

   1) The reserved space of the global reserve may not be enough and we
      fail the transaction commit with -ENOSPC when running delayed refs;

   2) If the available space in the global reserve is enough it may result
      in nearly exhausting it. If the fs has no more unallocated device
      space for allocating a new block group and all the available space
      in existing metadata block groups is not far from the global
      reserve's size before we started the transaction commit, we may end
      up in a situation where after the transaction commit we have too
      little available metadata space, and any future transaction commit
      will fail with -ENOSPC, because although we were able to reserve
      space to start the transaction, we were not able to commit it, as
      running delayed refs generates some more delayed refs (to update the
      extent tree for example) - this includes not even being able to
      commit a transaction that was started with the goal of unlinking a
      file, removing an empty data block group or doing reclaim/balance,
      so there's no way to release metadata space.

      In the worst case the next time we mount the filesystem we may
      also fail with -ENOSPC due to failure to commit a transaction to
      cleanup orphan inodes. This later case was reported and hit by
      someone running a SLE (SUSE Linux Enterprise) distribution for
      example - where the fs had no more unallocated space that could be
      used to allocate a new metadata block group, and the available
      metadata space was about 1.5M, not enough to commit a transaction
      to cleanup an orphan inode (or do relocation of data block groups
      that were far from being full).

So improve on this situation by always reserving space for delayed refs
when calling start_transaction(), and if the flush method is
BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block
reserve if it's not full. The space reserved for the delayed refs is added
to a local block reserve that is part of the transaction handle, and when
a task updates the delayed refs block reserve size, after creating a
delayed ref, the space is transferred from that local reserve to the
global delayed refs reserve (fs_info->delayed_refs_rsv). In case the
local reserve does not have enough space, which may happen for tasks
that generate a variable and potentially large number of delayed refs
(such as the hole punching and extent cloning cases mentioned before),
we transfer any available space and then rely on the current behaviour
of hoping some other task refills the delayed refs reserve or fallback
to the global block reserve.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
..
9p - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") 2023-08-29 14:25:26 -07:00
adfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
affs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
afs - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") 2023-08-29 14:25:26 -07:00
autofs v6.6-vfs.autofs 2023-08-28 11:39:14 -07:00
befs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
bfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
btrfs btrfs: always reserve space for delayed refs when starting transaction 2023-10-12 16:44:06 +02:00
cachefiles - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") 2023-08-29 14:25:26 -07:00
ceph ceph: remove unnecessary check for NULL in parse_longname() 2023-09-18 12:04:50 +02:00
coda v6.6-vfs.ctime 2023-08-28 09:31:32 -07:00
configfs configfs: convert to ctime accessor functions 2023-07-13 10:28:05 +02:00
cramfs v6.6-vfs.super 2023-08-28 11:04:18 -07:00
crypto
debugfs Char/Misc driver changes for 6.6-rc1 2023-09-01 09:53:54 -07:00
devpts v6.6-vfs.misc 2023-08-28 10:17:14 -07:00
dlm dlm: fix plock lookup when using multiple lockspaces 2023-08-25 10:31:39 -05:00
ecryptfs v6.6-vfs.misc 2023-08-28 10:17:14 -07:00
efivarfs efivarfs: fix statfs() on efivarfs 2023-09-11 09:10:02 +00:00
efs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
erofs erofs: allow empty device tags in flatdev mode 2023-09-19 08:39:02 +08:00
exfat for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
exportfs exportfs: remove kernel-doc warnings in exportfs 2023-08-29 17:45:22 -04:00
ext2 \n 2023-08-30 12:10:50 -07:00
ext4 Revert "ext4: switch to multigrain timestamps" 2023-09-20 18:05:31 +02:00
f2fs f2fs update for 6.6-rc1 2023-09-02 15:37:59 -07:00
fat for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
freevxfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
fscache
fuse fuse update for 6.6 2023-09-05 12:45:55 -07:00
gfs2 gfs2: Fix quota=quiet oversight 2023-09-18 16:26:24 +02:00
hfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
hfsplus for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
hostfs hostfs: convert to ctime accessor functions 2023-07-24 10:30:00 +02:00
hpfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
hugetlbfs - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") 2023-08-29 14:25:26 -07:00
iomap iomap: Spelling s/preceeding/preceding/g 2023-09-28 09:26:58 -07:00
isofs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
jbd2 Regression and bug fixes for ext4. 2023-09-17 10:33:53 -07:00
jffs2 jffs2: convert to ctime accessor functions 2023-07-24 10:30:01 +02:00
jfs A few small fixes 2023-08-31 15:25:01 -07:00
kernfs Driver core changes for 6.6-rc1 2023-09-01 09:43:18 -07:00
lockd SUNRPC: Add enum svc_auth_status 2023-08-29 17:45:22 -04:00
minix for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
netfs netfs: Only call folio_start_fscache() one time for each folio 2023-09-18 12:03:46 -07:00
nfs nfs: decrement nrequests counter before releasing the req 2023-09-28 12:22:25 -04:00
nfs_common
nfsd nfsd-6.6 fixes: 2023-09-30 09:44:48 -07:00
nilfs2 nilfs2: fix potential use after free in nilfs_gccache_submit_read_data() 2023-09-29 17:20:46 -07:00
nls nls: Hide new NLS_UCS2_UTILS 2023-08-31 12:07:34 -05:00
notify dnotify: Pass argument of fcntl_dirnotify as int 2023-07-10 14:36:12 +02:00
ntfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
ntfs3 ntfs3: put resources during ntfs_fill_super() 2023-09-25 14:12:42 +02:00
ocfs2 Many ext4 and jbd2 cleanups and bug fixes for v6.6-rc1. 2023-08-31 15:18:15 -07:00
omfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
openpromfs openpromfs: convert to ctime accessor functions 2023-07-24 10:30:03 +02:00
orangefs fs: drop the timespec64 argument from update_time 2023-08-11 09:04:57 +02:00
overlayfs ovl: fix NULL pointer defer when encoding non-decodable lower fid 2023-10-03 09:24:11 +03:00
proc proc: nommu: fix empty /proc/<pid>/maps 2023-09-19 13:21:34 -07:00
pstore pstore fix for v6.6-rc1 2023-09-02 10:45:17 -07:00
qnx4 for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
qnx6 for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
quota quota: Fix slow quotaoff 2023-10-06 11:01:23 +02:00
ramfs ramfs: convert to ctime accessor functions 2023-07-24 10:30:04 +02:00
reiserfs reiserfs: Replace 1-element array with C99 style flex-array 2023-09-11 14:07:46 +02:00
romfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
smb Six SMB3 server fixes 2023-10-08 10:10:52 -07:00
squashfs squashfs: convert to ctime accessor functions 2023-07-24 10:30:05 +02:00
sysfs
sysv for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
tracefs eventfs: Test for dentries array allocated in eventfs_release() 2023-09-30 16:26:04 -04:00
ubifs fs: drop the timespec64 argument from update_time 2023-08-11 09:04:57 +02:00
udf \n 2023-08-30 12:10:50 -07:00
ufs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
unicode
vboxsf v6.6-vfs.ctime 2023-08-28 09:31:32 -07:00
verity fsverity: skip PKCS#7 parser when keyring is empty 2023-08-20 10:33:43 -07:00
xfs xfs: abort fstrim if kernel is suspending 2023-10-04 09:25:04 +11:00
zonefs New code for 6.6: 2023-08-28 11:59:52 -07:00
aio.c aio: Annotate struct kioctx_table with __counted_by 2023-09-20 14:22:01 +02:00
anon_inodes.c
attr.c v6.6-vfs.misc 2023-08-28 10:17:14 -07:00
bad_inode.c fs: drop the timespec64 argument from update_time 2023-08-11 09:04:57 +02:00
binfmt_elf_fdpic.c fs: binfmt_elf_efpic: fix personality for ELF-FDPIC 2023-09-29 17:20:45 -07:00
binfmt_elf_test.c
binfmt_elf.c Merge branch 'expand-stack' 2023-06-28 20:35:21 -07:00
binfmt_flat.c
binfmt_misc.c fs: convert to ctime accessor functions 2023-07-13 10:28:04 +02:00
binfmt_script.c
buffer.c iomap: add a workaround for racy i_size updates on block devices 2023-09-25 08:55:00 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c v6.5/vfs.misc 2023-06-26 09:50:21 -07:00
d_path.c
dax.c mm: remove enum page_entry_size 2023-08-24 16:20:30 -07:00
dcache.c fs/dcache: Replace printk and WARN_ON by WARN 2023-08-19 13:41:11 +02:00
direct-io.c - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
drop_caches.c fs: drop_caches: draining pages before dropping caches 2023-08-18 10:12:11 -07:00
eventfd.c eventfd: prevent underflow for eventfd semaphores 2023-07-11 11:41:34 +02:00
eventpoll.c epoll: simplify ep_alloc() 2023-07-26 14:56:07 +02:00
exec.c - An extensive rework of kexec and crash Kconfig from Eric DeVolder 2023-08-29 14:53:51 -07:00
fcntl.c fcntl: Cast commands with int args explicitly 2023-07-10 14:36:11 +02:00
fhandle.c
file_table.c fs: use __fput_sync in close(2) 2023-08-08 19:36:51 +02:00
file.c v6.6-vfs.misc 2023-08-28 10:17:14 -07:00
filesystems.c
fs_context.c v6.6-vfs.misc 2023-08-28 10:17:14 -07:00
fs_parser.c
fs_pin.c
fs_struct.c kill do_each_thread() 2023-08-21 13:46:25 -07:00
fs_types.c
fs-writeback.c fs-writeback: do not requeue a clean inode having skipped pages 2023-09-20 14:22:01 +02:00
fsopen.c fs: add FSCONFIG_CMD_CREATE_EXCL 2023-08-14 18:48:02 +02:00
init.c
inode.c Revert "fs: add infrastructure for multigrain timestamps" 2023-09-20 18:05:31 +02:00
internal.h for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
ioctl.c v6.6-vfs.super 2023-08-28 11:04:18 -07:00
Kconfig for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
Kconfig.binfmt riscv: support the elf-fdpic binfmt loader 2023-08-23 14:17:43 -07:00
kernel_read_file.c fs: Fix kernel-doc warnings 2023-08-19 12:12:12 +02:00
libfs.c direct_write_fallback(): on error revert the ->ki_pos update from buffered write 2023-09-20 14:22:01 +02:00
locks.c NFSD 6.6 Release Notes 2023-08-31 15:32:18 -07:00
Makefile fs: add CONFIG_BUFFER_HEAD 2023-08-02 09:13:09 -06:00
mbcache.c
mnt_idmapping.c
mount.h
mpage.c
namei.c fs: Fix kernel-doc warnings 2023-08-19 12:12:12 +02:00
namespace.c v6.5/vfs.mount 2023-06-26 10:27:04 -07:00
nsfs.c fs: convert to ctime accessor functions 2023-07-13 10:28:04 +02:00
open.c v6.6-vfs.fchmodat2 2023-08-28 11:25:27 -07:00
pipe.c fs/pipe: remove duplicate "offset" initializer 2023-09-20 14:22:01 +02:00
pnode.c
pnode.h
posix_acl.c fs: convert to ctime accessor functions 2023-07-13 10:28:04 +02:00
proc_namespace.c
read_write.c fs: Fix one kernel-doc comment 2023-08-15 08:32:45 +02:00
readdir.c vfs: get rid of old '->iterate' directory operation 2023-08-06 15:08:35 +02:00
remap_range.c
select.c
seq_file.c
signalfd.c
splice.c - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") 2023-08-29 14:25:26 -07:00
stack.c fs: convert to ctime accessor functions 2023-07-13 10:28:04 +02:00
stat.c Revert "fs: add infrastructure for multigrain timestamps" 2023-09-20 18:05:31 +02:00
statfs.c
super.c fs: export sget_dev() 2023-08-31 12:47:15 +02:00
sync.c
sysctls.c
timerfd.c
userfaultfd.c mm: userfaultfd: remove stale comment about core dump locking 2023-08-24 16:20:27 -07:00
utimes.c
xattr.c tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs 2023-08-22 10:57:46 +02:00