linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-27 13:22:23 +00:00


Filipe Manana 3ebac17ce5 btrfs: reduce contention on log trees when logging checksums

The possibility of extents being shared (through clone and deduplication operations) requires special care when logging data checksums. A log tree must not end up with different checksum items covering overlapping ranges, because that resulted in missing checksums after replaying the log tree. Such problems were fixed in the past by the following commits:

  commit `40e046acbd` ("Btrfs: fix missing data checksums after replaying a log tree")
  commit `e289f03ea7` ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")

Test case generic/588 exercises the scenario solved by the first commit (purely sequential and deterministic), while test case generic/457 often triggered the case fixed by the second commit (not deterministic, requires specific timings under concurrency).

The problems were addressed by deleting any existing checksums from the log tree before logging the new ones, and by doing the deletion and logging of the checksums while locking the checksum range in an extent io tree (root->log_csum_range), to deal with the case where we have concurrent fsyncs against files with shared extents.

That, however, causes more contention on the leaves of a log tree where we store checksums (and on all the nodes in the paths leading to them), even when we do not have shared extents, or when all the shared extents were created by past transactions. It also adds a bit of contention on the spin lock of the log_csum_range extent io tree of the log root.

This change adds a 'last_reflink_trans' field to the inode to keep track of the last transaction where a new extent was shared between inodes (through clone and deduplication operations). It is updated for both the source and destination inodes of reflink operations whenever a new extent (created in the current transaction) becomes shared by the inodes.
This field is kept in memory only, not persisted in the inode item, similar to other existing fields (last_unlink_trans, logged_trans).

When logging checksums for an extent, if the value of 'last_reflink_trans' is smaller than the current transaction's generation/id, we skip locking the extent range and deleting checksums from the log tree, since we know we do not have new shared extents. This reduces contention on the log tree's leaves where checksums are stored.

The following script, which uses fio, was used to measure the impact of this change:

  $ cat test-fsync.sh
  #!/bin/bash

  DEV=/dev/sdk
  MNT=/mnt/sdk
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-d single -m single"

  if [ $# -ne 3 ]; then
      echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
      exit 1
  fi

  NUM_JOBS=$1
  FILE_SIZE=$2
  FSYNC_FREQ=$3

  cat <<EOF > /tmp/fio-job.ini
  [writers]
  rw=write
  fsync=$FSYNC_FREQ
  fallocate=none
  group_reporting=1
  direct=0
  bs=64k
  ioengine=sync
  size=$FILE_SIZE
  directory=$MNT
  numjobs=$NUM_JOBS
  EOF

  echo "Using config:"
  echo
  cat /tmp/fio-job.ini
  echo

  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT
  fio /tmp/fio-job.ini
  umount $MNT

The tests were performed for different numbers of jobs, file sizes and fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has 12 cores, with the CPU governor set to performance mode on all cores), 16GiB of RAM (the host has 64GiB) and an NVMe device used directly (without an intermediary filesystem in the host). While running the tests, the host was not used for anything else, to avoid disturbing the tests.

The results obtained were the following (only the last line of fio's output is shown). In this particular setup and hardware, a significant difference becomes observable starting with 16 jobs (differences highlighted below). The very small differences for tests with fewer than 16 jobs are possibly just random noise.
1 job, file size 1G, fsync frequency 1

before this change:
WRITE: bw=23.8MiB/s (24.9MB/s), 23.8MiB/s-23.8MiB/s (24.9MB/s-24.9MB/s), io=1024MiB (1074MB), run=43075-43075msec

after this change:
WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=1024MiB (1074MB), run=41938-41938msec

2 jobs, file size 1G, fsync frequency 1

before this change:
WRITE: bw=37.7MiB/s (39.5MB/s), 37.7MiB/s-37.7MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54351-54351msec

after this change:
WRITE: bw=37.7MiB/s (39.5MB/s), 37.6MiB/s-37.6MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54428-54428msec

4 jobs, file size 1G, fsync frequency 1

before this change:
WRITE: bw=67.5MiB/s (70.8MB/s), 67.5MiB/s-67.5MiB/s (70.8MB/s-70.8MB/s), io=4096MiB (4295MB), run=60669-60669msec

after this change:
WRITE: bw=68.6MiB/s (71.0MB/s), 68.6MiB/s-68.6MiB/s (71.0MB/s-71.0MB/s), io=4096MiB (4295MB), run=59678-59678msec

8 jobs, file size 1G, fsync frequency 1

before this change:
WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=8192MiB (8590MB), run=64048-64048msec

after this change:
WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=8192MiB (8590MB), run=63405-63405msec

16 jobs, file size 1G, fsync frequency 1

before this change:
WRITE: bw=78.5MiB/s (82.3MB/s), 78.5MiB/s-78.5MiB/s (82.3MB/s-82.3MB/s), io=16.0GiB (17.2GB), run=208676-208676msec

after this change:
WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=16.0GiB (17.2GB), run=149295-149295msec
(+40.1% throughput, -28.5% runtime)

32 jobs, file size 1G, fsync frequency 1

before this change:
WRITE: bw=58.8MiB/s (61.7MB/s), 58.8MiB/s-58.8MiB/s (61.7MB/s-61.7MB/s), io=32.0GiB (34.4GB), run=557134-557134msec

after this change:
WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430550-430550msec
(+29.4% throughput, -22.7% runtime)

64 jobs, file size 512M, fsync frequency 1

before this change:
WRITE: bw=65.8MiB/s (68.0MB/s), 65.8MiB/s-65.8MiB/s (68.0MB/s-68.0MB/s), io=32.0GiB (34.4GB), run=498055-498055msec

after this change:
WRITE: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=32.0GiB (34.4GB), run=385116-385116msec
(+29.3% throughput, -22.7% runtime)

128 jobs, file size 256M, fsync frequency 1

before this change:
WRITE: bw=54.7MiB/s (57.3MB/s), 54.7MiB/s-54.7MiB/s (57.3MB/s-57.3MB/s), io=32.0GiB (34.4GB), run=599373-599373msec

after this change:
WRITE: bw=121MiB/s (126MB/s), 121MiB/s-121MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=271907-271907msec
(+121.2% throughput, -54.6% runtime)

256 jobs, file size 256M, fsync frequency 1

before this change:
WRITE: bw=69.2MiB/s (72.5MB/s), 69.2MiB/s-69.2MiB/s (72.5MB/s-72.5MB/s), io=64.0GiB (68.7GB), run=947536-947536msec

after this change:
WRITE: bw=121MiB/s (127MB/s), 121MiB/s-121MiB/s (127MB/s-127MB/s), io=64.0GiB (68.7GB), run=541916-541916msec
(+74.9% throughput, -42.8% runtime)

512 jobs, file size 128M, fsync frequency 1

before this change:
WRITE: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=64.0GiB (68.7GB), run=767734-767734msec

after this change:
WRITE: bw=141MiB/s (147MB/s), 141MiB/s-141MiB/s (147MB/s-147MB/s), io=64.0GiB (68.7GB), run=466022-466022msec
(+65.1% throughput, -39.3% runtime)

1024 jobs, file size 128M, fsync frequency 1

before this change:
WRITE: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s (120MB/s-120MB/s), io=128GiB (137GB), run=1143775-1143775msec

after this change:
WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=128GiB (137GB), run=764843-764843msec
(+48.7% throughput, -33.1% runtime)

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>		2020-07-27 12:55:45 +02:00
9p	9p: read only once on O_NONBLOCK	2020-03-27 09:29:56 +00:00
adfs	docs: filesystems: fix renamed references	2020-04-20 15:45:22 -06:00
affs	docs: filesystems: fix renamed references	2020-04-20 15:45:22 -06:00
afs	afs: Fix interruption of operations	2020-07-15 15:49:04 -07:00
autofs	autofs: switch to kernel_write	2020-07-08 08:27:56 +02:00
befs
bfs	docs: filesystems: fix renamed references	2020-04-20 15:45:22 -06:00
btrfs	btrfs: reduce contention on log trees when logging checksums	2020-07-27 12:55:45 +02:00
cachefiles	cachefiles: switch to kernel_write	2020-07-08 08:27:56 +02:00
ceph	ceph: skip checking caps when session reconnecting and releasing reqs	2020-06-01 13:22:53 +02:00
cifs	Revert "cifs: Fix the target file was deleted when rename failed."	2020-07-23 15:44:11 -05:00
coda	docs: filesystems: convert coda.txt to ReST	2020-05-05 09:22:21 -06:00
configfs	A fair amount of stuff this time around, dominated by yet another massive	2020-06-01 15:45:27 -07:00
cramfs	docs: filesystems: fix renamed references	2020-04-20 15:45:22 -06:00
crypto	fscrypt updates for 5.8	2020-06-01 12:10:17 -07:00
debugfs	Merge 5.7-rc3 into driver-core-next	2020-04-27 09:34:55 +02:00
devpts
dlm	dlm for 5.8	2020-06-05 16:43:16 -07:00
ecryptfs	A fair amount of stuff this time around, dominated by yet another massive	2020-06-01 15:45:27 -07:00
efivarfs	efi/efivars: Expose RT service availability via efivars abstraction	2020-07-09 10:14:29 +03:00
efs
erofs	erofs: fix partially uninitialized misuse in z_erofs_onlinepage_fixup	2020-06-24 09:47:44 +08:00
exfat	exfat: fix name_hash computation on big endian systems	2020-07-21 10:44:19 +09:00
exportfs
ext2	mmap locking API: convert mmap_sem comments	2020-06-09 09:39:14 -07:00
ext4	This is the second round of ext4 commits for 5.8 merge window. It	2020-06-15 09:32:10 -07:00
f2fs	f2fs-for-5.8-rc1	2020-06-09 11:28:59 -07:00
fat	fat: improve the readahead for FAT entries	2020-06-04 19:06:25 -07:00
freevxfs
fscache	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next	2020-06-03 16:27:18 -07:00
fuse	fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS	2020-07-15 14:18:20 +02:00
gfs2	gfs2: Rework read and page fault locking	2020-07-07 23:40:12 +02:00
hfs	for-5.8/block-2020-06-01	2020-06-02 15:29:19 -07:00
hfsplus	block: remove the error_sector argument to blkdev_issue_flush	2020-05-22 08:45:46 -06:00
hostfs	hostfs: Use kasprintf() instead of fixed buffer formatting	2020-03-29 23:23:00 +02:00
hpfs	hpfs: fix warning due to superfluous semicolon	2020-06-06 10:08:17 -07:00
hugetlbfs	mmap locking API: convert mmap_sem API comments	2020-06-09 09:39:14 -07:00
iomap	New code for 5.8:	2020-06-13 12:44:30 -07:00
isofs	for-5.8/block-2020-06-01	2020-06-02 15:29:19 -07:00
jbd2	This is the second round of ext4 commits for 5.8 merge window. It	2020-06-15 09:32:10 -07:00
jffs2	jffs2: Replace zero-length array with flexible-array	2020-06-15 23:08:31 -05:00
jfs	Replace zero-length array in JFS	2020-06-02 20:11:35 -07:00
kernfs	mmap locking API: convert mmap_sem comments	2020-06-09 09:39:14 -07:00
lockd
minix
nfs	SUNRPC reverting `d03727b248` ("NFSv4 fix CLOSE not waiting for direct IO compeletion")	2020-07-17 14:47:38 -04:00
nfs_common
nfsd	nfsd4: fix NULL dereference in nfsd/clients display code	2020-07-22 16:47:14 -04:00
nilfs2	nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()	2020-06-10 19:14:17 -07:00
nls	treewide: replace '---help---' in Kconfig files with 'help'	2020-06-14 01:57:21 +09:00
notify	treewide: replace '---help---' in Kconfig files with 'help'	2020-06-14 01:57:21 +09:00
ntfs	Merge branch 'akpm' (patches from Andrew)	2020-06-02 12:21:36 -07:00
ocfs2	ocfs2: fix value of OCFS2_INVALID_SLOT	2020-06-26 00:27:37 -07:00
omfs	fs: convert mpage_readpages to mpage_readahead	2020-06-02 10:59:07 -07:00
openpromfs
orangefs	orangefs: a conversion and a cleanup...	2020-06-05 16:44:36 -07:00
overlayfs	ovl: fix lookup of indexed hardlinks with metacopy	2020-07-16 07:24:47 +02:00
proc	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-07-03 23:20:14 -07:00
pstore	Merge branch 'uaccess.__copy_from_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-06-01 16:18:46 -07:00
qnx4
qnx6	fs: convert mpage_readpages to mpage_readahead	2020-06-02 10:59:07 -07:00
quota	sysctl: pass kernel pointers to ->proc_handler	2020-04-27 02:07:40 -04:00
ramfs
reiserfs		2020-06-04 13:53:10 -07:00
romfs	treewide: replace '---help---' in Kconfig files with 'help'	2020-06-14 01:57:21 +09:00
squashfs	squashfs: fix length field overlap check in metadata reading	2020-07-24 12:42:41 -07:00
sysfs	RDMA 5.8 merge window pull request	2020-06-05 14:05:57 -07:00
sysv	docs: filesystems: fix renamed references	2020-04-20 15:45:22 -06:00
tracefs
ubifs	mm: remove the pgprot argument to __vmalloc	2020-06-02 10:59:11 -07:00
udf	for-5.8/block-2020-06-01	2020-06-02 15:29:19 -07:00
ufs
unicode	.gitignore: add SPDX License Identifier	2020-03-25 11:50:48 +01:00
vboxsf	vboxsf: don't use the source name in the bdi name	2020-05-07 08:45:47 -06:00
verity	fs-verity: remove unnecessary extern keywords	2020-05-12 16:44:00 -07:00
xfs	xfs: fix use-after-free on CIL context on shutdown	2020-06-22 19:22:57 -07:00
zonefs	zonefs: count pages after truncating the iterator	2020-07-20 17:59:31 +09:00
aio.c	aio: Replace zero-length array with flexible-array	2020-06-15 23:08:25 -05:00
anon_inodes.c
attr.c
bad_inode.c	fs: move the fiemap definitions out of fs.h	2020-06-03 23:16:55 -04:00
binfmt_aout.c	exec: Rename flush_old_exec begin_new_exec	2020-05-07 16:55:47 -05:00
binfmt_elf_fdpic.c	Merge branch 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-06-10 16:02:54 -07:00
binfmt_elf.c	Merge branch 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-06-10 16:02:54 -07:00
binfmt_em86.c	Merge branch 'akpm' (patches from Andrew)	2020-06-04 19:18:29 -07:00
binfmt_flat.c	Merge branch 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-06-10 16:02:54 -07:00
binfmt_misc.c	Merge branch 'akpm' (patches from Andrew)	2020-06-04 19:18:29 -07:00
binfmt_script.c	Merge branch 'akpm' (patches from Andrew)	2020-06-04 19:18:29 -07:00
block_dev.c	block: make function 'kill_bdev' static	2020-06-18 09:24:35 -06:00
buffer.c	fs/buffer.c: use attach/detach_page_private	2020-06-02 10:59:07 -07:00
char_dev.c	vfs: allow unprivileged whiteout creation	2020-05-14 16:44:23 +02:00
compat_binfmt_elf.c	Split the old READ_IMPLIES_EXEC workaround from executable PT_GNU_STACK	2020-06-05 13:45:21 -07:00
compat.c
coredump.c	mmap locking API: convert mmap_sem comments	2020-06-09 09:39:14 -07:00
d_path.c
dax.c	dax,iomap: Add helper dax_iomap_zero() to zero a range	2020-04-02 19:15:03 -07:00
dcache.c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next	2020-06-03 16:27:18 -07:00
dcookies.c
direct-io.c	for-5.8-part2-tag	2020-06-14 09:47:25 -07:00
drop_caches.c	sysctl: pass kernel pointers to ->proc_handler	2020-04-27 02:07:40 -04:00
eventfd.c	eventfd: convert to f_op->read_iter()	2020-05-06 22:33:43 -04:00
eventpoll.c	epoll: call final ep_events_available() check under the lock	2020-05-14 10:00:35 -07:00
exec.c	mmap locking API: convert mmap_sem comments	2020-06-09 09:39:14 -07:00
fcntl.c
fhandle.c
file_table.c	Revert "fs: Do not check if there is a fsnotify watcher on pseudo inodes"	2020-06-29 09:40:55 -07:00
file.c	fix multiplication overflow in copy_fdtable()	2020-05-19 18:29:36 -04:00
filesystems.c	fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once()	2020-04-10 15:36:22 -07:00
fs_context.c	vfs: don't parse "silent" option	2020-05-14 16:44:25 +02:00
fs_parser.c	fs_parse: remove pr_notice() about each validation	2020-04-02 09:35:26 -07:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c	A lot of bug fixes and cleanups for ext4, including:	2020-06-05 16:19:28 -07:00
fsopen.c
inode.c	AFS Changes	2020-06-05 16:26:36 -07:00
internal.h	A lot of bug fixes and cleanups for ext4, including:	2020-06-05 16:19:28 -07:00
io_uring.c	io_uring: missed req_init_async() for IOSQE_ASYNC	2020-07-23 11:20:55 -06:00
io-wq.c	io_uring: cancel all task's requests on exit	2020-06-15 08:51:34 -06:00
io-wq.h	io_uring: cancel by ->task not pid	2020-06-15 08:51:38 -06:00
ioctl.c	fs: remove the access_ok() check in ioctl_fiemap	2020-06-03 23:16:55 -04:00
Kconfig	treewide: replace '---help---' in Kconfig files with 'help'	2020-06-14 01:57:21 +09:00
Kconfig.binfmt	treewide: replace '---help---' in Kconfig files with 'help'	2020-06-14 01:57:21 +09:00
libfs.c	block: remove the error_sector argument to blkdev_issue_flush	2020-05-22 08:45:46 -06:00
locks.c	Highlights:	2020-06-11 10:33:13 -07:00
Makefile
mbcache.c
mount.h	proc/mounts: add cursor	2020-05-14 16:44:24 +02:00
mpage.c	fs: convert mpage_readpages to mpage_readahead	2020-06-02 10:59:07 -07:00
namei.c	vfs: clean up posix_acl_permission() logic aroudn MAY_NOT_BLOCK	2020-06-08 11:04:19 -07:00
namespace.c	fuse: reject options on reconfigure via fsconfig(2)	2020-07-14 14:45:41 +02:00
no-block.c
nsfs.c	nsproxy: attach to namespaces via pidfds	2020-05-13 11:41:22 +02:00
open.c	Merge branch 'akpm' (patches from Andrew)	2020-06-02 12:21:36 -07:00
pipe.c	Notifications over pipes + Keyring notifications	2020-06-13 09:56:21 -07:00
pnode.c	propagate_one(): mnt_set_mountpoint() needs mount_lock	2020-04-27 10:37:14 -04:00
pnode.h
posix_acl.c	vfs: clean up posix_acl_permission() logic aroudn MAY_NOT_BLOCK	2020-06-08 11:04:19 -07:00
proc_namespace.c	Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2020-06-04 13:54:34 -07:00
read_write.c	fs: remove __vfs_read	2020-07-08 08:27:57 +02:00
readdir.c	readdir.c: get rid of the last __put_user(), drop now-useless access_ok()	2020-05-01 20:29:54 -04:00
select.c	pselect6() and friends: take handling the combined 6th/7th args into helper	2020-05-29 19:10:42 -04:00
seq_file.c	fs/seq_file.c: seq_read: Update pr_info_ratelimited	2020-06-04 19:06:25 -07:00
signalfd.c
splice.c	Notifications over pipes + Keyring notifications	2020-06-13 09:56:21 -07:00
stack.c
stat.c	New code for 5.8:	2020-06-02 19:45:12 -07:00
statfs.c
super.c	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-06-10 16:09:11 -07:00
sync.c	overlayfs update for 5.8	2020-06-09 15:40:50 -07:00
timerfd.c
userfaultfd.c	mmap locking API: convert mmap_sem comments	2020-06-09 09:39:14 -07:00
utimes.c	utimensat: AT_EMPTY_PATH support	2020-05-14 16:44:24 +02:00
xattr.c	xattr: fix uninitialized out-param	2020-04-09 15:33:09 -04:00