linux/fs/btrfs
Filipe Manana cd9253c23a btrfs: fix race between direct IO write and fsync when using same fd
If we have 2 threads that are using the same file descriptor and one of
them is doing direct IO writes while the other is doing fsync, we have a
race where we can end up either:

1) Attempt a fsync without holding the inode's lock, triggering an
   assertion failures when assertions are enabled;

2) Do an invalid memory access from the fsync task because the file private
   points to memory allocated on stack by the direct IO task and it may be
   used by the fsync task after the stack was destroyed.

The race happens like this:

1) A user space program opens a file descriptor with O_DIRECT;

2) The program spawns 2 threads using libpthread for example;

3) One of the threads uses the file descriptor to do direct IO writes,
   while the other calls fsync using the same file descriptor.

4) Call task A the thread doing direct IO writes and task B the thread
   doing fsyncs;

5) Task A does a direct IO write, and at btrfs_direct_write() sets the
   file's private to an on stack allocated private with the member
   'fsync_skip_inode_lock' set to true;

6) Task B enters btrfs_sync_file() and sees that there's a private
   structure associated to the file which has 'fsync_skip_inode_lock' set
   to true, so it skips locking the inode's VFS lock;

7) Task A completes the direct IO write, and resets the file's private to
   NULL since it had no prior private and our private was stack allocated.
   Then it unlocks the inode's VFS lock;

8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
   assertion that checks the inode's VFS lock is held fails, since task B
   never locked it and task A has already unlocked it.

The stack trace produced is the following:

   assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
   ------------[ cut here ]------------
   kernel BUG at fs/btrfs/ordered-data.c:983!
   Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
   CPU: 9 PID: 5072 Comm: worker Tainted: G     U     OE      6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
   Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
   RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
   Code: 50 d6 86 c0 e8 (...)
   RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
   RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
   RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
   RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
   R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
   R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
   FS:  00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
   Call Trace:
    <TASK>
    ? __die_body.cold+0x14/0x24
    ? die+0x2e/0x50
    ? do_trap+0xca/0x110
    ? do_error_trap+0x6a/0x90
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? exc_invalid_op+0x50/0x70
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? asm_exc_invalid_op+0x1a/0x20
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? __seccomp_filter+0x31d/0x4f0
    __x64_sys_fdatasync+0x4f/0x90
    do_syscall_64+0x82/0x160
    ? do_futex+0xcb/0x190
    ? __x64_sys_futex+0x10e/0x1d0
    ? switch_fpu_return+0x4f/0xd0
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Another problem here is if task B grabs the private pointer and then uses
it after task A has finished, since the private was allocated in the stack
of task A, it results in some invalid memory access with a hard to predict
result.

This issue, triggering the assertion, was observed with QEMU workloads by
two users in the Link tags below.

Fix this by not relying on a file's private to pass information to fsync
that it should skip locking the inode and instead pass this information
through a special value stored in current->journal_info. This is safe
because in the relevant section of the direct IO write path we are not
holding a transaction handle, so current->journal_info is NULL.

The following C program triggers the issue:

   $ cat repro.c
   /* Get the O_DIRECT definition. */
   #ifndef _GNU_SOURCE
   #define _GNU_SOURCE
   #endif

   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>
   #include <stdint.h>
   #include <fcntl.h>
   #include <errno.h>
   #include <string.h>
   #include <pthread.h>

   static int fd;

   static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
   {
       while (count > 0) {
           ssize_t ret;

           ret = pwrite(fd, buf, count, offset);
           if (ret < 0) {
               if (errno == EINTR)
                   continue;
               return ret;
           }
           count -= ret;
           buf += ret;
       }
       return 0;
   }

   static void *fsync_loop(void *arg)
   {
       while (1) {
           int ret;

           ret = fsync(fd);
           if (ret != 0) {
               perror("Fsync failed");
               exit(6);
           }
       }
   }

   int main(int argc, char *argv[])
   {
       long pagesize;
       void *write_buf;
       pthread_t fsyncer;
       int ret;

       if (argc != 2) {
           fprintf(stderr, "Use: %s <file path>\n", argv[0]);
           return 1;
       }

       fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
       if (fd == -1) {
           perror("Failed to open/create file");
           return 1;
       }

       pagesize = sysconf(_SC_PAGE_SIZE);
       if (pagesize == -1) {
           perror("Failed to get page size");
           return 2;
       }

       ret = posix_memalign(&write_buf, pagesize, pagesize);
       if (ret) {
           perror("Failed to allocate buffer");
           return 3;
       }

       ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
       if (ret != 0) {
           fprintf(stderr, "Failed to create writer thread: %d\n", ret);
           return 4;
       }

       while (1) {
           ret = do_write(fd, write_buf, pagesize, 0);
           if (ret != 0) {
               perror("Write failed");
               exit(5);
           }
       }

       return 0;
   }

   $ mkfs.btrfs -f /dev/sdi
   $ mount /dev/sdi /mnt/sdi
   $ timeout 10 ./repro /mnt/sdi/foo

Usually the race is triggered within less than 1 second. A test case for
fstests will follow soon.

Reported-by: Paulo Dias <paulo.miguel.dias@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <jahn-andi@web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: syzbot+4704b3cc972bd76024f1@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
Fixes: 939b656bc8 ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-03 20:29:55 +02:00
..
tests btrfs: fix corrupt read due to bad offset of a compressed extent map 2024-07-25 23:54:06 +02:00
accessors.c btrfs: remove unused included headers 2024-03-04 16:24:46 +01:00
accessors.h btrfs: remove raid-stripe-tree encoding field from stripe_extent 2024-07-11 15:33:28 +02:00
acl.c btrfs: remove unused included headers 2024-03-04 16:24:46 +01:00
acl.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
async-thread.c btrfs: remove unused included headers 2024-03-04 16:24:46 +01:00
async-thread.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
backref.c btrfs: change root->root_key.objectid to btrfs_root_id() 2024-05-07 21:31:06 +02:00
backref.h btrfs: uninline some static inline helpers from backref.h 2024-03-04 16:24:53 +01:00
bio.c btrfs: fix a use-after-free when hitting errors inside btrfs_submit_chunk() 2024-08-27 01:34:08 +02:00
bio.h btrfs: add forward declarations and headers, part 2 2024-03-04 16:24:49 +01:00
block-group.c btrfs: zoned: fix zone_unusable accounting on making block group read-write again 2024-07-29 19:21:19 +02:00
block-group.h btrfs: switch btrfs_block_group::inode to struct btrfs_inode 2024-07-11 15:33:28 +02:00
block-rsv.c btrfs: change root->root_key.objectid to btrfs_root_id() 2024-05-07 21:31:06 +02:00
block-rsv.h btrfs: add forward declarations and headers, part 2 2024-03-04 16:24:49 +01:00
btrfs_inode.h btrfs: move the direct IO code into its own file 2024-07-11 15:33:29 +02:00
compression.c btrfs: fix extent map use-after-free when adding pages to compressed bio 2024-07-11 16:32:22 +02:00
compression.h btrfs: pass a btrfs_inode to btrfs_compress_heuristic() 2024-07-11 15:33:28 +02:00
ctree.c btrfs: fix data race when accessing the last_trans field of a root 2024-07-11 15:52:25 +02:00
ctree.h btrfs: fix race between direct IO write and fsync when using same fd 2024-09-03 20:29:55 +02:00
defrag.c btrfs: fix data race when accessing the last_trans field of a root 2024-07-11 15:52:25 +02:00
defrag.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
delalloc-space.c btrfs: constify pointer parameters where applicable 2024-07-11 15:33:22 +02:00
delalloc-space.h btrfs: constify pointer parameters where applicable 2024-07-11 15:33:22 +02:00
delayed-inode.c btrfs: pass a btrfs_inode to btrfs_readdir_get_delayed_items() 2024-07-11 15:33:28 +02:00
delayed-inode.h btrfs: pass a btrfs_inode to btrfs_readdir_get_delayed_items() 2024-07-11 15:33:28 +02:00
delayed-ref.c btrfs: check delayed refs when we're checking if a ref exists 2024-08-13 13:42:26 +02:00
delayed-ref.h btrfs: check delayed refs when we're checking if a ref exists 2024-08-13 13:42:26 +02:00
dev-replace.c btrfs: simplify range parameters of btrfs_wait_ordered_roots() 2024-07-11 15:33:19 +02:00
dev-replace.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
dir-item.c btrfs: constify pointer parameters where applicable 2024-07-11 15:33:22 +02:00
dir-item.h btrfs: constify pointer parameters where applicable 2024-07-11 15:33:22 +02:00
direct-io.c btrfs: fix race between direct IO write and fsync when using same fd 2024-09-03 20:29:55 +02:00
direct-io.h btrfs: move the direct IO code into its own file 2024-07-11 15:33:29 +02:00
discard.c btrfs: unexport btrfs_run_discard_work and make it static 2023-06-19 13:59:25 +02:00
discard.h btrfs: unexport btrfs_run_discard_work and make it static 2023-06-19 13:59:25 +02:00
disk-io.c for-6.11-tag 2024-07-17 12:38:04 -07:00
disk-io.h btrfs: constify pointer parameters where applicable 2024-07-11 15:33:22 +02:00
export.c btrfs: remove super block argument from btrfs_iget() 2024-07-11 15:33:25 +02:00
export.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
extent_io.c btrfs: fix invalid mapping of extent xarray state 2024-08-13 15:36:57 +02:00
extent_io.h btrfs: move extent_range_clear_dirty_for_io() into inode.c 2024-07-11 15:52:25 +02:00
extent_map.c btrfs: only run the extent map shrinker from kswapd tasks 2024-08-13 13:43:28 +02:00
extent_map.h btrfs: do not directly include rwlock_types.h 2024-07-11 15:33:22 +02:00
extent-io-tree.c btrfs: preallocate ulist memory for qgroup rsv 2024-07-11 15:33:26 +02:00
extent-io-tree.h btrfs: add forward declarations and headers, part 2 2024-03-04 16:24:49 +01:00
extent-tree.c btrfs: check delayed refs when we're checking if a ref exists 2024-08-13 13:42:26 +02:00
extent-tree.h btrfs: do not BUG_ON() when freeing tree block after error 2024-07-11 15:33:26 +02:00
fiemap.c btrfs: initialize last_extent_end to fix -Wmaybe-uninitialized warning in extent_fiemap() 2024-08-26 16:58:13 +02:00
fiemap.h btrfs: move fiemap code into its own file 2024-07-11 15:33:20 +02:00
file-item.c btrfs: introduce new "rescue=ignoremetacsums" mount option 2024-07-11 15:33:29 +02:00
file-item.h btrfs: remove search_commit parameter from btrfs_lookup_csums_list() 2024-05-07 21:31:03 +02:00
file.c btrfs: fix race between direct IO write and fsync when using same fd 2024-09-03 20:29:55 +02:00
file.h btrfs: move the direct IO code into its own file 2024-07-11 15:33:29 +02:00
free-space-cache.c btrfs: zoned: properly take lock to read/update block group's zoned variables 2024-08-15 20:35:56 +02:00
free-space-cache.h btrfs: add forward declarations and headers, part 2 2024-03-04 16:24:49 +01:00
free-space-tree.c btrfs: do not BUG_ON() when freeing tree block after error 2024-07-11 15:33:26 +02:00
free-space-tree.h btrfs: add forward declarations and headers, part 2 2024-03-04 16:24:49 +01:00
fs.c btrfs: sysfs: update fs features directory asynchronously 2023-02-13 17:50:35 +01:00
fs.h for-6.11-tag 2024-07-19 14:34:52 -07:00
inode-item.c btrfs: abort transaction if we don't find extref in btrfs_del_inode_extref() 2024-07-11 15:33:27 +02:00
inode-item.h btrfs: add forward declarations and headers, part 2 2024-03-04 16:24:49 +01:00
inode.c btrfs: update target inode's ctime on unlink 2024-08-15 20:35:44 +02:00
ioctl.c for-6.11-tag 2024-07-17 12:38:04 -07:00
ioctl.h btrfs: constify pointer parameters where applicable 2024-07-11 15:33:22 +02:00
Kconfig btrfs: check-integrity: remove CONFIG_BTRFS_FS_CHECK_INTEGRITY option 2023-10-12 16:44:05 +02:00
locking.c btrfs: change root->root_key.objectid to btrfs_root_id() 2024-05-07 21:31:06 +02:00
locking.h btrfs: cleanup recursive include of the same header 2024-07-11 15:33:22 +02:00
lru_cache.c btrfs: fix typos found by codespell 2023-12-15 23:00:04 +01:00
lru_cache.h btrfs: cleanup recursive include of the same header 2024-07-11 15:33:22 +02:00
lzo.c btrfs: enhance compression error messages 2024-07-11 15:52:25 +02:00
Makefile btrfs: move the direct IO code into its own file 2024-07-11 15:33:29 +02:00
messages.c btrfs: introduce new "rescue=ignoremetacsums" mount option 2024-07-11 15:33:29 +02:00
messages.h btrfs: constify fs_info parameter in __btrfs_panic() 2023-12-15 20:27:02 +01:00
misc.h btrfs: constify pointer parameters where applicable 2024-07-11 15:33:22 +02:00
ordered-data.c btrfs: switch btrfs_ordered_extent::inode to struct btrfs_inode 2024-07-11 15:33:28 +02:00
ordered-data.h btrfs: switch btrfs_ordered_extent::inode to struct btrfs_inode 2024-07-11 15:33:28 +02:00
orphan.c btrfs: remove unused included headers 2024-03-04 16:24:46 +01:00
orphan.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
print-tree.c btrfs: avoid using fixed char array size for tree names 2024-08-02 22:44:27 +02:00
print-tree.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
props.c btrfs: pass a btrfs_inode to btrfs_set_prop() 2024-07-11 15:33:29 +02:00
props.h btrfs: pass a btrfs_inode to btrfs_set_prop() 2024-07-11 15:33:29 +02:00
qgroup.c btrfs: qgroup: don't use extent changeset when not needed 2024-09-02 20:18:08 +02:00
qgroup.h btrfs: qgroup: preallocate memory before adding a relation 2024-07-11 15:33:27 +02:00
raid-stripe-tree.c btrfs: remove raid-stripe-tree encoding field from stripe_extent 2024-07-11 15:33:28 +02:00
raid-stripe-tree.h btrfs: remove raid-stripe-tree encoding field from stripe_extent 2024-07-11 15:33:28 +02:00
raid56.c btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array() 2024-07-11 15:33:30 +02:00
raid56.h btrfs: add forward declarations and headers, part 2 2024-03-04 16:24:49 +01:00
rcu-string.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
ref-verify.c btrfs: fix uninitialized return value in the ref-verify tool 2024-07-02 19:14:57 +02:00
ref-verify.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
reflink.c btrfs: pass a btrfs_inode to btrfs_wait_ordered_range() 2024-07-11 15:33:18 +02:00
reflink.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
relocation.c - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN 2024-07-21 17:15:46 -07:00
relocation.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
root-tree.c btrfs: change root->root_key.objectid to btrfs_root_id() 2024-05-07 21:31:06 +02:00
root-tree.h btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations 2024-04-02 19:18:23 +02:00
scrub.c btrfs: scrub: update last_physical after scrubbing one stripe 2024-08-01 17:15:07 +02:00
scrub.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
send.c btrfs: send: annotate struct name_cache_entry with __counted_by() 2024-08-15 20:35:32 +02:00
send.h btrfs: pass a btrfs_inode to btrfs_ioctl_send() 2024-07-11 15:33:28 +02:00
space-info.c btrfs: fix uninitialized return value from btrfs_reclaim_sweep() 2024-08-27 16:42:09 +02:00
space-info.h btrfs: fix uninitialized return value from btrfs_reclaim_sweep() 2024-08-27 16:42:09 +02:00
subpage.c btrfs: pass a btrfs_inode to is_data_inode() 2024-07-11 15:33:28 +02:00
subpage.h btrfs: subpage: introduce helpers to handle subpage delalloc locking 2024-07-11 15:33:22 +02:00
super.c btrfs: only enable extent map shrinker for DEBUG builds 2024-08-16 21:22:39 +02:00
super.h btrfs: change BTRFS_MOUNT_* flags to 64bit type 2024-07-19 17:20:23 +02:00
sysfs.c btrfs: introduce new "rescue=ignoresuperflags" mount option 2024-07-11 15:33:30 +02:00
sysfs.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
transaction.c btrfs: fix data race when accessing the last_trans field of a root 2024-07-11 15:52:25 +02:00
transaction.h btrfs: fix race between direct IO write and fsync when using same fd 2024-09-03 20:29:55 +02:00
tree-checker.c btrfs: tree-checker: add dev extent item checks 2024-08-15 20:35:52 +02:00
tree-checker.h btrfs: make sure that WRITTEN is set on all metadata blocks 2024-05-02 22:11:13 +02:00
tree-log.c btrfs: remove super block argument from btrfs_iget() 2024-07-11 15:33:25 +02:00
tree-log.h btrfs: avoid transaction commit on any fsync after subvolume creation 2024-07-11 15:33:24 +02:00
tree-mod-log.c btrfs: change root->root_key.objectid to btrfs_root_id() 2024-05-07 21:31:06 +02:00
tree-mod-log.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
ulist.c btrfs: preallocate ulist memory for qgroup rsv 2024-07-11 15:33:26 +02:00
ulist.h btrfs: preallocate ulist memory for qgroup rsv 2024-07-11 15:33:26 +02:00
uuid-tree.c btrfs: constify pointer parameters where applicable 2024-07-11 15:33:22 +02:00
uuid-tree.h btrfs: constify pointer parameters where applicable 2024-07-11 15:33:22 +02:00
verity.c btrfs: remove unused included headers 2024-03-04 16:24:46 +01:00
verity.h btrfs: add forward declarations and headers, part 1 2024-03-04 16:24:49 +01:00
volumes.c btrfs: abort transaction on errors in btrfs_free_chunk() 2024-07-11 15:33:27 +02:00
volumes.h btrfs: constify pointer parameters where applicable 2024-07-11 15:33:22 +02:00
xattr.c btrfs: pass a btrfs_inode to btrfs_set_prop() 2024-07-11 15:33:29 +02:00
xattr.h btrfs: constify pointer parameters where applicable 2024-07-11 15:33:22 +02:00
zlib.c btrfs: enhance compression error messages 2024-07-11 15:52:25 +02:00
zoned.c btrfs: zoned: handle broken write pointer on zones 2024-09-02 23:39:34 +02:00
zoned.h btrfs: change BTRFS_MOUNT_* flags to 64bit type 2024-07-19 17:20:23 +02:00
zstd.c btrfs: enhance compression error messages 2024-07-11 15:52:25 +02:00