linux/fs/btrfs
Filipe Manana 609e804d77 Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes
When we are mixing buffered writes with direct IO writes against the same
file and snapshotting is happening concurrently, we can end up with
corrupt file content in the snapshot. Example:

1) Inode/file is empty.

2) Snapshotting starts.

3) Buffered write at offset 0 length 256Kb. This updates the i_size of the
   inode to 256Kb, disk_i_size remains zero. This happens after the task
   doing the snapshot flushes all existing delalloc.

4) DIO write at offset 256Kb length 768Kb. Once the ordered extent
   completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and
   updates the inode item in the fs tree with a size of 1Mb (which is
   the value of disk_i_size).

5) The delalloc for the range [0, 256Kb[ did not start yet.

6) The transaction used in the DIO ordered extent completion, which updated
   the inode item, is committed by the snapshotting task.

7) Snapshot creation completes.

8) Delalloc for the range [0, 256Kb[ is flushed.

After that when reading the file from the snapshot we always get zeroes for
the range [0, 256Kb[, the file has a size of 1Mb and the data written by
the direct IO write is found. From an application's point of view this is
a corruption, since in the source subvolume it could never read a version
of the file that included the data from the direct IO write without the
data from the buffered write included as well. In the snapshot's tree,
file extent items are missing for the range [0, 256Kb[.

The issue, obviously, does not happen when using the -o flushoncommit
mount option.
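
The sequence above can be replayed with a small userspace model. This is
only a sketch under simplifying assumptions: every struct and function
name below is hypothetical and none of it is kernel code. It tracks
i_size, disk_i_size and a list of file extent items, and treats the
snapshot as a copy of the tree taken at commit time. Passing true for
flush_before_commit models delalloc being flushed before the commit, as
with flushoncommit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KIB 1024ULL
#define MAX_EXTENTS 4

struct toy_extent { uint64_t start, len; };
struct toy_hole { uint64_t start, len; };

/* Toy view of what one tree stores for the file (not the kernel's structs). */
struct toy_tree {
    uint64_t disk_i_size;               /* size recorded in the inode item */
    struct toy_extent ext[MAX_EXTENTS]; /* file extent items */
    int nr_ext;
};

struct toy_inode {
    uint64_t i_size;                    /* in-memory size */
    struct toy_extent delalloc;         /* dirty buffered range, if any */
    bool has_delalloc;
    struct toy_tree tree;               /* state in the source subvolume */
};

/* Insert a file extent item and raise disk_i_size to cover it. */
static void add_extent(struct toy_tree *t, uint64_t start, uint64_t len)
{
    assert(t->nr_ext < MAX_EXTENTS);
    t->ext[t->nr_ext++] = (struct toy_extent){ start, len };
    if (start + len > t->disk_i_size)
        t->disk_i_size = start + len;
}

/* Buffered write: only in-memory state changes until delalloc is flushed. */
static void buffered_write(struct toy_inode *in, uint64_t off, uint64_t len)
{
    in->delalloc = (struct toy_extent){ off, len };
    in->has_delalloc = true;
    if (off + len > in->i_size)
        in->i_size = off + len;
}

/* DIO write: modeled as an ordered extent that completes immediately. */
static void dio_write(struct toy_inode *in, uint64_t off, uint64_t len)
{
    if (off + len > in->i_size)
        in->i_size = off + len;
    add_extent(&in->tree, off, len);
}

static void flush_delalloc(struct toy_inode *in)
{
    if (in->has_delalloc) {
        add_extent(&in->tree, in->delalloc.start, in->delalloc.len);
        in->has_delalloc = false;
    }
}

/* First byte range below disk_i_size that has no file extent item. */
struct toy_hole toy_first_hole(struct toy_tree t)
{
    uint64_t pos = 0;

    while (pos < t.disk_i_size) {
        uint64_t next = t.disk_i_size;  /* nearest extent start after pos */
        bool covered = false;

        for (int i = 0; i < t.nr_ext; i++) {
            uint64_t s = t.ext[i].start, e = s + t.ext[i].len;

            if (s <= pos && pos < e) {
                pos = e;
                covered = true;
                break;
            }
            if (s > pos && s < next)
                next = s;
        }
        if (!covered)
            return (struct toy_hole){ pos, next - pos };
    }
    return (struct toy_hole){ 0, 0 };
}

/* Replays the example and returns the tree as seen by the snapshot. */
struct toy_tree toy_race_snapshot(bool flush_before_commit)
{
    struct toy_inode in = { 0 };
    struct toy_tree snap;

    buffered_write(&in, 0, 256 * KIB);
    dio_write(&in, 256 * KIB, 768 * KIB);
    if (flush_before_commit)
        flush_delalloc(&in);
    snap = in.tree;                     /* commit + snapshot creation */
    flush_delalloc(&in);                /* too late for the snapshot */
    return snap;
}
```

In this model, toy_first_hole(toy_race_snapshot(false)) describes a hole
covering [0, 256Kb[ under a recorded size of 1Mb, while
toy_race_snapshot(true) yields a tree with no hole.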

Fix this by flushing delalloc for all the roots that are about to be
snapshotted when committing a transaction. This guarantees total ordering
when updating the disk_i_size of an inode since the flush for delalloc is
done when a transaction is in the TRANS_STATE_COMMIT_START state and the
wait is done once no more external writers exist. This is similar to what
we do when using the flushoncommit mount option, but we do it only if the
transaction has snapshots to create and only for the roots of the
subvolumes to be snapshotted. The bulk of the delalloc is flushed in the
snapshot creation ioctl, so the flush work we do inside the transaction
is minimized.
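
The difference in scope between this fix and flushoncommit can be
sketched as follows. This is an illustrative userspace model, not the
code in transaction.c: struct toy_root, toy_commit_flush() and
toy_demo_flushed() are hypothetical names, and a "flush" here just zeroes
a byte counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a subvolume root. */
struct toy_root {
    uint64_t id;
    uint64_t delalloc_bytes;    /* dirty buffered data not yet written */
    bool pending_snapshot;      /* root is about to be snapshotted */
};

/* Write out everything dirty in one root, return how much was flushed. */
static uint64_t toy_flush_root(struct toy_root *r)
{
    uint64_t n = r->delalloc_bytes;

    r->delalloc_bytes = 0;
    return n;
}

/*
 * Flush step at commit start. With flushoncommit every root is flushed,
 * otherwise only roots that have a snapshot pending in this transaction.
 */
uint64_t toy_commit_flush(struct toy_root *roots, size_t n, bool flushoncommit)
{
    uint64_t flushed = 0;

    assert(roots != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (flushoncommit || roots[i].pending_snapshot)
            flushed += toy_flush_root(&roots[i]);
    return flushed;
}

/* Two roots: 256 is being snapshotted, 257 is an unrelated subvolume. */
uint64_t toy_demo_flushed(bool flushoncommit)
{
    struct toy_root roots[] = {
        { 256, 4096, true },
        { 257, 8192, false },
    };

    return toy_commit_flush(roots, 2, flushoncommit);
}
```

Without flushoncommit only the 4096 bytes of root 256 are flushed and
root 257 keeps its delalloc, which is why the extra work done inside the
commit stays small.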

This issue, involving buffered and direct IO writes with snapshotting, is
often triggered by fstest btrfs/078, and is reported by fsck when not
using the NO_HOLES feature, for example:

  $ cat results/btrfs/078.full
  (...)
  _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
  *** fsck.btrfs output ***
  [1/7] checking root items
  [2/7] checking extents
  [3/7] checking free space cache
  [4/7] checking fs roots
  root 258 inode 264 errors 100, file extent discount
  Found file extent holes:
        start: 524288, len: 65536
  ERROR: errors found in fs roots

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-03-13 17:13:48 +01:00
tests btrfs: remove always true if branch in find_delalloc_range 2018-12-17 14:51:44 +01:00
acl.c Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl 2019-02-25 14:13:17 +01:00
async-thread.c btrfs: simplify workqueue name when allocating 2019-02-25 14:13:24 +01:00
async-thread.h
backref.c btrfs: honor path->skip_locking in backref code 2019-02-25 14:13:39 +01:00
backref.h
btrfs_inode.h Btrfs: fix fsync of files with multiple hard links in new directories 2018-12-17 14:51:43 +01:00
check-integrity.c btrfs: Fix typos in comments and strings 2018-12-17 14:51:50 +01:00
check-integrity.h
compression.c btrfs: change set_level() to bound the level passed in 2019-02-25 14:13:32 +01:00
compression.h btrfs: change set_level() to bound the level passed in 2019-02-25 14:13:32 +01:00
ctree.c Btrfs: remove assertion when searching for a key in a node/leaf 2019-02-25 14:19:23 +01:00
ctree.h btrfs: check for refs on snapshot delete resume 2019-02-27 14:08:47 +01:00
dedupe.h
delayed-inode.c Btrfs: kill btrfs_clear_path_blocking 2018-10-15 17:23:38 +02:00
delayed-inode.h Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root 2018-10-15 17:23:33 +02:00
delayed-ref.c btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record 2019-02-25 14:13:39 +01:00
delayed-ref.h btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record 2019-02-25 14:13:39 +01:00
dev-replace.c btrfs: drop the lock on error in btrfs_dev_replace_cancel 2019-02-25 14:13:41 +01:00
dev-replace.h btrfs: dev-replace: open code trivial locking helpers 2018-12-17 14:51:45 +01:00
dir-item.c btrfs: Remove root parameter from btrfs_insert_dir_item 2018-10-15 17:23:25 +02:00
disk-io.c btrfs: scrub: convert scrub_workers_refcnt to refcount_t 2019-02-25 14:13:38 +01:00
disk-io.h btrfs: drop extra enum initialization where using defaults 2018-12-17 14:51:43 +01:00
export.c btrfs: Remove 'objectid' member from struct btrfs_root 2018-10-15 17:23:25 +02:00
export.h
extent_io.c Btrfs: fix corruption reading shared and compressed extents after hole punching 2019-02-27 12:24:07 +01:00
extent_io.h btrfs: Remove EXTENT_FIRST_DELALLOC bit 2019-02-25 14:13:36 +01:00
extent_map.c btrfs: Remove impossible condition from mergable_maps 2019-02-25 14:13:21 +01:00
extent_map.h btrfs: Remove impossible condition from mergable_maps 2019-02-25 14:13:21 +01:00
extent-tree.c btrfs: save drop_progress if we drop refs at all 2019-02-27 14:08:47 +01:00
file-item.c btrfs: replace btrfs_io_bio::end_io with a simple helper 2018-12-17 14:51:40 +01:00
file.c btrfs: Remove unused arguments from btrfs_get_extent_fiemap 2019-02-25 14:13:17 +01:00
free-space-cache.c Btrfs: fix deadlock on tree root leaf when finding free extent 2018-11-06 16:42:32 +01:00
free-space-cache.h
free-space-tree.c btrfs: use EXPORT_FOR_TESTS for conditionally exported functions 2018-12-17 14:51:37 +01:00
free-space-tree.h
inode-item.c
inode-map.c btrfs: prune unused includes 2018-08-06 13:12:43 +02:00
inode-map.h
inode.c btrfs: reserve extra space during evict 2019-02-25 14:13:35 +01:00
ioctl.c Btrfs: fix deadlock between clone/dedupe and rename 2019-02-27 12:24:16 +01:00
Kconfig
locking.c btrfs: simplify waiting loop in btrfs_tree_lock 2019-02-25 14:13:28 +01:00
locking.h btrfs: merge btrfs_set_lock_blocking_rw with it's caller 2019-02-25 14:13:28 +01:00
lzo.c btrfs: change set_level() to bound the level passed in 2019-02-25 14:13:32 +01:00
Makefile
math.h
ordered-data.c Btrfs: remove no longer used stuff for tracking pending ordered extents 2018-12-17 14:51:25 +01:00
ordered-data.h btrfs: switch BTRFS_ORDERED_* to enums 2018-12-17 14:51:43 +01:00
orphan.c
print-tree.c btrfs: annotate unlikely branches after V0 extent type removal 2018-08-06 13:12:41 +02:00
print-tree.h
props.c
props.h
qgroup.c btrfs: move ulist allocation out of transaction in quota enable 2019-02-27 14:10:25 +01:00
qgroup.h btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record 2019-02-25 14:13:39 +01:00
raid56.c btrfs: Fix typos in comments and strings 2018-12-17 14:51:50 +01:00
raid56.h
rcu-string.h
reada.c btrfs: dev-replace: open code trivial locking helpers 2018-12-17 14:51:45 +01:00
ref-verify.c btrfs: replace btrfs_set_lock_blocking_rw with appropriate helpers 2019-02-25 14:13:27 +01:00
ref-verify.h
relocation.c Btrfs: add missing error handling after doing leaf/node binary search 2019-02-25 14:19:23 +01:00
root-tree.c btrfs: check for refs on snapshot delete resume 2019-02-27 14:08:47 +01:00
scrub.c btrfs: init csum_list before possible free 2019-02-25 14:13:41 +01:00
send.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
send.h
struct-funcs.c btrfs: prune unused includes 2018-08-06 13:12:43 +02:00
super.c btrfs: add zstd compression level support 2019-02-25 14:13:33 +01:00
sysfs.c btrfs: Add sysfs support for metadata_uuid feature 2018-12-17 14:51:37 +01:00
sysfs.h btrfs: drop extra enum initialization where using defaults 2018-12-17 14:51:43 +01:00
transaction.c Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes 2019-03-13 17:13:48 +01:00
transaction.h btrfs: drop extra enum initialization where using defaults 2018-12-17 14:51:43 +01:00
tree-checker.c btrfs: Fix typos in comments and strings 2018-12-17 14:51:50 +01:00
tree-checker.h
tree-defrag.c btrfs: open code now trivial btrfs_set_lock_blocking 2019-02-25 14:13:27 +01:00
tree-log.c btrfs: remove WARN_ON in log_dir_items 2019-03-13 17:13:32 +01:00
tree-log.h Btrfs: remove no longer used io_err from btrfs_log_ctx 2018-12-17 14:51:31 +01:00
ulist.c
ulist.h
uuid-tree.c
volumes.c btrfs: ensure that a DUP or RAID1 block group has exactly two stripes 2019-02-25 14:13:41 +01:00
volumes.h btrfs: introduce new ioctl to unregister a btrfs device 2019-02-25 14:13:30 +01:00
xattr.c Btrfs: use nofs context when initializing security xattrs to avoid deadlock 2018-12-17 14:51:49 +01:00
xattr.h
zlib.c btrfs: change set_level() to bound the level passed in 2019-02-25 14:13:32 +01:00
zstd.c btrfs: zstd: ensure reclaim timer is properly cleaned up 2019-02-27 17:45:04 +01:00