linux/fs/btrfs
Qu Wenruo 28d70e237d btrfs: scrub: Fix RAID56 recovery race condition
When scrubbing a RAID5 filesystem that has recoverable data corruption
(only one data stripe is corrupted), scrub sometimes reports more csum
errors than expected. Sometimes it even reports an unrecoverable error.

The problem can be easily reproduced by the following steps:
1) Create a btrfs filesystem with the RAID5 data profile on 3 devices
2) Mount it with nospace_cache or space_cache=v2
   to avoid extra data space usage
3) Create a 128K file, sync the fs, then unmount it
   The 128K file now lies at the beginning of the data chunk
4) Locate the physical bytenr of the data chunk on dev3
   Dev3 holds the 1st data stripe
5) Corrupt the first 64K of the data chunk stripe on dev3
6) Mount the fs and scrub it

The correct number of csum errors is 16 (assuming x86_64).
A larger number of csum errors is reported with a chance of about 1/3,
and an unrecoverable error is reported with a chance of about 1/10.
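
The expected count of 16 follows from simple arithmetic. The sketch below assumes one data csum per sector and a 4K sector size matching the x86_64 page size; neither figure is stated in the message above, so treat both as assumptions:

```python
# Assumption: one data csum per sector, with a 4 KiB sector size
# (equal to the page size on x86_64).
CORRUPTED_BYTES = 64 * 1024  # step 5 corrupts the first 64K on dev3
SECTOR_SIZE = 4 * 1024       # assumed sector size

# Every corrupted sector should fail its csum exactly once.
expected_csum_errors = CORRUPTED_BYTES // SECTOR_SIZE
print(expected_csum_errors)
```

Any count above this value means at least one sector was accounted more than once.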

The root cause of the problem is a race condition in the RAID5/6
recovery code, which arises because a full scrub is initiated per device.

For the mirror-based profiles, each mirror is independent of the others,
so the race does not cause any serious problem.

For example:
        Corrupted       |       Correct          |      Correct        |
|   Scrub dev3 (D1)     |    Scrub dev2 (D2)     |    Scrub dev1(P)    |
------------------------------------------------------------------------
Read out D1             |Read out D2             |Read full stripe     |
Check csum              |Check csum              |Check parity         |
Csum mismatch           |Csum match, continue    |Parity mismatch      |
handle_errored_block    |                        |handle_errored_block |
 Read out full stripe   |                        | Read out full stripe|
 D1 csum error(err++)   |                        | D1 csum error(err++)|
 Recover D1             |                        | Recover D1          |

So D1's csum error is accounted twice, because handle_errored_block()
lacks sufficient protection and the two recoveries can race.

In an even worse case, for example when D1's recovery code is rewriting
D1/D2/P while P's recovery code is reading out the full stripe, we can
end up with an unrecoverable error.
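
A toy model may help show why a read that overlaps a rewrite is dangerous. It assumes plain XOR parity for RAID5, uses a byte sum as a stand-in for the real csum, and represents the in-flight state by an arbitrary hypothetical value of D2; it does not reproduce the kernel's actual intermediate states:

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized stripes."""
    return bytes(x ^ y for x, y in zip(a, b))

def toy_csum(data: bytes) -> int:
    """Stand-in checksum for this sketch (not the csum btrfs uses)."""
    return sum(data)

d1 = b"\x11" * 16          # the data that should be on dev3
d2 = b"\x22" * 16
parity = xor(d1, d2)       # RAID5 parity over the two data stripes
d1_csum = toy_csum(d1)

# Consistent read: D2 and P belong to the same state, so D1 is rebuilt.
rebuilt = xor(d2, parity)
consistent_ok = toy_csum(rebuilt) == d1_csum

# Hypothetical torn read: D2 is observed in some transient state while
# P is already final, so the rebuilt D1 fails its csum.
d2_transient = b"\x00" * 16
rebuilt_torn = xor(d2_transient, parity)
torn_ok = toy_csum(rebuilt_torn) == d1_csum

print(consistent_ok, torn_ok)
```

With only one parity stripe there is no further redundancy to fall back on, so a rebuild from inconsistent inputs looks like an unrecoverable error.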

This patch uses the previously introduced lock_full_stripe() and
unlock_full_stripe() to protect the whole scrub_handle_errored_block()
function for RAID56 recovery, so that neither extra csum errors nor
unrecoverable errors are reported.
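
The effect of the lock can be sketched in user space. This is a minimal model, not the kernel code: the FullStripe class, the byte-sum csum, and the use of threading.Lock in place of lock_full_stripe()/unlock_full_stripe() are all assumptions made for illustration:

```python
import threading

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized stripes."""
    return bytes(x ^ y for x, y in zip(a, b))

class FullStripe:
    """Toy full stripe: two data stripes plus XOR parity (hypothetical)."""

    def __init__(self, d1: bytes, d2: bytes):
        self.d1, self.d2, self.parity = d1, d2, xor(d1, d2)
        self.d1_csum = sum(d1)        # stand-in for the stored data csum
        self.lock = threading.Lock()  # plays the role of the full stripe lock
        self.csum_errors = 0

    def handle_errored_block(self) -> None:
        # The whole handler runs under the lock, so the read, the error
        # accounting and the repair cannot interleave with another handler.
        with self.lock:
            # Re-read under the lock: an earlier holder may have repaired D1.
            if sum(self.d1) == self.d1_csum:
                return
            self.csum_errors += 1
            self.d1 = xor(self.d2, self.parity)

stripe = FullStripe(b"\x11" * 16, b"\x22" * 16)
stripe.d1 = b"\x00" * 16  # corrupt D1 after the parity was computed

# Two scrubbers (think dev3 and dev1) hit the same full stripe.
workers = [threading.Thread(target=stripe.handle_errored_block) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(stripe.csum_errors)
```

Whichever handler takes the lock second finds D1 already repaired, so the error is counted once and nothing is read while the stripe is being rewritten.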

Reported-by: Goffredo Baroncelli <kreijack@libero.it>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:27 +02:00
tests btrfs: remove unused qgroup members from btrfs_trans_handle 2017-04-18 14:07:25 +02:00
acl.c posix_acl: Clear SGID bit when setting file permissions 2016-09-22 10:55:32 +02:00
async-thread.c btrfs: fix crash when tracepoint arguments are freed by wq callbacks 2017-01-09 11:24:50 +01:00
async-thread.h btrfs: limit async_work allocation and worker func duration 2016-12-13 11:01:30 -08:00
backref.c btrfs: replace hardcoded value with SEQ_LAST macro 2017-04-18 14:07:25 +02:00
backref.h
btrfs_inode.h btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode 2017-02-28 11:30:12 +01:00
check-integrity.c btrfs: take an fs_info directly when the root is not used otherwise 2016-12-06 16:06:59 +01:00
check-integrity.h btrfs: take an fs_info directly when the root is not used otherwise 2016-12-06 16:06:59 +01:00
compression.c btrfs: convert compressed_bio.pending_bios from atomic_t to refcount_t 2017-04-18 14:07:24 +02:00
compression.h btrfs: derive maximum output size in the compression implementation 2017-02-28 14:26:36 +01:00
ctree.c btrfs: sink GFP flags parameter to tree_mod_log_insert_root 2017-04-18 14:07:26 +02:00
ctree.h btrfs: scrub: Introduce full stripe lock for RAID56 2017-04-18 14:07:27 +02:00
dedupe.h btrfs: expand cow_file_range() to support in-band dedup and subpage-blocksize 2016-07-26 13:52:25 +02:00
delayed-inode.c btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_t 2017-04-18 14:07:23 +02:00
delayed-inode.h btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_t 2017-04-18 14:07:23 +02:00
delayed-ref.c btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to refcount_t 2017-04-18 14:07:23 +02:00
delayed-ref.h btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to refcount_t 2017-04-18 14:07:23 +02:00
dev-replace.c Btrfs: switch to div64_u64 if with a u64 divisor 2017-04-18 14:07:26 +02:00
dev-replace.h btrfs: constify device path passed to relevant helpers 2017-02-28 14:26:07 +01:00
dir-item.c btrfs: do proper error handling in btrfs_insert_xattr_item 2017-02-28 14:27:11 +01:00
disk-io.c btrfs: remove redundant parameter from btree_readahead_hook 2017-04-18 14:07:25 +02:00
disk-io.h btrfs: convert btrfs_root.refs from atomic_t to refcount_t 2017-04-18 14:07:23 +02:00
export.c btrfs: Make btrfs_ino take a struct btrfs_inode 2017-02-14 15:50:51 +01:00
export.h
extent_io.c Btrfs: enable repair during read for raid56 profile 2017-04-18 14:07:26 +02:00
extent_io.h btrfs: convert extent_state.refs from atomic_t to refcount_t 2017-04-18 14:07:23 +02:00
extent_map.c btrfs: convert extent_map.refs from atomic_t to refcount_t 2017-04-18 14:07:23 +02:00
extent_map.h btrfs: convert extent_map.refs from atomic_t to refcount_t 2017-04-18 14:07:23 +02:00
extent-tree.c btrfs: scrub: Introduce full stripe lock for RAID56 2017-04-18 14:07:27 +02:00
file-item.c Merge branch 'for-chris-4.11-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.11 2017-02-28 14:35:09 -08:00
file.c Btrfs: handle only applicable errors returned by btrfs_get_extent 2017-04-18 14:07:27 +02:00
free-space-cache.c btrfs: use clear_page where appropriate 2017-04-18 14:07:26 +02:00
free-space-cache.h btrfs: free-space-cache, clean up unnecessary root arguments 2017-02-17 12:03:56 +01:00
free-space-tree.c btrfs: remove unused parameter from clean_tree_block 2017-02-17 12:03:51 +01:00
free-space-tree.h
hash.c btrfs: advertise which crc32c implementation is being used at module load 2016-06-06 14:08:28 +02:00
hash.h btrfs: advertise which crc32c implementation is being used at module load 2016-06-06 14:08:28 +02:00
inode-item.c btrfs: take an fs_info directly when the root is not used otherwise 2016-12-06 16:06:59 +01:00
inode-map.c btrfs: all btrfs_delalloc_release_metadata take btrfs_inode 2017-02-28 11:30:07 +01:00
inode-map.h Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots 2016-01-15 19:25:02 +01:00
inode.c Btrfs: handle only applicable errors returned by btrfs_get_extent 2017-04-18 14:07:27 +02:00
ioctl.c btrfs: track exclusive filesystem operation in flags 2017-04-18 14:07:25 +02:00
Kconfig
locking.c btrfs: cleanup, remove stray return statements 2016-01-07 14:30:52 +01:00
locking.h
lzo.c btrfs: derive maximum output size in the compression implementation 2017-02-28 14:26:36 +01:00
Makefile
math.h
ordered-data.c btrfs: convert btrfs_ordered_extent.refs from atomic_t to refcount_t 2017-04-18 14:07:23 +02:00
ordered-data.h btrfs: convert btrfs_ordered_extent.refs from atomic_t to refcount_t 2017-04-18 14:07:23 +02:00
orphan.c
print-tree.c btrfs: take an fs_info directly when the root is not used otherwise 2016-12-06 16:06:59 +01:00
print-tree.h btrfs: take an fs_info directly when the root is not used otherwise 2016-12-06 16:06:59 +01:00
props.c btrfs: Make btrfs_ino take a struct btrfs_inode 2017-02-14 15:50:51 +01:00
props.h
qgroup.c btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint 2017-04-18 14:07:26 +02:00
qgroup.h btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint 2017-04-18 14:07:26 +02:00
raid56.c btrfs: Wait for in-flight bios before freeing target device for raid56 2017-04-18 14:07:26 +02:00
raid56.h btrfs: take an fs_info directly when the root is not used otherwise 2016-12-06 16:06:59 +01:00
rcu-string.h
reada.c btrfs: remove local blocksize variable in reada_find_extent 2017-04-18 14:07:25 +02:00
relocation.c btrfs: Make btrfs_orphan_add take btrfs_inode 2017-02-28 11:30:10 +01:00
root-tree.c btrfs: Use ktime_get_real_ts for root ctime 2017-04-18 14:07:27 +02:00
scrub.c btrfs: scrub: Fix RAID56 recovery race condition 2017-04-18 14:07:27 +02:00
send.c Btrfs: fix an integer overflow check 2017-03-29 14:29:08 +02:00
send.h Btrfs: use linux/sizes.h to represent constants 2016-01-07 14:38:02 +01:00
struct-funcs.c btrfs: fix string and comment grammatical issues and typos 2016-05-25 22:35:14 +02:00
super.c btrfs: No need to check !(flags & MS_RDONLY) twice 2017-04-18 14:07:25 +02:00
sysfs.c btrfs: convert printk(KERN_* to use pr_* calls 2016-09-26 18:08:44 +02:00
sysfs.h btrfs: sysfs: introduce helper for syncing bits with sysfs files 2016-01-21 18:50:40 +01:00
transaction.c btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option 2017-04-18 14:07:26 +02:00
transaction.h btrfs: remove unused qgroup members from btrfs_trans_handle 2017-04-18 14:07:25 +02:00
tree-defrag.c
tree-log.c btrfs: convert extent_map.refs from atomic_t to refcount_t 2017-04-18 14:07:23 +02:00
tree-log.h btrfs: Make btrfs_del_inode_ref take btrfs_inode 2017-02-14 15:50:54 +01:00
ulist.c btrfs: ulist: rename ulist_fini to ulist_release 2017-02-17 12:03:50 +01:00
ulist.h btrfs: ulist: rename ulist_fini to ulist_release 2017-02-17 12:03:50 +01:00
uuid-tree.c btrfs: return the actual error value from from btrfs_uuid_tree_iterate 2016-12-19 18:08:15 +01:00
volumes.c btrfs: use q which is already obtained from bdev_get_queue 2017-04-18 14:07:26 +02:00
volumes.h btrfs: drop redundant parameters from btrfs_map_sblock 2017-04-18 14:07:26 +02:00
xattr.c btrfs: fix over-80 lines introduced by previous cleanups 2017-02-14 15:50:57 +01:00
xattr.h btrfs: Switch to generic xattr handlers 2016-05-17 19:17:09 -04:00
zlib.c btrfs: derive maximum output size in the compression implementation 2017-02-28 14:26:36 +01:00