Commit Graph

221 Commits

Author SHA1 Message Date
Filipe Manana
df8d116ffa Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs
If we have an inode with a large number of hard links, some of which may
be extrefs, turn a regular ref into an extref, fsync the inode and then
replay the fsync log (after a crash/reboot), we can endup with an fsync
log that makes the replay code always fail with -EOVERFLOW when processing
the inode's references.

This is easy to reproduce with the test case I made for xfstests. Its steps
are the following:

   _scratch_mkfs "-O extref" >> $seqres.full 2>&1
   _init_flakey
   _mount_flakey

   # Create a test file with 3001 hard links. This number is large enough to
   # make btrfs start using extrefs at some point even if the fs has the maximum
   # possible leaf/node size (64Kb).
   echo "hello world" > $SCRATCH_MNT/foo
   for i in `seq 1 3000`; do
       ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
   done

   # Make sure all metadata and data are durably persisted.
   sync

   # Now remove one link, add a new one with a new name, add another new one with
   # the same name as the one we just removed and fsync the inode.
   rm -f $SCRATCH_MNT/foo_link_0001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001
   rm -f $SCRATCH_MNT/foo_link_0002
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3002
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3003
   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

   # Simulate a crash/power loss. This makes sure the next mount
   # will see an fsync log and will replay that log.

   _load_flakey_table $FLAKEY_DROP_WRITES
   _unmount_flakey

   _load_flakey_table $FLAKEY_ALLOW_WRITES
   _mount_flakey

   # Check that the number of hard links is correct, we are able to remove all
   # the hard links and read the file's data. This is just to verify we don't
   # get stale file handle errors (due to dangling directory index entries that
   # point to inodes that no longer exist).
   echo "Link count: $(stat --format=%h $SCRATCH_MNT/foo)"
   [ -f $SCRATCH_MNT/foo ] || echo "Link foo is missing"
   for ((i = 1; i <= 3003; i++)); do
       name=foo_link_`printf "%04d" $i`
       if [ $i -eq 2 ]; then
           [ -f $SCRATCH_MNT/$name ] && echo "Link $name found"
       else
           [ -f $SCRATCH_MNT/$name ] || echo "Link $name is missing"
       fi
   done
   rm -f $SCRATCH_MNT/foo_link_*
   cat $SCRATCH_MNT/foo
   rm -f $SCRATCH_MNT/foo

   status=0
   exit

The fix is simply to correct the overflow condition when overwriting a
reference item because it was wrong, trying to increase the item in the
fs/subvol tree by an impossible amount. Also ensure that we don't insert
one normal ref and one ext ref for the same dentry - this happened because
processing a dir index entry from the parent in the log happened when
the normal ref item was full, which made the logic insert an extref and
later when the normal ref had enough room, it would be inserted again
when processing the ref item from the child inode in the log.

This issue has been present since the introduction of the extrefs feature
(2012).

A test case for xfstests follows soon. This test only passes if the previous
patch titled "Btrfs: fix fsync when extend references are added to an inode"
is applied too.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:05 -08:00
Filipe Manana
2c2c452b0c Btrfs: fix fsync when extend references are added to an inode
If we added an extended reference to an inode and fsync'ed it, the log
replay code would make our inode have an incorrect link count, which
was lower then the expected/correct count.
This resulted in stale directory index entries after deleting some of
the hard links, and any access to the dangling directory entries resulted
in -ESTALE errors because the entries pointed to inode items that don't
exist anymore.

This is easy to reproduce with the test case I made for xfstests, and
the bulk of that test is:

    _scratch_mkfs "-O extref" >> $seqres.full 2>&1
    _init_flakey
    _mount_flakey

    # Create a test file with 3001 hard links. This number is large enough to
    # make btrfs start using extrefs at some point even if the fs has the maximum
    # possible leaf/node size (64Kb).
    echo "hello world" > $SCRATCH_MNT/foo
    for i in `seq 1 3000`; do
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
    done

    # Make sure all metadata and data are durably persisted.
    sync

    # Add one more link to the inode that ends up being a btrfs extref and fsync
    # the inode.
    ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

    # Simulate a crash/power loss. This makes sure the next mount
    # will see an fsync log and will replay that log.

    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    # Now after the fsync log replay btrfs left our inode with a wrong link count N,
    # which was smaller than the correct link count M (N < M).
    # So after removing N hard links, the remaining M - N directory entries were
    # still visible to user space but it was impossible to do anything with them
    # because they pointed to an inode that didn't exist anymore. This resulted in
    # stale file handle errors (-ESTALE) when accessing those dentries for example.
    #
    # So remove all hard links except the first one and then attempt to read the
    # file, to verify we don't get an -ESTALE error when accessing the inodel
    #
    # The btrfs fsck tool also detected the incorrect inode link count and it
    # reported an error message like the following:
    #
    # root 5 inode 257 errors 2001, no inode item, link count wrong
    #   unresolved ref dir 256 index 2978 namelen 13 name foo_link_2976 filetype 1 errors 4, no inode ref
    #
    # The fstests framework automatically calls fsck after a test is run, so we
    # don't need to call fsck explicitly here.

    rm -f $SCRATCH_MNT/foo_link_*
    cat $SCRATCH_MNT/foo

    status=0
    exit

So make sure an fsync always flushes the delayed inode item, so that the
fsync log contains it (needed in order to trigger the link count fixup
code) and fix the extref counting function, which always return -ENOENT
to its caller (and made it assume there were always 0 extrefs).

This issue has been present since the introduction of the extrefs feature
(2012).

A test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
Filipe Manana
d36808e0d4 Btrfs: fix directory inconsistency after fsync log replay
If we have an inode (file) with a link count greater than 1, remove
one of its hard links, fsync the inode, power fail/crash and then
replay the fsync log on the next mount, we end up getting the parent
directory's metadata inconsistent - its i_size still reflects the
deleted hard link and has dangling index entries (with no matching
inode reference entries). This prevents the directory from ever being
deletable, as its i_size can never decrease to BTRFS_EMPTY_DIR_SIZE
even if all of its children inodes are deleted, and the dangling index
entries can never be removed (as they point to an inode that does not
exist anymore).

This is easy to reproduce with the following excerpt from the test case
for xfstests that I just made:

    _scratch_mkfs >> $seqres.full 2>&1

    _init_flakey
    _mount_flakey

    # Create a test file with 2 hard links in the same directory.
    mkdir -p $SCRATCH_MNT/a/b
    echo "hello world" > $SCRATCH_MNT/a/b/foo
    ln $SCRATCH_MNT/a/b/foo $SCRATCH_MNT/a/b/bar

    # Make sure all metadata and data are durably persisted.
    sync

    # Now remove one of the hard links and fsync the inode.
    rm -f $SCRATCH_MNT/a/b/bar
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/b/foo

    # Simulate a crash/power loss. This makes sure the next mount
    # will see an fsync log and will replay that log.

    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    # Remove the last hard link of the file and attempt to remove its parent
    # directory - this failed in btrfs because the fsync log and replay code
    # didn't decrement the parent directory's i_size and left dangling directory
    # index entries - this made the btrfs rmdir implementation always fail with
    # the error -ENOTEMPTY.
    #
    # The dangling directory index entries were visible to user space, but it was
    # impossible to do anything on them (unlink, open, read, write, stat, etc)
    # because the inode they pointed to did not exist anymore.
    #
    # The parent directory's metadata inconsistency (stale index entries) was
    # also detected by btrfs' fsck tool, which is run automatically by the fstests
    # framework when the test finishes. The error message reported by fsck was:
    #
    # root 5 inode 259 errors 2001, no inode item, link count wrong
    #   unresolved ref dir 258 index 3 namelen 3 name bar filetype 1 errors 4, no inode ref
    #
    rm -f $SCRATCH_MNT/a/b/*
    rmdir $SCRATCH_MNT/a/b
    rmdir $SCRATCH_MNT/a

To fix this just make sure that after an unlink, if the inode is fsync'ed,
he parent inode is fully logged in the fsync log.

A test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
Chris Mason
57bbddd7fb Merge branch 'cleanup/blocksize-diet-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2015-01-21 17:49:35 -08:00
David Sterba
9c4f61f01d btrfs: simplify insert_orphan_item
We can search and add the orphan item in one go,
btrfs_insert_orphan_item will find out if the item already exists.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14 19:23:48 +01:00
David Sterba
381cf6587f btrfs: fix leak of path in btrfs_find_item
If btrfs_find_item is called with NULL path it allocates one locally but
does not free it. Affected paths are inserting an orphan item for a file
and for a subvol root.

Move the path allocation to the callers.

CC: <stable@vger.kernel.org> # 3.14+
Fixes: 3f870c2899 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality")
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14 19:23:46 +01:00
David Sterba
a83fffb75d btrfs: sink blocksize parameter to btrfs_find_create_tree_block
Finally it's clear that the requested blocksize is always equal to
nodesize, with one exception, the superblock.

Superblock has fixed size regardless of the metadata block size, but
uses the same helpers to initialize sys array/chunk tree and to work
with the chunk items. So it pretends to be an extent_buffer for a
moment, btrfs_read_sys_array is full of special cases, we're adding one
more.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:21 +01:00
Filipe Manana
b38ef71cb1 Btrfs: ensure ordered extent errors aren't missed on fsync
When doing a fsync with a fast path we have a time window where we can miss
the fact that writeback of some file data failed, and therefore we endup
returning success (0) from fsync when we should return an error.
The steps that lead to this are the following:

1) We start all ordered extents by calling filemap_fdatawrite_range();

2) We do some other work like locking the inode's i_mutex, start a transaction,
   start a log transaction, etc;

3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
   ordered extents from inode's ordered tree into a list;

4) But by the time we do ordered extent collection, some ordered extents we started
   at step 1) might have already completed with an error, and therefore we didn't
   found them in the ordered tree and had no idea they finished with an error. This
   makes our fsync return success (0) to userspace, but has no bad effects on the log
   like for example insertion of file extent items into the log that point to unwritten
   extents, because the invalid extent maps were removed before the ordered extent
   completed (in inode.c:btrfs_finish_ordered_io).

So after collecting the ordered extents just check if the inode's i_mapping has any
error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
writeback fails for a page of an ordered extent, we call mapping_set_error (done in
extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
that sets one of those error flags in the inode's i_mapping flags.

This change also has the side effect of fixing the issue where for fast fsyncs we
never checked/cleared the error flags from the inode's i_mapping flags, which means
that a full fsync performed after a fast fsync could get such errors that belonged
to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
calls filemap_fdatawait_range(), and the later checks for and clears those flags,
while for fast fsyncs we never call filemap_fdatawait_range() or anything else
that checks for and clears the error flags from the inode's i_mapping.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:59:57 -08:00
Filipe Manana
0870295b23 Btrfs: collect only the necessary ordered extents on ranged fsync
Instead of collecting all ordered extents from the inode's ordered tree
and then wait for all of them to complete, just collect the ones that
overlap the fsync range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:59:56 -08:00
Filipe Manana
5ab5e44a36 Btrfs: don't ignore log btree writeback errors
If an error happens during writeback of log btree extents, make sure the
error is returned to the caller (fsync), so that it takes proper action
(commit current transaction) instead of writing a superblock that points
to log btrees with all or some nodes that weren't durably persisted.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:59:55 -08:00
Josef Bacik
50d9aa99bd Btrfs: make sure logged extents complete in the current transaction V3
Liu Bo pointed out that my previous fix would lose the generation update in the
scenario I described.  It is actually much worse than that, we could lose the
entire extent if we lose power right after the transaction commits.  Consider
the following

write extent 0-4k
log extent in log tree
commit transaction
	< power fail happens here
ordered extent completes

We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
the transaction commit will reset the log so it isn't replayed.  If we lose
power before the transaction commit we are save, otherwise we are not.

Fix this by keeping track of all extents we logged in this transaction.  Then
when we go to commit the transaction make sure we wait for all of those ordered
extents to complete before proceeding.  This will make sure that if we lose
power after the transaction commit we still have our data.  This also fixes the
problem of the improperly updated extent generation.  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:58:32 -08:00
Josef Bacik
9dba8cf128 Btrfs: make sure we wait on logged extents when fsycning two subvols
If we have two fsync()'s race on different subvols one will do all of its work
to get into the log_tree, wait on it's outstanding IO, and then allow the
log_tree to finish it's commit.  The problem is we were just free'ing that
subvols logged extents instead of waiting on them, so whoever lost the race
wouldn't really have their data on disk.  Fix this by waiting properly instead
of freeing the logged extents.  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:10 -08:00
Linus Torvalds
4f4274af70 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Filipe is nailing down some problems with our skinny extent variation,
  and Dave's patch fixes endian problems in the new super block checks"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
  Btrfs: properly clean up btrfs_end_io_wq_cache
  Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
  btrfs: use macro accessors in superblock validation checks
2014-11-01 10:41:26 -07:00
Filipe Manana
1a4ed8fdca Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
If we couldn't find our extent item, we accessed the current slot
(path->slots[0]) to check if it corresponds to an equivalent skinny
metadata item. However this slot could be beyond our last item in the
leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
we shouldn't process it.

Since btrfs_lookup_extent() is only used to find extent items for data
extents, fix this by removing completely the logic that looks up for an
equivalent skinny metadata item, since it can not exist.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-27 13:16:52 -07:00
Linus Torvalds
90d0c376f5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "The largest set of changes here come from Miao Xie.  He's cleaning up
  and improving read recovery/repair for raid, and has a number of
  related fixes.

  I've merged another set of fsync fixes from Filipe, and he's also
  improved the way we handle metadata write errors to make sure we force
  the FS readonly if things go wrong.

  Otherwise we have a collection of fixes and cleanups.  Dave Sterba
  gets a cookie for removing the most lines (thanks Dave)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (139 commits)
  btrfs: Fix compile error when CONFIG_SECURITY is not set.
  Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off
  btrfs: Make btrfs handle security mount options internally to avoid losing security label.
  Btrfs: send, don't delay dir move if there's a new parent inode
  btrfs: add more superblock checks
  Btrfs: fix race in WAIT_SYNC ioctl
  Btrfs: be aware of btree inode write errors to avoid fs corruption
  Btrfs: remove redundant btrfs_verify_qgroup_counts declaration.
  btrfs: fix shadow warning on cmp
  Btrfs: fix compilation errors under DEBUG
  Btrfs: fix crash of btrfs_release_extent_buffer_page
  Btrfs: add missing end_page_writeback on submit_extent_page failure
  btrfs: Fix the wrong condition judgment about subset extent map
  Btrfs: fix build_backref_tree issue with multiple shared blocks
  Btrfs: cleanup error handling in build_backref_tree
  btrfs: move checks for DUMMY_ROOT into a helper
  btrfs: new define for the inline extent data start
  btrfs: kill extent_buffer_page helper
  btrfs: drop constant param from btrfs_release_extent_buffer_page
  btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB
  ...
2014-10-11 08:03:52 -04:00
Linus Torvalds
28596c9722 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull "trivial tree" updates from Jiri Kosina:
 "Usual pile from trivial tree everyone is so eagerly waiting for"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
  Remove MN10300_PROC_MN2WS0038
  mei: fix comments
  treewide: Fix typos in Kconfig
  kprobes: update jprobe_example.c for do_fork() change
  Documentation: change "&" to "and" in Documentation/applying-patches.txt
  Documentation: remove obsolete pcmcia-cs from Changes
  Documentation: update links in Changes
  Documentation: Docbook: Fix generated DocBook/kernel-api.xml
  score: Remove GENERIC_HAS_IOMAP
  gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
  tty: doc: Fix grammar in serial/tty
  dma-debug: modify check_for_stack output
  treewide: fix errors in printk
  genirq: fix reference in devm_request_threaded_irq comment
  treewide: fix synchronize_rcu() in comments
  checkstack.pl: port to AArch64
  doc: queue-sysfs: minor fixes
  init/do_mounts: better syntax description
  MIPS: fix comment spelling
  powerpc/simpleboot: fix comment
  ...
2014-10-07 21:16:26 -04:00
Filipe Manana
8407f55326 Btrfs: fix data corruption after fast fsync and writeback error
When we do a fast fsync, we start all ordered operations and then while
they're running in parallel we visit the list of modified extent maps
and construct their matching file extent items and write them to the
log btree. After that, in btrfs_sync_log() we wait for all the ordered
operations to finish (via btrfs_wait_logged_extents).

The problem with this is that we were completely ignoring errors that
can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM
or -EIO errors for example. When such error happens, it means we have parts
of the on disk extent that weren't written to, and so we end up logging
file extent items that point to these extents that contain garbage/random
data - so after a crash/reboot plus log replay, we get our inode's metadata
pointing to those extents.

This worked in contrast with the full (non-fast) fsync path, where we
start all ordered operations, wait for them to finish and then write
to the log btree. In this path, after each ordered operation completes
we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return
-EIO if so (via btrfs_wait_ordered_range).

So if an error happens with any ordered operation, just return a -EIO
error to userspace, so that it knows that not all of its previous writes
were durably persisted and the application can take proper action (like
redo the writes for e.g.) - and definitely not leave any file extent items
in the log refer to non fully written extents.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-19 06:57:51 -07:00
Filipe Manana
a2cc11db24 Btrfs: fix directory recovery from fsync log
When replaying a directory from the fsync log, if a directory entry
exists both in the fs/subvol tree and in the log, the directory's inode
got its i_size updated incorrectly, accounting for the dentry's name
twice.

Reproducer, from a test for xfstests:

    _scratch_mkfs >> $seqres.full 2>&1
    _init_flakey
    _mount_flakey

    touch $SCRATCH_MNT/foo
    sync

    touch $SCRATCH_MNT/bar
    xfs_io -c "fsync" $SCRATCH_MNT
    xfs_io -c "fsync" $SCRATCH_MNT/bar

    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    [ -f $SCRATCH_MNT/foo ] || echo "file foo is missing"
    [ -f $SCRATCH_MNT/bar ] || echo "file bar is missing"

    _unmount_flakey
    _check_scratch_fs $FLAKEY_DEV

The filesystem check at the end failed with the message:
"root 5 root dir 256 error".

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:27 -07:00
Filipe Manana
f98de9b9c0 Btrfs: make btrfs_search_forward return with nodes unlocked
None of the uses of btrfs_search_forward() need to have the path
nodes (level >= 1) read locked, only the leaf needs to be locked
while the caller processes it. Therefore make it return a path
with all nodes unlocked, except for the leaf.

This change is motivated by the observation that during a file
fsync we repeatdly call btrfs_search_forward() and process the
returned leaf while upper nodes of the returned path (level >= 1)
are read locked, which unnecessarily blocks other tasks that want
to write to the same fs/subvol btree.
Therefore instead of modifying the fsync code to unlock all nodes
with level >= 1 immediately after calling btrfs_search_forward(),
change btrfs_search_forward() to do it, so that it benefits all
callers.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:02 -07:00
David Sterba
707e8a0715 btrfs: use nodesize everywhere, kill leafsize
The nodesize and leafsize were never of different values. Unify the
usage and make nodesize the one. Cleanup the redundant checks and
helpers.

Shaves a few bytes from .text:

  text    data     bss     dec     hex filename
852418   24560   23112  900090   dbbfa btrfs.ko.before
851074   24584   23112  898770   db6d2 btrfs.ko.after

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:14 -07:00
David Sterba
962a298f35 btrfs: kill the key type accessor helpers
btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:12 -07:00
Filipe Manana
125c4cf9f3 Btrfs: set inode's logged_trans/last_log_commit after ranged fsync
When a ranged fsync finishes if there are still extent maps in the modified
list, still set the inode's logged_trans and last_log_commit. This is important
in case an inode is fsync'ed and unlinked in the same transaction, to ensure its
inode ref gets deleted from the log and the respective dentries in its parent
are deleted too from the log (if the parent directory was fsync'ed in the same
transaction).

Instead make btrfs_inode_in_log() return false if the list of modified extent
maps isn't empty.

This is an incremental on top of the v4 version of the patch:

    "Btrfs: fix fsync data loss after a ranged fsync"

which was added to its v5, but didn't make it on time.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-16 16:12:19 -07:00
Filipe Manana
49dae1bc1c Btrfs: fix fsync data loss after a ranged fsync
While we're doing a full fsync (when the inode has the flag
BTRFS_INODE_NEEDS_FULL_SYNC set) that is ranged too (covers only a
portion of the file), we might have ordered operations that are started
before or while we're logging the inode and that fall outside the fsync
range.

Therefore when a full ranged fsync finishes don't remove every extent
map from the list of modified extent maps - as for some of them, that
fall outside our fsync range, their respective ordered operation hasn't
finished yet, meaning the corresponding file extent item wasn't inserted
into the fs/subvol tree yet and therefore we didn't log it, and we must
let the next fast fsync (one that checks only the modified list) see this
extent map and log a matching file extent item to the log btree and wait
for its ordered operation to finish (if it's still ongoing).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-08 13:56:43 -07:00
Rasmus Villemoes
a71db86e86 fs/btrfs/tree-log.c: Fix closing brace followed by if
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-08-26 09:35:51 +02:00
Filipe Manana
74121f7cbb Btrfs: fix hole detection during file fsync
The file hole detection logic during a file fsync wasn't correct,
because it didn't look back (in a previous leaf) for the last file
extent item that can be in a leaf to the left of our leaf and that
has a generation lower than the current transaction id. This made it
assume that a hole exists when it really doesn't exist in the file.

Such false positive hole detection happens in the following scenario:

* We have a file that has many file extent items, covering 3 or more
  btree leafs (the first leaf must contain non file extent items too).

* Two ranges of the file are modified, with their extent items being
  located at 2 different leafs and those leafs aren't consecutive.

* When processing the second modified leaf, we weren't checking if
  some file extent item exists that is located in some leaf that is
  between our 2 modified leafs, and therefore assumed the range defined
  between the last file extent item in the first leaf and the first file
  extent item in the second leaf matched a hole.

Fortunately this didn't result in overriding the log with wrong data,
instead it made the last loop in copy_items() attempt to insert a
duplicated key (for a hole file extent item), which makes the file
fsync code return with -EEXIST to file.c:btrfs_sync_file() which in
turn ends up doing a full transaction commit, which is much more
expensive then writing only to the log tree and wait for it to be
durably persisted (as well as the file's modified extents/pages).
Therefore fix the hole detection logic, so that we don't pay the
cost of doing full transaction commits.

I could trigger this issue with the following test for xfstests (which
never fails, either without or with this patch). The last fsync call
results in a full transaction commit, due to the -EEXIST error mentioned
above. I could also observe this behaviour happening frequently when
running xfstests/generic/075 in a loop.

Test:

    _cleanup()
    {
        _cleanup_flakey
        rm -fr $tmp
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter
    . ./common/dmflakey

    # real QA test starts here
    _supported_fs btrfs
    _supported_os Linux
    _require_scratch
    _require_dm_flakey
    _need_to_be_root

    rm -f $seqres.full

    # Create a file with many file extent items, each representing a 4Kb extent.
    # These items span 3 btree leaves, of 16Kb each (default mkfs.btrfs leaf size
    # as of btrfs-progs 3.12).
    _scratch_mkfs -l 16384 >/dev/null 2>&1
    _init_flakey
    SAVE_MOUNT_OPTIONS="$MOUNT_OPTIONS"
    MOUNT_OPTIONS="$MOUNT_OPTIONS -o commit=999"
    _mount_flakey

    # First fsync, inode has BTRFS_INODE_NEEDS_FULL_SYNC flag set.
    $XFS_IO_PROG -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" \
            $SCRATCH_MNT/foo | _filter_xfs_io

    # For any of the following fsync calls, inode doesn't have the flag
    # BTRFS_INODE_NEEDS_FULL_SYNC set.
    for ((i = 1; i <= 500; i++)); do
        OFFSET=$((4096 * i))
        LEN=4096
        $XFS_IO_PROG -c "pwrite -S 0x01 $OFFSET $LEN" -c "fsync" \
                $SCRATCH_MNT/foo | _filter_xfs_io
    done

    # Commit transaction and bump next transaction's id (to 7).
    sync

    # Truncate will set the BTRFS_INODE_NEEDS_FULL_SYNC flag in the btrfs's
    # inode runtime flags.
    $XFS_IO_PROG -c "truncate 2048000" $SCRATCH_MNT/foo

    # Commit transaction and bump next transaction's id (to 8).
    sync

    # Touch 1 extent item from the first leaf and 1 from the last leaf. The leaf
    # in the middle, containing only file extent items, isn't touched. So the
    # next fsync, when calling btrfs_search_forward(), won't visit that middle
    # leaf. First and 3rd leaf have now a generation with value 8, while the
    # middle leaf remains with a generation with value 6.
    $XFS_IO_PROG \
        -c "pwrite -S 0xee -b 4096 0 4096" \
        -c "pwrite -S 0xff -b 4096 2043904 4096" \
        -c "fsync" \
        $SCRATCH_MNT/foo | _filter_xfs_io

    _load_flakey_table $FLAKEY_DROP_WRITES
    md5sum $SCRATCH_MNT/foo | _filter_scratch
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    # During mount, we'll replay the log created by the fsync above, and the file's
    # md5 digest should be the same we got before the unmount.
    _mount_flakey
    md5sum $SCRATCH_MNT/foo | _filter_scratch
    _unmount_flakey
    MOUNT_OPTIONS="$SAVE_MOUNT_OPTIONS"

    status=0
    exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:24 -07:00
Miao Xie
995946dd29 Btrfs: use helpers for last_trans_log_full_commit instead of opencode
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:45 -07:00
Miao Xie
27cdeb7096 Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:40 -07:00
Miao Xie
50471a388c Btrfs: stop joining the log transaction if sync log fails
If the log sync fails, there is something wrong in the log tree, we
should not continue to join the log transaction and log the metadata.
What we should do is to do a full commit.

This patch fixes this problem by setting ->last_trans_log_full_commit
to the current transaction id, it will tell the tasks not to join
the log transaction, and do a full commit.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:44 -04:00
Miao Xie
d1433debe7 Btrfs: just wait or commit our own log sub-transaction
We might commit the log sub-transaction which didn't contain the metadata we
logged. It was because we didn't record the log transid and just select
the current log sub-transaction to commit, but the right one might be
committed by the other task already. Actually, we needn't do anything
and it is safe that we go back directly in this case.

This patch improves the log sync by the above idea. We record the transid
of the log sub-transaction in which we log the metadata, and the transid
of the log sub-transaction we have committed. If the committed transid
is >= the transid we record when logging the metadata, we just go back.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:43 -04:00
Miao Xie
8b050d350c Btrfs: fix skipped error handle when log sync failed
It is possible that many tasks sync the log tree at the same time, but
only one task can do the sync work, the others will wait for it. But those
wait tasks didn't get the result of the log sync, and returned 0 when they
ended the wait. It caused those tasks skipped the error handle, and the
serious problem was they told the users the file sync succeeded but in
fact they failed.

This patch fixes this problem by introducing a log context structure,
we insert it into the a global list. When the sync fails, we will set
the error number of every log context in the list, then the waiting tasks
get the error number of the log context and handle the error if need.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:43 -04:00
Miao Xie
bb14a59b61 Btrfs: use signed integer instead of unsigned long integer for log transid
The log trans id is initialized to be 0 every time we create a log tree,
and the log tree need be re-created after a new transaction is started,
it means the log trans id is unlikely to be a huge number, so we can use
signed integer instead of unsigned long integer to save a bit space.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:42 -04:00
Miao Xie
7483e1a446 Btrfs: remove unnecessary memory barrier in btrfs_sync_log()
Mutex unlock implies certain memory barriers to make sure all the memory
operation completes before the unlock, and the next mutex lock implies memory
barriers to make sure the all the memory happens after the lock. So it is
a full memory barrier(smp_mb), we needn't add memory barriers. Remove them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:41 -04:00
Miao Xie
e87ac13687 Btrfs: don't start the log transaction if the log tree init fails
The old code would start the log transaction even the log tree init
failed, it was unnecessary. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:40 -04:00
Miao Xie
48cab2e071 Btrfs: fix the skipped transaction commit during the file sync
We may abort the wait earlier if ->last_trans_log_full_commit was set to
the current transaction id, at this case, we need commit the current
transaction instead of the log sub-transaction. But the current code
didn't tell the caller to do it (return 0, not -EAGAIN). Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:40 -04:00
Miao Xie
5c902ba622 Btrfs: use ACCESS_ONCE to prevent the optimize accesses to ->last_trans_log_full_commit
->last_trans_log_full_commit may be changed by the other tasks without lock,
so we need prevent the compiler from the optimize access just like
	tmp = fs_info->last_trans_log_full_commit
	if (tmp == ...)
		...

	<do something>

	if (tmp == ...)
		...

In fact, we need get the new value of ->last_trans_log_full_commit during
the second access. Fix it by ACCESS_ONCE().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:39 -04:00
Miao Xie
827463c49f Btrfs: don't mix the ordered extents of all files together during logging the inodes
There was a problem in the old code:
If we failed to log the csum, we would free all the ordered extents in the log list
including those ordered extents that were logged successfully, it would make the
log committer not to wait for the completion of the ordered extents.

This patch doesn't insert the ordered extents that is about to be logged into
a global list, instead, we insert them into a local list. If we log the ordered
extents successfully, we splice them with the global list, or we will throw them
away, then do full sync. It can also reduce the lock contention and the traverse
time of list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:36 -04:00
Chris Mason
514ac8ad87 Btrfs: don't use ram_bytes for uncompressed inline items
If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
the new size.  The fixe uses the size directly from the item header when
reading uncompressed inlines, and also fixes truncate to update the
size as it goes.

Reported-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
CC: stable@vger.kernel.org
2014-01-29 07:06:29 -08:00
Miao Xie
23c671a588 Btrfs: flush the dirty pages of the ordered extent aggressively during logging csum
The performance of fsync dropped down suddenly sometimes, the main reason
of this problem was that we might only flush part dirty pages in a ordered
extent, then got that ordered extent, wait for the csum calcucation. But if
no task flushed the left part, we would wait until the flusher flushed them,
sometimes we need wait for several seconds, it made the performance drop
down suddenly. (On my box, it drop down from 56MB/s to 4-10MB/s)

This patch improves the above problem by flushing left dirty pages aggressively.

Test Environment:
CPU:		2CPU * 2Cores
Memory:		4GB
Partition:	20GB(HDD)

Test Command:
 # sysbench --num-threads=8 --test=fileio --file-num=1 \
 > --file-total-size=8G --file-block-size=32768 \
 > --file-io-mode=sync --file-fsync-freq=100 \
 > --file-fsync-end=no --max-requests=10000 \
 > --file-test-mode=rndwr run

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:37 -08:00
Filipe David Borba Manana
1acae57b16 Btrfs: faster file extent item replace operations
When writing to a file we drop existing file extent items that cover the
write range and then add a new file extent item that represents that write
range.

Before this change we were doing a tree lookup to remove the file extent
items, and then after we did another tree lookup to insert the new file
extent item.
Most of the time all the file extent items we need to drop are located
within a single leaf - this is the leaf where our new file extent item ends
up at. Therefore, in this common case just combine these 2 operations into
a single one.

By avoiding the second btree navigation for insertion of the new file extent
item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
COW operations, CPU time on btree node/leaf key binary searches, etc.

Besides for file writes, this is an operation that happens for file fsync's
as well. However log btrees are much less likely to big as big as regular
fs btrees, therefore the impact of this change is smaller.

The following benchmark was performed against an SSD drive and a
HDD drive, both for random and sequential writes:

  sysbench --test=fileio --file-num=4096 --file-total-size=8G \
     --file-test-mode=[rndwr|seqwr] --num-threads=512 \
     --file-block-size=8192 \ --max-requests=1000000 \
     --file-fsync-freq=0 --file-io-mode=sync [prepare|run]

All results below are averages of 10 runs of the respective test.

** SSD sequential writes

Before this change: 225.88 Mb/sec
After this change:  277.26 Mb/sec

** SSD random writes

Before this change: 49.91 Mb/sec
After this change:  56.39 Mb/sec

** HDD sequential writes

Before this change: 68.53 Mb/sec
After this change:  69.87 Mb/sec

** HDD random writes

Before this change: 13.04 Mb/sec
After this change:  14.39 Mb/sec

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:23 -08:00
Kelley Nielsen
3f870c2899 btrfs: expand btrfs_find_item() to include find_orphan_item functionality
This is the third step in bootstrapping the btrfs_find_item interface.
The function find_orphan_item(), in orphan.c, is similar to the two
functions already replaced by the new interface. It uses two parameters,
which are already present in the interface, and is nearly identical to
the function brought in in the previous patch.

Replace the two calls to find_orphan_item() with calls to
btrfs_find_item(), with the defined objectid and type that was used
internally by find_orphan_item(), a null path, and a null key. Add a
test for a null path to btrfs_find_item, and if it passes, allocate and
free the path. Finally, remove find_orphan_item().

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:37 -08:00
Josef Bacik
16e7549f04 Btrfs: incompatible format change to remove hole extents
Btrfs has always had these filler extent data items for holes in inodes.  This
has made somethings very easy, like logging hole punches and sending hole
punches.  However for large holey files these extent data items are pure
overhead.  So add an incompatible feature to no longer add hole extents to
reduce the amount of metadata used by these sort of files.  This has a few
changes for logging and send obviously since they will need to detect holes and
log/send the holes if there are any.  I've tested this thoroughly with xfstests
and it doesn't cause any issues with and without the incompat format set.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:21 -08:00
Josef Bacik
d006a04816 Btrfs: only drop modified extents if we logged the whole inode
If we fsync, seek and write, rename and then fsync again we will lose the
modified hole extent because the rename will drop all of the modified extents
since we didn't do the fast search.  We need to only drop the modified extents
if we didn't do the fast search and we were logging the entire inode as we don't
need them anymore, otherwise this is being premature.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:41:32 -05:00
Josef Bacik
6cfab851f4 Btrfs: make sure to copy everything if we rename
If we rename a file that is already in the log and we fsync again we will lose
the new name.  This is because we just log the inode update and not the new ref.
To fix this we just need to check if we are logging the new name of the inode
and copy all the metadata instead of just updating the inode itself.  With this
patch my testcase now passes.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:41:24 -05:00
Dulshani Gunawardhana
fae7f21cec btrfs: Use WARN_ON()'s return value in place of WARN_ON(1)
Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source
code that outputs a more descriptive warnings. Also fix the styling
warning of redundant braces that came up as a result of this fix.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:53 -05:00
Filipe David Borba Manana
269d040ff2 Btrfs: log recovery, don't unlink inode always on error
If we get any error while doing a dir index/item lookup in the
log tree, we were always unlinking the corresponding inode in
the subvolume. It makes sense to unlink only if the lookup failed
to find the dir index/item, which corresponds to NULL or -ENOENT,
and not when other errors happen (like a transient -ENOMEM or -EIO).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:10:48 -05:00
Filipe David Borba Manana
488111aa0e Btrfs: fix csum search offset/length calculation in log tree
We were setting the csums search offset and length to the right values if
the extent is compressed, but later on right before doing the csums lookup
we were overriding these two parameters regardless of compression being
set or not for the extent.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:10:42 -05:00
Wang Shilong
9b011adfe1 Btrfs: remove scrub_super_lock holding in btrfs_sync_log()
Originally, we introduced scrub_super_lock to synchronize
tree log code with scrubbing super.

However we can replace scrub_super_lock with device_list_mutex,
because writing super will hold this mutex, this will reduce an extra
lock holding when writing supers in sync log code.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:10:13 -05:00
Zach Brown
8b558c5f09 btrfs: remove fs/btrfs/compat.h
fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink()
and inc_nlink().  This doesn't belong in mainline.

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:03:19 -05:00
Filipe David Borba Manana
e93ae26fe1 Btrfs: optimize tree-log.c:count_inode_refs()
Avoid repeated tree searches by processing all inode ref items in
a leaf at once instead of processing one at a time, followed by a
path release and a tree search for a key with a decremented offset.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:02:19 -05:00
Geyslan G. Bem
03b2f08b5f btrfs: Fix memory leakage in the tree-log.c
In add_inode_ref() function:

Initializes local pointers.

Reduces the logical condition with the __add_inode_ref() return
value by using only one 'goto out'.

Centralizes the exiting, ensuring the freeing of all used memory.

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:00:11 -05:00