This adds basic O_DIRECT read and write support. In the write case, we
just do a normal buffered write followed by a cache flush. O_DIRECT +
O_SYNC are required to trigger metadata syncs.
In the read case, there is a basic btrfs_get_block call for use by
the generic O_DIRECT code. This does honor multi-volume mapping rules
but it skips all checksumming.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before, it was done by the bio end_io routine. The work queue code is able
to scale much better with faster IO subsystems.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before, metadata checksumming was done by the callers of read_tree_block,
which would set EXTENT_CSUM bits in the extent tree to show that a given
range of pages was already checksummed and didn't need to be verified
again.
But, those bits could go away via try_to_releasepage, and the end
result was bogus checksum failures on pages that never left the cache.
The new code validates checksums when the page is read. It is a little
tricky because metadata blocks can span pages and a single read may
end up going via multiple bios.
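
A minimal sketch of the idea, not the actual patch: each page completion
is recorded, and the checksum is only computed once every page of the
block has arrived. The names meta_block, meta_block_page_done and
sketch_csum are hypothetical, the page size and page limit are assumed,
and the checksum is an FNV-1a stand-in rather than the crc32c Btrfs uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */
#define SKETCH_MAX_PAGES 4u	/* assumed upper bound for this sketch */
#define SKETCH_CSUM_SEED 2166136261u

/* Hypothetical view of one metadata block spread over several pages. */
struct meta_block {
	const uint8_t *pages[SKETCH_MAX_PAGES];	/* filled as reads complete */
	unsigned int nr_pages;			/* pages the block spans */
	unsigned int pages_read;		/* pages that have arrived */
	uint32_t expected_csum;			/* checksum stored with the block */
	bool verified;
};

/* Stand-in checksum (FNV-1a), chained across pages through the seed. */
static uint32_t sketch_csum(uint32_t seed, const uint8_t *data, size_t len)
{
	uint32_t h = seed;

	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Called once per page as its read completes, in any order. Returns true
 * only when the last page has arrived and the checksum over the whole
 * block matches; returns false while pages are missing or on a mismatch.
 */
static bool meta_block_page_done(struct meta_block *b, unsigned int index,
				 const uint8_t *page)
{
	uint32_t csum = SKETCH_CSUM_SEED;

	assert(b->nr_pages <= SKETCH_MAX_PAGES && index < b->nr_pages);

	if (b->pages[index] == NULL) {
		b->pages[index] = page;
		b->pages_read++;
	}
	if (b->pages_read < b->nr_pages)
		return false;

	/* All pages are present: checksum them in block order. */
	for (unsigned int i = 0; i < b->nr_pages; i++)
		csum = sketch_csum(csum, b->pages[i], SKETCH_PAGE_SIZE);

	b->verified = (csum == b->expected_csum);
	return b->verified;
}
```

Because verification waits for the final page, it does not matter which
bio finishes first.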
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a block is freed, it can be immediately reused if it is from
the current transaction. But, an extra check is required to make sure
the block had not been written yet. If it were reused after being written,
the transid in the old block header could match the transid recorded the
next time the block was allocated.
The parent node records the transaction ID of the block it is pointing to,
and this is used as part of validating the block on reads. So, there
can only be one version of a block per transaction.
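
The two rules above could be sketched roughly as follows. The struct and
function names are hypothetical and the fields are simplified; this is
an illustration of the checks, not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pointer stored in a parent node. */
struct sketch_block_ptr {
	uint64_t blocknr;
	uint64_t generation;	/* transid the parent expects */
};

/* Hypothetical header found at the start of the block itself. */
struct sketch_header {
	uint64_t blocknr;
	uint64_t generation;	/* transid the block was allocated in */
};

/* Read-side validation: the block must be the version the parent recorded. */
static bool block_matches_parent(const struct sketch_block_ptr *ptr,
				 const struct sketch_header *hdr)
{
	assert(ptr != NULL && hdr != NULL);
	return hdr->blocknr == ptr->blocknr &&
	       hdr->generation == ptr->generation;
}

/*
 * Free-side check: immediate reuse is only allowed for a block that was
 * allocated in the running transaction and has not been written yet.
 */
static bool can_reuse_freed_block(const struct sketch_header *hdr,
				  uint64_t running_transid, bool written)
{
	assert(hdr != NULL);
	return hdr->generation == running_transid && !written;
}
```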
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Checksums were only verified by btrfs_read_tree_block, which meant the
functions to probe the page cache for blocks were not validating checksums.
Normally this is fine because the buffers will only be in cache if they
have already been validated.
But, there is a window while the buffer is being read from disk where
it could be up to date in the cache but not yet verified. This patch
makes sure all buffers go through checksum verification before they
are used.
This is safer, and it prevents modification of buffers before they go
through the csum code.
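
As a rough illustration, a cache lookup can refuse to hand out a buffer
until it has passed verification. The flag names, struct cached_buf and
find_verified_buf are hypothetical, and the checksum is an FNV-1a
stand-in, not crc32c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
	BUF_UPTODATE = 1u << 0,	/* data has been read from disk */
	BUF_VERIFIED = 1u << 1,	/* checksum has been checked */
};

/* Hypothetical cached metadata buffer. */
struct cached_buf {
	unsigned int flags;
	uint32_t stored_csum;
	const uint8_t *data;
	size_t len;
};

/* Stand-in checksum (FNV-1a). */
static uint32_t sketch_csum(const uint8_t *data, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Returns the buffer only when it is safe to use. An up to date buffer
 * that has not been verified yet is checked here first, which closes
 * the window described above. Returns NULL otherwise.
 */
static struct cached_buf *find_verified_buf(struct cached_buf *buf)
{
	if (buf == NULL || !(buf->flags & BUF_UPTODATE))
		return NULL;

	if (!(buf->flags & BUF_VERIFIED)) {
		assert(buf->data != NULL);
		if (sketch_csum(buf->data, buf->len) != buf->stored_csum)
			return NULL;
		buf->flags |= BUF_VERIFIED;
	}
	return buf;
}
```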
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There was an optimization to drop the fs_mutex when doing snapshot deletion
reads, but this can lead to false positives on checksumming errors. Keep
the lock for now.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs_name_hash, local variable 'buf' is declared as

	__u32 buf[2];

but we then try to do this:

	buf[0] = 0x67452301;
	buf[1] = 0xefcdab89;
	buf[2] = 0x98badcfe;
	buf[3] = 0x10325476;

Oops. Fix buf to be the proper size.
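
One way to keep the array size and its initializers from drifting apart
again, sketched with hypothetical names (HASH_STATE_WORDS and
hash_state_init are not part of the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_STATE_WORDS 4	/* one slot per seed constant below */

/* Fill the hash state from a table whose length defines the loop bound. */
static void hash_state_init(uint32_t buf[HASH_STATE_WORDS])
{
	static const uint32_t seed[HASH_STATE_WORDS] = {
		0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476,
	};

	assert(buf != NULL);
	for (size_t i = 0; i < HASH_STATE_WORDS; i++)
		buf[i] = seed[i];
}
```

A caller would then declare the buffer as
uint32_t buf[HASH_STATE_WORDS], so the declaration and the stores share
a single constant.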
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows detection of blocks that have already been written in the
running transaction so they can be recowed instead of modified again.
It is step one in trusting the transid field of the block pointers.
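
The decision this enables could look roughly like the sketch below. The
flag, struct and function names are hypothetical stand-ins for
illustration, not the identifiers used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { HDR_FLAG_WRITTEN = 1u << 0 };	/* hypothetical "already on disk" flag */

/* Hypothetical, simplified block header. */
struct sketch_header {
	uint64_t generation;	/* transid the block was allocated in */
	unsigned int flags;
};

/* Record that the block went to disk during the running transaction. */
static void mark_written(struct sketch_header *hdr)
{
	assert(hdr != NULL);
	hdr->flags |= HDR_FLAG_WRITTEN;
}

/*
 * A block may be modified in place only if it belongs to the running
 * transaction and has not been written yet. Anything else is recowed.
 */
static bool needs_cow(const struct sketch_header *hdr, uint64_t running_transid)
{
	assert(hdr != NULL);
	if (hdr->generation == running_transid &&
	    !(hdr->flags & HDR_FLAG_WRITTEN))
		return false;
	return true;
}
```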
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Here's a patch against the unstable tree that gets the code to build
against Linus's current tree (2.6.24-git12). This is needed as the
kobject/kset api has changed there.
I tried to make the smallest changes needed, and it builds and loads
successfully, but I don't have a btrfs volume anywhere (yet) to try to
see if things still work properly :)
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we checksum file data during writepage, the checksumming is done one
page at a time, making it difficult to do bulk metadata modifications
to insert checksums for large ranges of the file at once.
This patch changes btrfs to checksum on a per-bio basis instead. The
bios are checksummed before they are handed off to the block layer, so
each bio is contiguous and only has pages from the same inode.
Checksumming on a bio basis allows us to insert and modify the file
checksum items in large groups. It also allows the checksumming to
be done more easily by async worker threads.
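
A simplified sketch of the per-bio approach: all pages of one bio are
checksummed in a single pass into an array, which can then be inserted
as one group. struct sketch_bio and csum_one_bio are hypothetical, the
page size is assumed, and the checksum is an FNV-1a stand-in rather than
crc32c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */
#define SKETCH_BIO_MAX_PAGES 16u	/* assumed limit for this sketch */

/* Hypothetical stand-in for a bio: contiguous pages from one inode. */
struct sketch_bio {
	const uint8_t *pages[SKETCH_BIO_MAX_PAGES];
	size_t nr_pages;
	uint64_t file_offset;	/* offset of the first page in the file */
};

/* Stand-in checksum (FNV-1a). */
static uint32_t sketch_csum(const uint8_t *data, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Checksum every page of the bio before it is submitted. Returns the
 * number of checksums stored in sums[], or 0 if sums[] is too small.
 * The caller can then insert the whole range of checksums at once.
 */
static size_t csum_one_bio(const struct sketch_bio *bio, uint32_t *sums,
			   size_t max_sums)
{
	assert(bio != NULL && sums != NULL);
	assert(bio->nr_pages <= SKETCH_BIO_MAX_PAGES);

	if (bio->nr_pages > max_sums)
		return 0;

	for (size_t i = 0; i < bio->nr_pages; i++)
		sums[i] = sketch_csum(bio->pages[i], SKETCH_PAGE_SIZE);
	return bio->nr_pages;
}
```

Since the function only reads the bio and writes to a private array, it
is also the kind of work that can be handed to a worker thread.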
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng noticed that we don't clear the extent state tree dirty and delalloc
bits when we clear the dirty bits on the page during file write.
This leads to csum errors later on.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reduce CPU time searching for free blocks by optimizing find_first_extent_bit.
Fix find_free_extent to make better use of the last_alloc hint. Before it
was often finding blocks just before the hint.
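
To illustrate what honoring the hint means, here is a small sketch over
a sorted list of free ranges. The data structure, the function name and
the wrap-around fallback are assumptions made for the example; the real
allocator works on the extent state tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free range; the array is sorted by start, no overlaps. */
struct free_extent {
	uint64_t start;
	uint64_t len;
};

/*
 * Return the start of the first free range of at least 'size' blocks
 * that begins at or after 'hint'. If nothing fits there, search again
 * from block 0 (an assumed fallback). Returns UINT64_MAX if no range fits.
 */
static uint64_t find_free_from_hint(const struct free_extent *free, size_t n,
				    uint64_t hint, uint64_t size)
{
	assert(size > 0);

	/* Pass 0 honors the hint, pass 1 wraps around to the beginning. */
	for (int pass = 0; pass < 2; pass++) {
		uint64_t floor = (pass == 0) ? hint : 0;

		for (size_t i = 0; i < n; i++) {
			uint64_t end = free[i].start + free[i].len;
			uint64_t start = free[i].start > floor ?
					 free[i].start : floor;

			/* Never return space that lies before the floor. */
			if (start < end && end - start >= size)
				return start;
		}
	}
	return UINT64_MAX;
}
```

With free ranges [0,10), [20,25) and [40,140), a hint of 5 and a size of
4 yields block 5 rather than block 0, so the result never lands just
before the hint while space at or after it is available.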
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs set/get macros lose type information needed to avoid
unaligned accesses on sparc64.
Here is a patch for the kernel bits which fixes most of the
unaligned accesses on sparc64.
btrfs_name_hash is modified to return the hash value instead
of getting a return location via a (potentially unaligned)
pointer.
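
The pattern can be sketched in portable C as below: the hash is
returned by value, and any store into a possibly unaligned location goes
through memcpy. All names here are hypothetical, and the hash is a
64-bit FNV-1a stand-in, not the real btrfs_name_hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value from an address with no alignment guarantee. */
static uint64_t get_unaligned_u64(const void *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Write a 64-bit value to an address with no alignment guarantee. */
static void put_unaligned_u64(void *p, uint64_t v)
{
	memcpy(p, &v, sizeof(v));
}

/* Stand-in hash that returns its result instead of writing through a pointer. */
static uint64_t sketch_name_hash(const char *name, size_t len)
{
	uint64_t h = 14695981039346656037ULL;

	assert(name != NULL);
	for (size_t i = 0; i < len; i++) {
		h ^= (uint8_t)name[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* The caller decides how to store the value, here into a packed slot. */
static void store_name_hash(void *slot, const char *name, size_t len)
{
	put_unaligned_u64(slot, sketch_name_hash(name, len));
}
```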
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A few pieces of code were not properly updated for the extent map changes.
This may be the cause of the "no csum found for inode" issue.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Now that delayed allocation accounting works, i_blocks accounting is changed
to only modify i_blocks when extents are inserted or removed.
The fillattr call is changed to include the delayed allocation byte count
in the i_blocks result.
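
A simplified model of the accounting, with hypothetical names: the
on-disk byte count changes only when extents are inserted or removed,
and the value reported to stat adds the pending delayed allocation
bytes. Rounding up to whole 512-byte units is a choice made for this
sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-inode counters. */
struct sketch_inode {
	uint64_t disk_bytes;		/* bytes backed by real extents */
	uint64_t delalloc_bytes;	/* dirty bytes not yet allocated */
};

/* The only two places that touch the on-disk byte count. */
static void extent_inserted(struct sketch_inode *in, uint64_t bytes)
{
	assert(in != NULL);
	in->disk_bytes += bytes;
}

static void extent_removed(struct sketch_inode *in, uint64_t bytes)
{
	assert(in != NULL && in->disk_bytes >= bytes);
	in->disk_bytes -= bytes;
}

/* Value reported to stat, in 512-byte units, including delalloc bytes. */
static uint64_t stat_blocks(const struct sketch_inode *in)
{
	assert(in != NULL);
	return (in->disk_bytes + in->delalloc_bytes + 511) >> 9;
}
```

When delayed allocation bytes are later turned into a real extent of the
same size, the reported block count stays the same.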
Signed-off-by: Chris Mason <chris.mason@oracle.com>