linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-12 07:01:57 +00:00

Author	SHA1	Message	Date
Josef Bacik	97e728d435	Btrfs: try to keep a healthy ratio of metadata vs data block groups This patch makes the chunk allocator keep a good ratio of metadata vs data block groups. By default for every 8 data block groups, we'll allocate 1 metadata chunk, or about 12% of the disk will be allocated for metadata. This can be changed by specifying the metadata_ratio mount option. This is simply the number of data block groups that have to be allocated to force a metadata chunk allocation. By making sure we allocate metadata chunks more often, we are less likely to get into situations where the whole disk has been allocated as data block groups. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-24 15:46:02 -04:00
Chris Mason	546888da82	Btrfs: fix btrfs fallocate oops and deadlock Btrfs fallocate was incorrectly starting a transaction with a lock held on the extent_io tree for the file, which could deadlock. Strictly speaking it was using join_transaction which would be safe, but it is better to move the transaction outside of the lock. When preallocated extents are overwritten, btrfs_mark_buffer_dirty was being called on an unlocked buffer. This was triggering an assertion and oops because the lock is supposed to be held. The bug was calling btrfs_mark_buffer_dirty on a leaf after btrfs_del_item had been run. btrfs_del_item takes care of dirtying things, so the solution is to skip the btrfs_mark_buffer_dirty call in this case. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-21 12:45:12 -04:00
Chris Mason	8c594ea81d	Btrfs: use the right node in reada_for_balance reada_for_balance was using the wrong index into the path node array, so it wasn't reading the right blocks. We never directly used the results of the read done by this function because the btree search is started over at the end. This fixes reada_for_balance to reada in the correct node and to avoid searching past the last slot in the node. It also makes sure to hold the parent lock while we are finding the nodes to read. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-20 15:53:09 -04:00
Chris Mason	11c8349b4e	Btrfs: fix oops on page->mapping->host during writepage The extent_io writepage call updates the writepage index in the inode as it makes progress. But, it was doing the update after unlocking the page, which isn't legal because page->mapping can't be trusted once the page is unlocked. This led to an oops, especially common with compression turned on. The fix here is to update the writeback index before unlocking the page. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-20 15:53:09 -04:00
Chris Mason	d313d7a31a	Btrfs: add a priority queue to the async thread helpers Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a higher priority. But, the checksumming helper threads prevent it from being fully effective. There are two problems. First, a big queue of pending checksumming will delay the synchronous IO behind other lower priority writes. Second, the checksumming uses an ordered async work queue. The ordering makes sure that IOs are sent to the block layer in the same order they are sent to the checksumming threads. Usually this gives us less seeky IO. But, when we start mixing IO priorities, the lower priority IO can delay the higher priority IO. This patch solves both problems by adding a high priority list to the async helper threads, and a new btrfs_set_work_high_prio(), which is used to put a new async work item onto the higher priority list. The ordering is still done on high priority IO, but all of the high priority bios are ordered separately from the low priority bios. This ordering is purely an IO optimization, it is not involved in data or metadata integrity. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-20 15:53:08 -04:00
Chris Mason	ffbd517d5a	Btrfs: use WRITE_SYNC for synchronous writes Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for writes we plan on waiting on in the near future. This patch mirrors recent changes in other filesystems and the generic code to use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for other latency critical writes. Btrfs uses async worker threads for checksumming before the write is done, and then again to actually submit the bios. The bio submission code just runs a per-device list of bios that need to be sent down the pipe. This list is split into low priority and high priority lists so the WRITE_SYNC IO happens first. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-20 15:53:08 -04:00
Linus Torvalds	b983471794	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: BUG to BUG_ON changes Btrfs: remove dead code Btrfs: remove dead code Btrfs: fix typos in comments Btrfs: remove unused ftrace include Btrfs: fix __ucmpdi2 compile bug on 32 bit builds Btrfs: free inode struct when btrfs_new_inode fails Btrfs: fix race in worker_loop Btrfs: add flushoncommit mount option Btrfs: notreelog mount option Btrfs: introduce btrfs_show_options Btrfs: rework allocation clustering Btrfs: Optimize locking in btrfs_next_leaf() Btrfs: break up btrfs_search_slot into smaller pieces Btrfs: kill the pinned_mutex Btrfs: kill the block group alloc mutex Btrfs: clean up find_free_extent Btrfs: free space cache cleanups Btrfs: unplug in the async bio submission threads Btrfs: keep processing bios for a given bdev if our proc is batching	2009-04-03 15:14:44 -07:00
Linus Torvalds	8fe74cf053	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: Remove two unneeded exports and make two symbols static in fs/mpage.c Cleanup after commit `585d3bc06f` Trim includes of fdtable.h Don't crap into descriptor table in binfmt_som Trim includes in binfmt_elf Don't mess with descriptor table in load_elf_binary() Get rid of indirect include of fs_struct.h New helper - current_umask() check_unsafe_exec() doesn't care about signal handlers sharing New locking/refcounting for fs_struct Take fs_struct handling to new file (fs/fs_struct.c) Get rid of bumping fs_struct refcount in pivot_root(2) Kill unsharing fs_struct in __set_personality()	2009-04-02 21:09:10 -07:00
Stoyan Gaydarov	c293498be6	Btrfs: BUG to BUG_ON changes Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-02 17:05:11 -04:00
Dan Carpenter	3e7ad38d20	Btrfs: remove dead code Remove an unneeded return statement and conditional Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-02 16:46:06 -04:00
Dan Carpenter	ff0a5836ac	Btrfs: remove dead code merge is always NULL at this point. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-02 16:46:06 -04:00
Wu Fengguang	d4a789474a	Btrfs: fix typos in comments Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-02 16:46:06 -04:00
Jim Owens	2e966ed22c	Btrfs: remove unused ftrace include Signed-off-by: jim owens <jowens@hp.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-02 17:02:55 -04:00
Heiko Carstens	93dbfad7ac	Btrfs: fix __ucmpdi2 compile bug on 32 bit builds We get this on 32 bit builds: fs/built-in.o: In function `extent_fiemap': (.text+0x1019f2): undefined reference to `__ucmpdi2' Happens because of a switch statement with a 64 bit argument. Convert this to an if statement to fix this. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-03 10:33:45 -04:00
Shen Feng	09771430f3	Btrfs: free inode struct when btrfs_new_inode fails btrfs_new_inode doesn't call iput to free the inode when it fails. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-02 16:46:06 -04:00
Amit Gud	b5555f7711	Btrfs: fix race in worker_loop Need to check kthread_should_stop after schedule_timeout() before calling schedule(). This causes threads to sleep with potentially no one to wake them up causing mount(2) to hang in btrfs_stop_workers waiting for threads to stop. Signed-off-by: Amit Gud <gud@ksu.edu> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-02 17:01:27 -04:00
Sage Weil	dccae99995	Btrfs: add flushoncommit mount option The 'flushoncommit' mount option forces any data dirtied by a write in a prior transaction to commit as part of the current commit. This makes the committed state a fully consistent view of the file system from the application's perspective (i.e., it includes all completed file system operations). This was previously the behavior only when a snapshot is created. This is used by Ceph to ensure that completed writes make it to the platter along with the metadata operations they are bound to (by BTRFS_IOC_TRANS_{START,END}). Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-02 16:59:01 -04:00
Sage Weil	3a5e14048a	Btrfs: notreelog mount option Add a 'notreelog' mount option to disable the tree log (used by fsync, O_SYNC writes). This is much slower, but the tree logging produces inconsistent views into the FS for ceph. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-02 16:49:40 -04:00
Eric Paris	a9572a15a8	Btrfs: introduce btrfs_show_options btrfs options can change at times other than mount, yet /proc/mounts shows the options string used when the fs was mounted (an example would be when btrfs determines that barriers aren't useful and turns them off.) This patch instead outputs the actual options in use by btrfs. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-02 16:46:06 -04:00
Chris Mason	fa9c0d795f	Btrfs: rework allocation clustering Because btrfs is copy-on-write, we end up picking new locations for blocks very often. This makes it fairly difficult to maintain perfect read patterns over time, but we can at least do some optimizations for writes. This is done today by remembering the last place we allocated and trying to find a free space hole big enough to hold more than just one allocation. The end result is that we tend to write sequentially to the drive. This happens all the time for metadata and it happens for data when mounted -o ssd. But, the way we record it is fairly racy and it tends to fragment the free space over time because we are trying to allocate fairly large areas at once. This commit gets rid of the races by adding a free space cluster object with dedicated locking to make sure that only one process at a time is out replacing the cluster. The free space fragmentation is somewhat solved by allowing a cluster to be comprised of smaller free space extents. This part definitely adds some CPU time to the cluster allocations, but it allows the allocator to consume the small holes left behind by cow. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-03 09:47:43 -04:00
Chris Mason	8e73f27501	Btrfs: Optimize locking in btrfs_next_leaf() btrfs_next_leaf was using blocking locks when it could have been using faster spinning ones instead. This adds a few extra checks around the pieces that block and switches over to spinning locks. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-03 10:14:18 -04:00
Chris Mason	c8c42864f6	Btrfs: break up btrfs_search_slot into smaller pieces btrfs_search_slot was doing too many things at once. This breaks it up into more reasonable units. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-03 10:14:18 -04:00
Josef Bacik	04018de5d4	Btrfs: kill the pinned_mutex This patch removes the pinned_mutex. The extent io map has an internal tree lock that protects the tree itself, and since we only copy the extent io map when we are committing the transaction we don't need it there. We also don't need it when caching the block group since searching through the tree is also protected by the internal map spin lock. Signed-off-by: Josef Bacik <jbacik@redhat.com>	2009-04-03 10:14:18 -04:00
Josef Bacik	6226cb0a5e	Btrfs: kill the block group alloc mutex This patch removes the block group alloc mutex used to protect the free space tree for allocations and replaces it with a spin lock which is used only to protect the free space rb tree. This means we only take the lock when we are directly manipulating the tree, which makes us a touch faster with multi-threaded workloads. This patch also gets rid of btrfs_find_free_space and replaces it with btrfs_find_space_for_alloc, which takes the number of bytes you want to allocate, and empty_size, which is used to indicate how much free space should be at the end of the allocation. It will return an offset for the allocator to use. If we don't end up using it we _must_ call btrfs_add_free_space to put it back. This is the tradeoff to kill the alloc_mutex, since we need to make sure nobody else comes along and takes our space. Signed-off-by: Josef Bacik <jbacik@redhat.com>	2009-04-03 10:14:18 -04:00
Josef Bacik	2552d17e32	Btrfs: clean up find_free_extent I've replaced the strange looping constructs with a list_for_each_entry on space_info->block_groups. If we have a hint we just jump into the loop with the block group and start looking for space. If we don't find anything we start at the beginning and start looking. We never come out of the loop with a ref on the block_group _unless_ we found space to use, then we drop it after we set the trans block_group. Signed-off-by: Josef Bacik <jbacik@redhat.com>	2009-04-03 10:14:19 -04:00
Josef Bacik	70cb074345	Btrfs: free space cache cleanups This patch cleans up the free space cache code a bit. It better documents the idiosyncrasies of tree_search_offset and makes the code make a bit more sense. I took out the info allocation at the start of __btrfs_add_free_space and put it where it makes more sense. This was left over cruft from when alloc_mutex existed. Also all of the re-searches we do to make sure we inserted properly. Signed-off-by: Josef Bacik <jbacik@redhat.com>	2009-04-03 10:14:19 -04:00
Chris Mason	bedf762ba3	Btrfs: unplug in the async bio submission threads Btrfs pages being written get set to writeback, and then may go through a number of steps before they hit the block layer. This includes compression, checksumming and async bio submission. The end result is that someone who writes a page and then does wait_on_page_writeback is likely to unplug the queue before the bio they cared about got there. We could fix this by marking bios sync, or by doing more frequent unplugs, but this commit just changes the async bio submission code to unplug after it has processed all the bios for a device. The async bio submission does a fair job of collecting bios, so this shouldn't be a huge problem for reducing merging at the elevator. For streaming O_DIRECT writes on a 5 drive array, it boosts performance from 386MB/s to 460MB/s. Thanks to Hisashi Hifumi for helping with this work. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-03 10:32:58 -04:00
Chris Mason	b765ead57d	Btrfs: keep processing bios for a given bdev if our proc is batching Btrfs uses async helper threads to submit write bios so the checksumming helper threads don't block on the disk. The submit bio threads may process bios for more than one block device, so when they find one device congested they try to move on to other devices instead of blocking in get_request_wait for one device. This does a pretty good job of keeping multiple devices busy, but the congested flag has a number of problems. A congested device may still give you a request, and other procs that aren't backing off the congested device may starve you out. This commit uses the io_context stored in current to decide if our process has been made a batching process by the block layer. If so, it keeps sending IO down for at least one batch. This helps make sure we do a good amount of work each time we visit a bdev, and avoids large IO stalls in multi-device workloads. It's also very ugly. A better solution is in the works with Jens Axboe. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-03 10:27:10 -04:00
Linus Torvalds	c226fd659f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: try to free metadata pages when we free btree blocks Btrfs: add extra flushing for renames and truncates Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod Btrfs: optimize fsyncs on old files Btrfs: tree logging unlink/rename fixes Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay Btrfs: limit balancing work while flushing delayed refs Btrfs: readahead checksums during btrfs_finish_ordered_io Btrfs: leave btree locks spinning more often Btrfs: Only let very young transactions grow during commit Btrfs: Check for a blocking lock before taking the spin Btrfs: reduce stack in cow_file_range Btrfs: reduce stalls during transaction commit Btrfs: process the delayed reference queue in clusters Btrfs: try to cleanup delayed refs while freeing extents Btrfs: reduce stack usage in some crucial tree balancing functions Btrfs: do extent allocation and reference count updates in the background Btrfs: don't preallocate metadata blocks during btrfs_search_slot	2009-04-01 10:20:44 -07:00
Nick Piggin	56a76f8275	fs: fix page_mkwrite error cases in core code and btrfs page_mkwrite is called with neither the page lock nor the ptl held. This means a page can be concurrently truncated or invalidated out from underneath it. Callers are supposed to prevent truncate races themselves, however previously the only thing they can do in case they hit one is to raise a SIGBUS. A sigbus is wrong for the case that the page has been invalidated or truncated within i_size (eg. hole punched). Callers may also have to perform memory allocations in this path, where again, SIGBUS would be wrong. The previous patch ("mm: page_mkwrite change prototype to match fault") made it possible to properly specify errors. Convert the generic buffer.c code and btrfs to return sane error values (in the case of page removed from pagecache, VM_FAULT_NOPAGE will cause the fault handler to exit without doing anything, and the fault will be retried properly). This fixes core code, and converts btrfs as a template/example. All other filesystems defining their own page_mkwrite should be fixed in a similar manner. Acked-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:14 -07:00
Nick Piggin	c2ec175c39	mm: page_mkwrite change prototype to match fault Change the page_mkwrite prototype to take a struct vm_fault, and return VM_FAULT_xxx flags. There should be no functional change. This makes it possible to return much more detailed error information to the VM (and also can provide more information eg. virtual_address to the driver, which might be important in some special cases). This is required for a subsequent fix. And will also make it easier to merge page_mkwrite() with fault() in future. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Cc: Felix Blyakher <felixb@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:14 -07:00
Al Viro	ce3b0f8d5c	New helper - current_umask() current->fs->umask is what most of fs_struct users are doing. Put that into a helper function. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:26 -04:00
Chris Mason	d57e62b897	Btrfs: try to free metadata pages when we free btree blocks COW means we cycle through blocks fairly quickly, and once we free an extent on disk, it doesn't make much sense to keep the pages around. This commit tries to immediately free the page when we free the extent, which lowers our memory footprint significantly. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-31 14:27:58 -04:00
Chris Mason	5a3f23d515	Btrfs: add extra flushing for renames and truncates Renames and truncates are both common ways to replace old data with new data. The filesystem can make an effort to make sure the new data is on disk before actually replacing the old data. This is especially important for rename, which many applications use as though it were atomic for both the data and the metadata involved. The current btrfs code will happily replace a file that is fully on disk with one that was just created and still has pending IO. If we crash after transaction commit but before the IO is done, we'll end up replacing a good file with a zero length file. The solution used here is to create a list of inodes that need special ordering and force them to disk before the commit is done. This is similar to the ext3 style data=ordering, except it is only done on selected files. Btrfs is able to get away with this because it does not wait on commits very often, even for fsync (which uses a sub-commit). For renames, we order the file when it wasn't already on disk and when it is replacing an existing file. Larger files are sent to filemap_flush right away (before the transaction handle is opened). For truncates, we order if the file goes from non-zero size down to zero size. This is a little different, because at the time of the truncate the file has no dirty bytes to order. But, we flag the inode so that it is added to the ordered list on close (via release method). We also immediately add it to the ordered list of the current transaction so that we can try to flush down any writes the application sneaks in before commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-31 14:27:58 -04:00
Jens Axboe	6933c02e9c	btrfs: get rid of current_is_pdflush() in btrfs_btree_balance_dirty Chris says it's safe to kill. Acked-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-26 11:01:35 +01:00
Chris Mason	1a81af4d1d	Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod btrfs_update_delayed_ref is optimized to add and remove different references in one pass through the delayed ref tree. It is a zero sum on the total number of refs on a given extent. But, the code was recording an extra ref in the head node. This never made it down to the disk but was used when deciding if it was safe to free the extent while dropping snapshots. The fix used here is to make sure the ref_mod count is unchanged on the head ref when btrfs_update_delayed_ref is called. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-25 09:55:11 -04:00
Chris Mason	af4176b49c	Btrfs: optimize fsyncs on old files The fsync log has code to make sure all of the parents of a file are in the log along with the file. It uses a minimal log of the parent directory inodes, just enough to get the parent directory on disk. If the transaction that originally created a file is fully on disk, and the file hasn't been renamed or linked into other directories, we can safely skip the parent directory walk. We know the file is on disk somewhere and we can go ahead and just log that single file. This is more important now because unrelated unlinks in the parent directory might make us force a commit if we try to log the parent. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:52 -04:00
Chris Mason	12fcfd22fe	Btrfs: tree logging unlink/rename fixes The tree logging code allows individual files or directories to be logged without including operations on other files and directories in the FS. It tries to commit the minimal set of changes to disk in order to fsync the single file or directory that was sent to fsync or O_SYNC. The tree logging code was allowing files and directories to be unlinked if they were part of a rename operation where only one directory in the rename was in the fsync log. This patch adds a few new rules to the tree logging. 1) on rename or unlink, if the inode being unlinked isn't in the fsync log, we must force a full commit before doing an fsync of the directory where the unlink was done. The commit isn't done during the unlink, but it is forced the next time we try to log the parent directory. Solution: record transid of last unlink/rename per directory when the directory wasn't already logged. For renames this is only done when renaming to a different directory. mkdir foo/some_dir normal commit rename foo/some_dir foo2/some_dir mkdir foo/some_dir fsync foo/some_dir/some_file The fsync above will unlink the original some_dir without recording it in its new location (foo2). After a crash, some_dir will be gone unless the fsync of some_file forces a full commit 2) we must log any new names for any file or dir that is in the fsync log. This way we make sure not to lose files that are unlinked during the same transaction. 2a) we must log any new names for any file or dir during rename when the directory they are being removed from was logged. 2a is actually the more important variant. Without the extra logging a crash might unlink the old name without recreating the new one 3) after a crash, we must go through any directories with a link count of zero and redo the rm -rf mkdir f1/foo normal commit rm -rf f1/foo fsync(f1) The directory f1 was fully removed from the FS, but fsync was never called on f1, only its parent dir. After a crash the rm -rf must be replayed. This must be able to recurse down the entire directory tree. The inode link count fixup code takes care of the ugly details. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:52 -04:00
Chris Mason	a74ac32207	Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay During log replay, inodes are copied from the log to the main filesystem btrees. Sometimes they have a zero link count in the log but they actually gain links during the replay or have some in the main btree. This patch updates the link count to be at least one after copying the inode out of the log. This makes sure the inode is not deleted during an iput while the rest of the replay code is still working on it. The log replay has fixup code to make sure that link counts are correct at the end of the replay, so we could use any non-zero number here and it would work fine. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:51 -04:00
Chris Mason	a4b6e07d1a	Btrfs: limit balancing work while flushing delayed refs The delayed reference mechanism is responsible for all updates to the extent allocation trees, including those updates created while processing the delayed references. This commit tries to limit the amount of work that gets created during the final run of delayed refs before a commit. It avoids cowing new blocks unless it is required to finish the commit, and so it avoids new allocations that were not really required. The goal is to avoid infinite loops where we are always making more work on the final run of delayed refs. Over the long term we'll make a special log for the last delayed ref updates as well. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:51 -04:00
Chris Mason	5d13a98f3b	Btrfs: readahead checksums during btrfs_finish_ordered_io This reads in blocks in the checksum btree before starting the transaction in btrfs_finish_ordered_io. It makes it much more likely we'll be able to do operations inside the transaction without needing any btree reads, which limits transaction latencies overall. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:51 -04:00
Chris Mason	b9473439d3	Btrfs: leave btree locks spinning more often btrfs_mark_buffer_dirty would set dirty bits in the extent_io tree for the buffers it was dirtying. This may require a kmalloc and it was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to set any btree locks they were holding to blocking first. This commit changes dirty tracking for extent buffers to just use a flag in the extent buffer. Now that we have one and only one extent buffer per page, this can be safely done without losing dirty bits along the way. This also introduces a path->leave_spinning flag that callers of btrfs_search_slot can use to indicate they will properly deal with a path returned where all the locks are spinning instead of blocking. Many of the btree search callers now expect spinning paths, resulting in better btree concurrency overall. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:28 -04:00
Chris Mason	89573b9c51	Btrfs: Only let very young transactions grow during commit Commits are fairly expensive, and so btrfs has code to sit around for a while during the commit and let new writers come in. But, while we're sitting there, new delayed refs might be added, and those can be expensive to process as well. Unless the transaction is very very young, it makes sense to go ahead and let the commit finish without hanging around. The commit grow loop isn't as important as it used to be, the fsync logging code handles most performance critical syncs now. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:28 -04:00
Chris Mason	66d7e85ea7	Btrfs: Check for a blocking lock before taking the spin This reduces contention on the extent buffer spin locks by testing for a blocking lock before trying to take the spinlock. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:27 -04:00
Chris Mason	7f366cfecf	Btrfs: reduce stack in cow_file_range The fs/btrfs/inode.c code to run delayed allocation during writeout needed some stack usage optimization. This is the first pass, it does the check for compression earlier on, which allows us to do the common (no compression) case higher up in the call chain. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:27 -04:00
Chris Mason	b7ec40d784	Btrfs: reduce stalls during transaction commit To avoid deadlocks and reduce latencies during some critical operations, some transaction writers are allowed to jump into the running transaction and make it run a little longer, while others sit around and wait for the commit to finish. This is a bit unfair, especially when the callers that jump in do a bunch of IO that makes all the other procs on the box wait. This commit reduces the stalls this produces by pre-reading file extent pointers during btrfs_finish_ordered_io before the transaction is joined. It also tunes the drop_snapshot code to politely wait for transactions that have started writing out their delayed refs to finish. This avoids new delayed refs being flooded into the queue while we're trying to close off the transaction. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:26 -04:00
Chris Mason	c3e69d58e8	Btrfs: process the delayed reference queue in clusters The delayed reference queue maintains pending operations that need to be done to the extent allocation tree. These are processed by finding records in the tree that are not currently being processed one at a time. This is slow because it uses lots of time searching through the rbtree and because it creates lock contention on the extent allocation tree when lots of different procs are running delayed refs at the same time. This commit changes things to grab a cluster of refs for processing, using a cursor into the rbtree as the starting point of the next search. This way we walk smoothly through the rbtree. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:26 -04:00
Chris Mason	1887be66dc	Btrfs: try to cleanup delayed refs while freeing extents When extents are freed, it is likely that we've removed the last delayed reference update for the extent. This checks the delayed ref tree when things are freed, and if no ref updates are left it immediately processes the delayed ref. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:26 -04:00
Chris Mason	44871b1b24	Btrfs: reduce stack usage in some crucial tree balancing functions Many of the tree balancing functions follow the same pattern. 1) cow a block 2) do something to the result This commit breaks them up into two functions so the variables and code required for part two don't suck down stack during part one. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:25 -04:00
Chris Mason	56bec294de	Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:25 -04:00
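
Commit 97e728d435 in the log above describes metadata_ratio as the number of data block groups that have to be allocated to force a metadata chunk allocation. The following is a rough model of that counting rule only, not the kernel code: the function names are hypothetical, and it assumes (unlike a real filesystem) that every chunk is the same size and that metadata chunks are only ever allocated by the forcing rule.

```python
def simulate_chunk_allocation(data_requests, metadata_ratio=8):
    """Model the forcing rule: after every `metadata_ratio` data chunk
    allocations, one metadata chunk is allocated.

    Hypothetical helper for illustration; returns (data_chunks, metadata_chunks).
    """
    data_chunks = 0
    metadata_chunks = 0
    since_last_metadata = 0  # data chunks allocated since the last forced metadata chunk
    for _ in range(data_requests):
        data_chunks += 1
        since_last_metadata += 1
        if since_last_metadata == metadata_ratio:
            metadata_chunks += 1
            since_last_metadata = 0
    return data_chunks, metadata_chunks


def metadata_fraction(data_chunks, metadata_chunks):
    """Fraction of all allocated chunks that are metadata (equal-size assumption)."""
    total = data_chunks + metadata_chunks
    return metadata_chunks / total if total else 0.0


if __name__ == "__main__":
    data, meta = simulate_chunk_allocation(80)
    print(data, meta, round(metadata_fraction(data, meta), 3))
```

Under these assumptions the default ratio of 8 gives one metadata chunk per eight data chunks, which is 1/9 (about 11%) of all chunks or 1/8 (12.5%) relative to data, in line with the "about 12%" figure in the commit message.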

966 Commits