If we are extremely fragmented then we won't be able to create a free_cluster.
So if this happens, set last_ptr->fragmented so that all future allocations will
give up trying to create a cluster. When we unpin extents we will unset
->fragmented if we free up a sufficient amount of space in a block group.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We try really really hard to make allocations, but sometimes it is just not
going to happen, especially when free space is extremely fragmented. So add a
few shortcuts through the looping states. For example, if we couldn't allocate
a chunk, just go straight to the NO_EMPTY_SIZE loop. If there are no uncached
block groups and we've done a full search, go straight to the ALLOC_CHUNK stage.
And finally if we already have empty_size and empty_cluster set to 0 go ahead
and return -ENOSPC. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we hit ENOSPC when setting up a space cache, don't bother setting up any of
the other space caches in this transaction; it'll just induce unnecessary
latency. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we are heavily fragmented we can induce a lot of latency trying to make an
allocation happen that is simply not going to happen. Thankfully we keep track
of our max_extent_size when going through the allocator, so if we get to the
point where we are exiting find_free_extent with ENOSPC then set our
space_info->max_extent_size so we can keep future allocations from having to pay
this cost. We reset the max_extent_size whenever we release pinned bytes back
into this space info so we can redo all the work. Thanks,
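In rough pseudo-kernel-code form, the mechanism looks like the sketch below
(illustrative only, not the exact hunks):

  /* when find_free_extent() is about to return -ENOSPC */
  if (ret == -ENOSPC)
          space_info->max_extent_size = max_extent_size;

  /* on later allocation attempts, fail fast */
  if (space_info->max_extent_size &&
      num_bytes > space_info->max_extent_size)
          return -ENOSPC;

  /* when pinned bytes are released back into this space_info */
  space_info->max_extent_size = 0;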
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The space cache needs to have contiguous allocations, and the allocator tries to
make allocations by reducing the amount of bytes requested and re-searching.
But this just makes us waste time when we are very fragmented, so if we can't
find our space, just exit; don't bother trying to search again. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
I want to set some per-transaction flags, so instead of adding yet another int
let's just convert the current two int indicators to flags and add a flags field
for future use. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we are heavily fragmented we will continually try to prealloc the largest
extent size we can every time we call btrfs_reserve_extent. This can be very
expensive when we are heavily fragmented, burning lots of CPU cycles and loops
through the allocator. So instead notice when we get a smaller chunk from the
allocator than what we specified and use this as the new maximum size we try to
allocate. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
In tracking down these weird bitmap problems it was helpful to artificially
create an extremely fragmented file system. These mount options let us either
fragment data or metadata or both. With these options I could reproduce all
sorts of weird latencies and hangs that occur under extreme fragmentation and
get them fixed. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
With my changes to allow us to find old roots when resolving indirect refs I
introduced a regression to the sanity tests. Since we don't really care to go
down into the fs roots we just need to have the old behavior of returning ENOENT
for dummy roots for the sanity tests. In the future if we want to get fancy we
can populate the test fs trees with the references as well. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We have a mechanism to make sure we don't lose updates for ordered extents that
were logged in the transaction that is currently running. We add the ordered
extent to a transaction list and then the transaction waits on all the ordered
extents in that list. However, on substantially large file systems this list
can be extremely large, and can give us soft lockups, since the ordered extents
don't remove themselves from the list when they do complete.
To fix this we simply add a counter to the transaction that is incremented any
time we have a logged extent that needs to be completed in the current
transaction. Then when the ordered extent finally completes it decrements the
per transaction counter and wakes up the transaction if we are the last ones.
This will eliminate the softlockup. Thanks,
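Sketched out, the scheme is roughly the following (the field names here are
illustrative, not necessarily the ones used in the patch):

  /* when an extent is logged in the currently running transaction */
  atomic_inc(&trans->pending_ordered);

  /* when the ordered extent completes */
  if (atomic_dec_and_test(&trans->pending_ordered))
          wake_up(&trans->pending_wait);

  /* at commit time, instead of walking a potentially huge list */
  wait_event(trans->pending_wait,
             atomic_read(&trans->pending_ordered) == 0);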
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add check at btrfs_destroy_inode() time to detect qgroup reserved space
leak.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
In clear_bit_hook, qgroup reserved data is already handled quite well,
either released by finish_ordered_io or by invalidatepage.
So calling btrfs_qgroup_free_data() here is completely meaningless, and
since btrfs_qgroup_free_data() will lock the io_tree, it can't be called
with the io_tree lock held.
This patch will add a new function
btrfs_free_reserved_data_space_noquota() for clear_bit_hook() to silence
the lockdep warning.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Now fallocate will do an accurate qgroup reserve space check, unlike the
old method, which always reserved the whole length of the range.
With this patch, fallocate will:
1) Iterate the desired range and mark it in the data rsv map
Only ranges which are going to be allocated will be recorded in the
data rsv map and have their space reserved.
Already allocated ranges (normal/prealloc extents) will be skipped.
Also, record the marked ranges into a new list for later use.
2) If 1) succeeded, do the real file extent allocation
At file extent allocation time, the corresponding range will be
removed from the data rsv map.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Now each qgroup reserve for data will have its own ftrace event for better
debugging.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
For btrfs_invalidatepage() and its variant evict_inode_truncate_page(),
there will be pages that never reach disk.
In that case, their reserved space won't be released or freed by
finish_ordered_io() or the delayed_ref handler.
So we must free their qgroup reserved space, or we will be leaking
reserved space again.
So this patch will call btrfs_qgroup_free_data() for
invalidatepage() and its variant evict_inode_truncate_page().
And due to the nature of the new btrfs_qgroup_reserve/free_data(),
reserved space will only be reserved or freed once, so for pages which
have already been flushed to disk, their reserved space will be released
and freed by the delayed_ref handler.
Double free won't be a problem.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
For the NOCOW and inline cases, there will be no delayed_ref created for
them, so we should free their reserved data space at the proper
time (finish_ordered_io for NOCOW and cow_file_inline for inline).
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Clean up the old facilities which use the old btrfs_qgroup_reserve()
call, replace them with the newer version, and remove the "__" prefix
from them.
Also, make the btrfs_qgroup_reserve/free() functions private, as they are
now only used inside the qgroup code.
Now, the whole btrfs qgroup is switched to use the new reserve facilities.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Use new __btrfs_delalloc_reserve_space() and
__btrfs_delalloc_release_space() to reserve and release space for
delalloc.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add new version of btrfs_delalloc_reserve_space() and
btrfs_delalloc_release_space() functions, which support accurate qgroup
reserve.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Use the new reserve/free for buffered write and the inode cache.
For the buffered write case, a nodatacow write won't increase the quota
account, so unlike the old behavior, which reserved before checking nocow,
we now check nocow first and only reserve data if we can't do a nocow write.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add new functions __btrfs_check_data_free_space() and
__btrfs_free_reserved_data_space() to work with new accurate qgroup
reserved space framework.
The new functions will replace the old btrfs_check_data_free_space() and
btrfs_free_reserved_data_space() respectively, but until all the changes
are done, let's just use the new names.
Also, export the internal-use function btrfs_alloc_data_chunk_ondemand():
qgroup reserve now requires precise byte counts, and some operations can't
get the accurate number in advance (like fallocate).
But the data space info check and data chunk allocation don't need to be
that accurate, and can be called at the beginning.
So export it for later operations.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
As we have the new metadata reservation functions, use them to replace
the old btrfs_qgroup_reserve() call for metadata.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Qgroup reserved space needs to be released from the inode dirty map and
freed at different times:
1) Release when the metadata is written into the tree
After the corresponding metadata is written into the tree, any newer write
will be COWed (this doesn't include the NOCOW case yet).
So we must release its range from the inode dirty range map, or we will
forget to reserve the needed range, causing the accounting to exceed the
limit.
2) Free reserved bytes when delayed refs are run
When delayed refs are run, qgroup accounting will follow soon and turn
the reserved bytes into rfer/excl numbers.
As run_delayed_refs and qgroup accounting are all done at
commit_transaction() time, we are safe to free reserved space at
run_delayed_refs() time.
With this timing for releasing/freeing reserved space, we should be able
to resolve the long-standing qgroup reserved space leak problem.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add a new function, btrfs_add_delayed_qgroup_reserve(), to record
how much space is reserved for a given extent.
As btrfs only accounts qgroups at run_delayed_refs() time, a newly
allocated extent should keep its reserved space until then.
So add the needed function with related members to do it.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Introduce functions btrfs_qgroup_release/free_data() to release/free
reserved data range.
Release means just removing the data range from the io_tree, without
freeing the reserved space.
This is for the normal buffered write case: when data is written to disk
and its metadata is added into the tree, its reserved space should still
be kept until commit_trans().
So in that case, we only release the dirty range, but keep the reserved
space recorded elsewhere until commit_trans().
Free means not only removing the data range, but also freeing the reserved
space.
This is used for the cleanup and invalidatepage cases.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Introduce a new function, btrfs_qgroup_reserve_data(), which will use
the io_tree to do accurate qgroup reservation and avoid leaking reserved space.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Introduce a new function, clear_record_extent_bits(), which will clear bits
for a given range and record the details about which ranges were cleared
and how many bytes in total changed.
This provides the basis for the later qgroup reserve code.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Introduce a new function, set_record_extent_bits(), which will not only set
the given bits, but also record how many bytes changed and detailed
range info.
This is quite important for the later qgroup reserve framework.
The number of bytes will be used to do the qgroup reserve, and the detailed
range info will be used for cleanup in the EDQUOT case.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add a new structure, extent_changeset, to record how many bytes are
changed in one set/clear_extent_bits() operation, with detailed changed
range info.
This provides the needed facilities for the later qgroup reserve framework.
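As a rough sketch, the structure is along these lines (field names are an
assumption based on the description above):

  struct extent_changeset {
          u64 bytes_changed;            /* total bytes whose bits changed */
          struct ulist *range_changed;  /* the exact ranges that changed */
  };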
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
reada is using -1 instead of the -ENOMEM defined macro to specify that
a buffer allocation failed. Since the error number is propagated, the
caller will get -EPERM, which is the wrong error condition.
Also, update the caller to return the exact value from
reada_add_block.
Smatch tool warning:
reada_add_block() warn: returning -1 instead of -ENOMEM is sloppy
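The pattern of the fix, sketched (not the literal reada hunk):

  re = kzalloc(sizeof(*re), GFP_NOFS);
  if (!re)
          return -ENOMEM;   /* was: return -1, which callers saw as -EPERM */

  ...

  ret = reada_add_block(...);
  if (ret)
          return ret;       /* propagate the exact error instead of translating it */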
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David Sterba <dsterba@suse.com>
check-integrity is using -1 instead of the -ENOMEM defined macro to
specify that a buffer allocation failed. Since the error number is
propagated, the caller will get -EPERM, which is the wrong error
condition.
Also, the smatch tool complains with the following warnings:
btrfsic_process_superblock() warn: returning -1 instead of -ENOMEM is sloppy
btrfsic_read_block() warn: returning -1 instead of -ENOMEM is sloppy
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variables below are defined per compression type.
- struct list_head comp_idle_workspace[BTRFS_COMPRESS_TYPES]
- spinlock_t comp_workspace_lock[BTRFS_COMPRESS_TYPES]
- int comp_num_workspace[BTRFS_COMPRESS_TYPES]
- atomic_t comp_alloc_workspace[BTRFS_COMPRESS_TYPES]
- wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES]
While accessing these variables for one compression type, the adjacent
memory holds the same variable for the other compression types.
So this patch puts these variables into a struct to make them cache friendly.
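Schematically, the grouping looks like this (struct and field names here
are illustrative, not necessarily the ones in the final patch):

  struct btrfs_comp_ws {
          struct list_head idle_ws;     /* was comp_idle_workspace[type] */
          spinlock_t ws_lock;           /* was comp_workspace_lock[type] */
          int num_ws;                   /* was comp_num_workspace[type] */
          atomic_t alloc_ws;            /* was comp_alloc_workspace[type] */
          wait_queue_head_t ws_wait;    /* was comp_workspace_wait[type] */
  };

  static struct btrfs_comp_ws comp_ws[BTRFS_COMPRESS_TYPES];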
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch eliminates the last item of the prop_handlers array, which was
used to detect the end of the array, and instead uses the ARRAY_SIZE macro.
Though this is a very tiny optimization, using the ARRAY_SIZE macro is
good practice for iterating over an array.
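A small sketch of the change (the handler contents are illustrative):

  static const struct prop_handler prop_handlers[] = {
          { .xattr_name = "btrfs.compression", /* ... */ },
          /* no empty terminating entry needed anymore */
  };

  for (i = 0; i < ARRAY_SIZE(prop_handlers); i++)
          handler = &prop_handlers[i];  /* was: loop until .xattr_name == NULL */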
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just fix a typo in the code comment.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rsv_count ultimately gets passed to start_transaction(), which
now takes an unsigned int as its num_items parameter.
The value of rsv_count should always be positive, so declare it
as unsigned.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of num_items that start_transaction() ultimately
takes is always a small one, so a 64-bit integer is overkill.
Also change num_items for btrfs_start_transaction() and
btrfs_start_transaction_lflush() as well.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Improve readability by generalizing the profile validity checks.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The commit b37392ea86 ("Btrfs: cleanup unnecessary parameter
and variant of prepare_pages()") makes it redundant.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Shan Hai <haishan.bai@hotmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_raid_array[] holds attributes of all raid types.
Using btrfs_raid_array[].devs_min is the best way to get the required
minimum number of devices in btrfs_reduce_alloc_profile(), instead of
using a complex condition for each raid type.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_raid_array[] is used to define all raid attributes; use it
to get tolerated_failures in btrfs_get_num_tolerated_disk_barrier_failures(),
instead of the complex conditions in that function.
This makes the code simpler and automatically supports any other raid
types added in the future.
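Illustratively, the lookup becomes something like this (the profile check
below is a hypothetical stand-in helper, not the real function):

  int tolerated = INT_MAX;
  int i;

  for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
          if (!profile_has_raid_type(flags, i))   /* hypothetical helper */
                  continue;
          tolerated = min(tolerated,
                          btrfs_raid_array[i].tolerated_failures);
  }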
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This array is used to record the attributes of each raid type; make it
public, and many functions will benefit from it.
For example, in num_tolerated_disk_barrier_failures() we can
avoid complex conditions and get a raid attribute
simply by accessing the array.
It also makes the code logic simpler, reduces duplicated code, and
increases maintainability.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rather than have three separate if() statements for the same outcome
we should just OR them together in the same if() statement.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use memset() to null out the btrfs_delayed_ref_root of
btrfs_transaction instead of setting all the members to 0 by hand.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can safely iterate over all the list items without using the list_del
macro, so remove the list_del call.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
No list element is removed while iterating over the list, so replace
list_for_each_entry_safe with list_for_each_entry.
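A minimal sketch of the substitution (struct and helper names are made up
for illustration):

  /* before: the extra cursor is only needed when entries are deleted */
  list_for_each_entry_safe(cur, tmp, &head, list)
          handle(cur);

  /* after: nothing is removed inside the loop */
  list_for_each_entry(cur, &head, list)
          handle(cur);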
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just call kmem_cache_zalloc() instead of calling kmem_cache_alloc().
We're just initializing most fields to 0, false and NULL later on
_anyway_, so to make the code more readable and potentially gain
a bit of performance (completely untested claim), we should fill our
btrfs_trans_handle with zeros on allocation and then just initialize
those five remaining fields (not counting the list_heads) as normal.
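A sketch of what this looks like (the fields shown are examples, not an
exact or exhaustive list):

  h = kmem_cache_zalloc(btrfs_trans_handle_cachep, GFP_NOFS);
  if (!h)
          return ERR_PTR(-ENOMEM);

  /* everything else is already 0 / false / NULL */
  h->transid = cur_trans->transid;
  h->transaction = cur_trans;
  h->root = root;
  h->type = type;
  INIT_LIST_HEAD(&h->qgroup_ref_list);
  INIT_LIST_HEAD(&h->new_bgs);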
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
old_len is used to store the return value of btrfs_item_size_nr().
The return value of btrfs_item_size_nr() is of type u32.
To improve code correctness and avoid mixing signed and unsigned
integers I've changed old_len to be of type u32 as well.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The return values of btrfs_item_offset_nr and btrfs_item_size_nr are of
type u32. To avoid mixing signed and unsigned integers we should also
declare dsize and last_off to be of type u32.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When truncating a file to a smaller size which consists of an inline
extent that is compressed, we did not discard (or made unusable) the
data between the new file size and the old file size, wasting metadata
space and allowing for the truncated data to be leaked and the data
corruption/loss mentioned below.
We were also not correctly decrementing the number of bytes used by the
inode, we were setting it to zero, giving a wrong report for callers of
the stat(2) syscall. The fsck tool also reported an error about a mismatch
between the nbytes of the file versus the real space used by the file.
Now because we weren't discarding the truncated region of the file, it
was possible for a caller of the clone ioctl to actually read the data
that was truncated, allowing for a security breach without requiring root
access to the system, using only standard filesystem operations. The
scenario is the following:
1) User A creates a file which consists of an inline and compressed
extent with a size of 2000 bytes - the file is not accessible to
any other users (no read, write or execution permission for anyone
else);
2) The user truncates the file to a size of 1000 bytes;
3) User A makes the file world readable;
4) User B creates a file consisting of an inline extent of 2000 bytes;
5) User B issues a clone operation from user A's file into its own
file (using a length argument of 0, clone the whole range);
6) User B now gets to see the 1000 bytes that user A truncated from
its file before it made its file world readable. User B also lost
the bytes in the range [1000, 2000[ bytes from its own file, but
that might be ok if his/her intention was reading stale data from
user A that was never supposed to be public.
Note that this contrasts with the case where we truncate a file from 2000
bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In
this case reading any byte from the range [1000, 2000[ will return a value
of 0x00, instead of the original data.
This problem exists since the clone ioctl was added and happens both with
and without my recent data loss and file corruption fixes for the clone
ioctl (patch "Btrfs: fix file corruption and data loss after cloning
inline extents").
So fix this by truncating the compressed inline extents as we do for the
non-compressed case, which involves decompressing, if the data isn't already
in the page cache, compressing the truncated version of the extent, writing
the compressed content into the inline extent and then truncating it.
The following test case for fstests reproduces the problem. In order for
the test to pass both this fix and my previous fix for the clone ioctl
that forbids cloning a smaller inline extent into a larger one,
which is titled "Btrfs: fix file corruption and data loss after cloning
inline extents", are needed. Without that other fix the test fails in a
different way that does not leak the truncated data, instead part of
destination file gets replaced with zeroes (because the destination file
has a larger inline extent than the source).
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our test files. File foo is going to be the source of a clone operation
# and consists of a single inline extent with an uncompressed size of 512 bytes,
# while file bar consists of a single inline extent with an uncompressed size of
# 256 bytes. For our test's purpose, it's important that file bar has an inline
# extent with a size smaller than foo's inline extent.
$XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128" \
-c "pwrite -S 0x2a 128 384" \
$SCRATCH_MNT/foo | _filter_xfs_io
$XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io
# Now durably persist all metadata and data. We do this to make sure that we get
# on disk an inline extent with a size of 512 bytes for file foo.
sync
# Now truncate our file foo to a smaller size. Because it consists of a
# compressed and inline extent, btrfs did not shrink the inline extent to the
# new size (if the extent was not compressed, btrfs would shrink it to 128
# bytes), it only updates the inode's i_size to 128 bytes.
$XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo
# Now clone foo's inline extent into bar.
# This clone operation should fail with errno EOPNOTSUPP because the source
# file consists only of an inline extent and the file's size is smaller than
# the inline extent of the destination (128 bytes < 256 bytes). However the
# clone ioctl was not prepared to deal with a file that has a size smaller
# than the size of its inline extent (something that happens only for compressed
# inline extents), resulting in copying the full inline extent from the source
# file into the destination file.
#
# Note that btrfs' clone operation for inline extents consists of removing the
# inline extent from the destination inode and copy the inline extent from the
# source inode into the destination inode, meaning that if the destination
# inode's inline extent is larger (N bytes) than the source inode's inline
# extent (M bytes), some bytes (N - M bytes) will be lost from the destination
# file. Btrfs could copy the source inline extent's data into the destination's
# inline extent so that we would not lose any data, but that's currently not
# done due to the complexity that would be needed to deal with such cases
# (specially when one or both extents are compressed), returning EOPNOTSUPP, as
# it's normally not a very common case to clone very small files (only case
# where we get inline extents) and copying inline extents does not save any
# space (unlike for normal, non-inlined extents).
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now because the above clone operation used to succeed, and due to foo's inline
# extent not being shrunk by the truncate operation, our file bar got the whole
# inline extent copied from foo, making us lose the last 128 bytes from bar
# which got replaced by the bytes in range [128, 256[ from foo before foo was
# truncated - in other words, data loss from bar and being able to read old and
# stale data from foo that should not be possible to read anymore through normal
# filesystem operations. Contrast with the case where we truncate a file from a
# size N to a smaller size M, truncate it back to size N and then read the range
# [M, N[, we should always get the value 0x00 for all the bytes in that range.
# We expected the clone operation to fail with errno EOPNOTSUPP and therefore
# not modify our file's bar data/metadata. So its content should be 256 bytes
# long with all bytes having the value 0xbb.
#
# Without the btrfs bug fix, the clone operation succeeded and resulted in
# leaking truncated data from foo, the bytes that belonged to its range
# [128, 256[, and losing data from bar in that same range. So reading the
# file gave us the following content:
#
# 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
# *
# 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
# *
# 0000400
echo "File bar's content after the clone operation:"
od -t x1 $SCRATCH_MNT/bar
# Also because the foo's inline extent was not shrunk by the truncate
# operation, btrfs' fsck, which is run by the fstests framework every time a
# test completes, failed reporting the following error:
#
# root 5 inode 257 errors 400, nbytes wrong
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
If when reading a page we find a hole and our caller had already locked
the range (bio flags has the bit EXTENT_BIO_PARENT_LOCKED set), we end
up unlocking the hole's range and then later our caller unlocks it
again, which might have already been locked by some other task once
the first unlock happened.
Currently this can only happen during a call to the extent_same ioctl,
as it's the only caller of __do_readpage() that sets the bit
EXTENT_BIO_PARENT_LOCKED for bio flags.
Fix this by leaving the unlock exclusively to the caller.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Currently the clone ioctl allows to clone an inline extent from one file
to another that already has other (non-inlined) extents. This is a problem
because btrfs is not designed to deal with files having inline and regular
extents, if a file has an inline extent then it must be the only extent
in the file and must start at file offset 0. Having a file with an inline
extent followed by regular extents results in EIO errors when doing reads
or writes against the first 4K of the file.
Also, the clone ioctl allows one to lose data if the source file consists
of a single inline extent, with a size of N bytes, and the destination
file consists of a single inline extent with a size of M bytes, where we
have M > N. In this case the clone operation removes the inline extent
from the destination file and then copies the inline extent from the
source file into the destination file - we lose the M - N bytes from the
destination file, a read operation will get the value 0x00 for any bytes
in the range [N, M] (the destination inode's i_size remained as M,
that's why we can read past N bytes).
So fix this by not allowing such destructive operations to happen and
return errno EOPNOTSUPP to user space.
Currently the fstest btrfs/035 tests the data loss case but it totally
ignores this - i.e. expects the operation to succeed and does not check
that we got data loss.
The following test case for fstests exercises all these cases that result
in file corruption and data loss:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
_require_btrfs_fs_feature "no_holes"
_require_btrfs_mkfs_feature "no-holes"
rm -f $seqres.full
test_cloning_inline_extents()
{
local mkfs_opts=$1
local mount_opts=$2
_scratch_mkfs $mkfs_opts >>$seqres.full 2>&1
_scratch_mount $mount_opts
# File bar, the source for all the following clone operations, consists
# of a single inline extent (50 bytes).
$XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \
| _filter_xfs_io
# Test cloning into a file with an extent (non-inlined) where the
# destination offset overlaps that extent. It should not be possible to
# clone the inline extent from file bar into this file.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo
# Doing IO against any range in the first 4K of the file should work.
# Due to a past clone ioctl bug which allowed cloning the inline extent,
# these operations resulted in EIO errors.
echo "File foo data after clone operation:"
# All bytes should have the value 0xaa (clone operation failed and did
# not modify our file).
od -t x1 $SCRATCH_MNT/foo
$XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io
# Test cloning the inline extent against a file which has a hole in its
# first 4K followed by a non-inlined extent. It should not be possible
# as well to clone the inline extent from file bar into this file.
$XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2
# Doing IO against any range in the first 4K of the file should work.
# Due to a past clone ioctl bug which allowed cloning the inline extent,
# these operations resulted in EIO errors.
echo "File foo2 data after clone operation:"
# All bytes should have the value 0x00 (clone operation failed and did
# not modify our file).
od -t x1 $SCRATCH_MNT/foo2
$XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io
# Test cloning the inline extent against a file which has a size of zero
# but has a prealloc extent. It should not be possible as well to clone
# the inline extent from file bar into this file.
$XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3
# Doing IO against any range in the first 4K of the file should work.
# Due to a past clone ioctl bug which allowed cloning the inline extent,
# these operations resulted in EIO errors.
echo "First 50 bytes of foo3 after clone operation:"
# Should not be able to read any bytes, file has 0 bytes i_size (the
# clone operation failed and did not modify our file).
od -t x1 $SCRATCH_MNT/foo3
$XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io
# Test cloning the inline extent against a file which consists of a
# single inline extent that has a size not greater than the size of
# bar's inline extent (40 < 50).
# It should be possible to do the extent cloning from bar to this file.
$XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4
# Doing IO against any range in the first 4K of the file should work.
echo "File foo4 data after clone operation:"
# Must match file bar's content.
od -t x1 $SCRATCH_MNT/foo4
$XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io
# Test cloning the inline extent against a file which consists of a
# single inline extent that has a size greater than the size of bar's
# inline extent (60 > 50).
# It should not be possible to clone the inline extent from file bar
# into this file.
$XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5
# Reading the file should not fail.
echo "File foo5 data after clone operation:"
# Must have a size of 60 bytes, with all bytes having a value of 0x03
# (the clone operation failed and did not modify our file).
od -t x1 $SCRATCH_MNT/foo5
# Test cloning the inline extent against a file which has no extents but
# has a size greater than bar's inline extent (16K > 50).
# It should not be possible to clone the inline extent from file bar
# into this file.
$XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6
# Reading the file should not fail.
echo "File foo6 data after clone operation:"
# Must have a size of 16K, with all bytes having a value of 0x00 (the
# clone operation failed and did not modify our file).
od -t x1 $SCRATCH_MNT/foo6
# Test cloning the inline extent against a file which has no extents but
# has a size not greater than bar's inline extent (30 < 50).
# It should be possible to clone the inline extent from file bar into
# this file.
$XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7
# Reading the file should not fail.
echo "File foo7 data after clone operation:"
# Must have a size of 50 bytes, with all bytes having a value of 0xbb.
od -t x1 $SCRATCH_MNT/foo7
# Test cloning the inline extent against a file which has a size not
# greater than the size of bar's inline extent (20 < 50) but has
# a prealloc extent that goes beyond the file's size. It should not be
# possible to clone the inline extent from bar into this file.
$XFS_IO_PROG -f -c "falloc -k 0 1M" \
-c "pwrite -S 0x88 0 20" \
$SCRATCH_MNT/foo8 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8
echo "File foo8 data after clone operation:"
# Must have a size of 20 bytes, with all bytes having a value of 0x88
# (the clone operation did not modify our file).
od -t x1 $SCRATCH_MNT/foo8
_scratch_unmount
}
echo -e "\nTesting without compression and without the no-holes feature...\n"
test_cloning_inline_extents
echo -e "\nTesting with compression and without the no-holes feature...\n"
test_cloning_inline_extents "" "-o compress"
echo -e "\nTesting without compression and with the no-holes feature...\n"
test_cloning_inline_extents "-O no-holes" ""
echo -e "\nTesting with compression and with the no-holes feature...\n"
test_cloning_inline_extents "-O no-holes" "-o compress"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
This fixes a regression introduced by 37b8d27d between v4.1 and v4.2.
When a snapshot is received, its received_uuid is set to the original
uuid of the subvolume. When that snapshot is then resent to a third
filesystem, its received_uuid is set to the second uuid
instead of the original one. The same was true for the parent_uuid.
This behaviour was partially changed in 37b8d27d, but in that patch
only the parent_uuid was taken from the real original,
not the uuid itself, causing the search for the parent to fail in
the case below.
This happens for example when trying to send a series of linked
snapshots (e.g. created by snapper) from the backup file system back
to the original one.
The following commands reproduce the issue in v4.2.1
(no error in 4.1.6)
# setup three test file systems
for i in 1 2 3; do
truncate -s 50M fs$i
mkfs.btrfs fs$i
mkdir $i
mount fs$i $i
done
echo "content" > 1/testfile
btrfs su snapshot -r 1/ 1/snap1
echo "changed content" > 1/testfile
btrfs su snapshot -r 1/ 1/snap2
# works fine:
btrfs send 1/snap1 | btrfs receive 2/
btrfs send -p 1/snap1 1/snap2 | btrfs receive 2/
# ERROR: could not find parent subvolume
btrfs send 2/snap1 | btrfs receive 3/
btrfs send -p 2/snap1 2/snap2 | btrfs receive 3/
Signed-off-by: Robin Ruede <rruede+git@gmail.com>
Fixes: 37b8d27de5 ("Btrfs: use received_uuid of parent during send")
Cc: stable@vger.kernel.org # v4.2+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Ed Tomlinson <edt@aei.ca>
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Leandro Awa writes:
"After switching to version 4.1.6, our parallelized and distributed
workflows now fail consistently with errors of the form:
T34: ./regex.c:39:22: error: config.h: No such file or directory
From our 'git bisect' testing, the following commit appears to be the
possible cause of the behavior we've been seeing: commit 766c4cbfacd8"
Al Viro says:
"What happens is that 766c4cbfac got the things subtly wrong.
We used to treat d_is_negative() after lookup_fast() as "fall with
ENOENT". That was wrong - checking ->d_flags outside of ->d_seq
protection is unreliable and failing with hard error on what should've
fallen back to non-RCU pathname resolution is a bug.
Unfortunately, we'd pulled the test too far up and ran afoul of
another kind of staleness. The dentry might have been absolutely
stable from the RCU point of view (and we might be on UP, etc), but
stale from the remote fs point of view. If ->d_revalidate() returns
"it's actually stale", dentry gets thrown away and the original code
wouldn't even have looked at its ->d_flags.
What we need is to check ->d_flags where 766c4cbfac does (prior to
->d_seq validation) but only use the result in cases where we do not
discard this dentry outright"
Reported-by: Leandro Awa <lawa@nvidia.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=104911
Fixes: 766c4cbfac ("namei: d_is_negative() should be checked...")
Tested-by: Leandro Awa <lawa@nvidia.com>
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Removing barriers is scary, but a call to atomic_dec_and_test implies
a barrier, so we don't need to issue another one.
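For illustration (names are generic, not the exact btrfs hunk):

  if (atomic_dec_and_test(&counter)) {
          /* smp_mb();  -- redundant, implied by atomic_dec_and_test() */
          if (waitqueue_active(&wq))
                  wake_up(&wq);
  }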
Signed-off-by: David Sterba <dsterba@suse.com>
waitqueue_active should be preceded by a barrier; in this function we
don't need to call it all the time.
Signed-off-by: David Sterba <dsterba@suse.com>
Normally the waitqueue_active would need a barrier, but this is not
necessary here because it's not a performance sensitive context and we
can call wake_up directly.
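The two variants, sketched for context (not the exact hunk):

  /* hot path: keep the unlocked check, but it needs a barrier before it */
  smp_mb();
  if (waitqueue_active(&wq))
          wake_up(&wq);

  /* cold path: just let wake_up() take the waitqueue lock itself */
  wake_up(&wq);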
Suggested-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from Chris Mason:
"These are small and assorted. Neil's is the oldest, I dropped the
ball thinking he was going to send it in"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: support NFSv2 export
Btrfs: open_ctree: Fix possible memory leak
Btrfs: fix deadlock when finalizing block group creation
Btrfs: update fix for read corruption of compressed and shared extents
Btrfs: send, fix corner case for reference overwrite detection
Convert the simple cases; not all functions provide a way to reach the
fs_info. Also skipped debugging messages (print-tree, integrity
checker and pr_debug) and messages that are printed from possibly
unfinished mount.
Signed-off-by: David Sterba <dsterba@suse.com>
Due to the missing variants there are messages that lack the information
printed by the btrfs_info etc. helpers.
Signed-off-by: David Sterba <dsterba@suse.com>
Highlights include:
Bugfixes:
- Fix a use-after-free bug in the RPC/RDMA client
- Fix a write performance regression
- Fix up page writeback accounting
- Don't try to reclaim unused state owners
- Fix a NFSv4 nograce recovery hang
- reset states to use open_stateid when returning delegation voluntarily
- Fix a tracepoint NULL-pointer dereference
Merge tag 'nfs-for-4.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Bugfixes:
- Fix a use-after-free bug in the RPC/RDMA client
- Fix a write performance regression
- Fix up page writeback accounting
- Don't try to reclaim unused state owners
- Fix a NFSv4 nograce recovery hang
- reset states to use open_stateid when returning delegation
voluntarily
- Fix a tracepoint NULL-pointer dereference"
* tag 'nfs-for-4.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix a tracepoint NULL-pointer dereference
nfs4: reset states to use open_stateid when returning delegation voluntarily
NFSv4: Fix a nograce recovery hang
NFSv4.1: nfs4_opendata_check_deleg needs to handle NFS4_OPEN_CLAIM_DELEG_CUR_FH
NFSv4: Don't try to reclaim unused state owners
NFS: Fix a write performance regression
NFS: Fix up page writeback accounting
xprtrdma: disconnect and flush cqs before freeing buffers
Running xfstest generic/013 with the tracepoint nfs:nfs4_open_file
enabled produces a NULL-pointer dereference when calculating fileid and
filehandle of the opened file. Fix this by checking if state is NULL
before trying to use the inode pointer.
Reported-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The "fh_len" passed to ->fh_to_* is not guaranteed to be that same as
that returned by encode_fh - it may be larger.
With NFSv2, the filehandle is fixed length, so it may appear longer
than expected and be zero-padded.
So we must test that fh_len is at least some value, not exactly equal
to it.
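A minimal sketch of the check being relaxed (the constant name is an
assumption):

  /* before: rejected zero-padded NFSv2 handles */
  if (fh_len != BTRFS_FID_SIZE_NON_CONNECTABLE)
          return NULL;

  /* after: only require a minimum length */
  if (fh_len < BTRFS_FID_SIZE_NON_CONNECTABLE)
          return NULL;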
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Sterba <dsterba@suse.cz>
After reading one of chunk or tree root tree's root node from disk, if the
root node does not have EXTENT_BUFFER_UPTODATE flag set, we fail to release
the memory used by the root node. Fix this.
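Roughly, the fix looks like this (illustrative sketch based on the
description above, not the literal open_ctree hunk):

  root->node = read_tree_block(root, bytenr, generation);
  if (!root->node || !extent_buffer_uptodate(root->node)) {
          free_extent_buffer(root->node);   /* previously leaked */
          root->node = NULL;
          return -EIO;
  }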
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Pull CIFS fixes from Steve French:
"Two fixes for problems pointed out by automated tools.
Thanks PaX/grsecurity team and Dan Carpenter (and the Smatch tool)"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
[CIFS] Update cifs version number
[SMB3] Do not fall back to SMBWriteX in set_file_size error cases
[SMB3] Missing null tcon check
Josef ran into a deadlock while a transaction handle was finalizing the
creation of its block groups, which produced the following trace:
[260445.593112] fio D ffff88022a9df468 0 8924 4518 0x00000084
[260445.593119] ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488
[260445.593126] ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0
[260445.593132] ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00
[260445.593137] Call Trace:
[260445.593145] [<ffffffff8175a437>] schedule+0x37/0x80
[260445.593189] [<ffffffffa0850f37>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
[260445.593197] [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
[260445.593225] [<ffffffffa07eac44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
[260445.593253] [<ffffffffa07eff6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
[260445.593295] [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
[260445.593324] [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
[260445.593351] [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[260445.593394] [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
[260445.593427] [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
[260445.593459] [<ffffffffa0800964>] do_chunk_alloc+0x2a4/0x2e0 [btrfs]
[260445.593491] [<ffffffffa0803815>] find_free_extent+0xa55/0xd90 [btrfs]
[260445.593524] [<ffffffffa0803c22>] btrfs_reserve_extent+0xd2/0x220 [btrfs]
[260445.593532] [<ffffffff8119fe5d>] ? account_page_dirtied+0xdd/0x170
[260445.593564] [<ffffffffa0803e78>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
[260445.593597] [<ffffffffa080c9de>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
[260445.593626] [<ffffffffa07eb5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
[260445.593654] [<ffffffffa07ebbff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
[260445.593682] [<ffffffffa07ef8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
[260445.593724] [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
[260445.593752] [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
[260445.593830] [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[260445.593905] [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
[260445.593946] [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
[260445.593990] [<ffffffffa0815798>] btrfs_commit_transaction+0xa8/0xb40 [btrfs]
[260445.594042] [<ffffffffa085abcd>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
[260445.594089] [<ffffffffa082bc84>] btrfs_sync_file+0x294/0x350 [btrfs]
[260445.594115] [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
[260445.594133] [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
[260445.594149] [<ffffffff8123e35d>] do_fsync+0x3d/0x70
[260445.594169] [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
[260445.594187] [<ffffffff8123e600>] SyS_fsync+0x10/0x20
[260445.594204] [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
This happened because the same transaction handle created a large number
of block groups and while finalizing their creation (inserting new items
and updating existing items in the chunk and device trees) a new metadata
extent had to be allocated and no free space was found in the current
metadata block groups, which made find_free_extent() attempt to allocate
a new block group via do_chunk_alloc(). However at do_chunk_alloc() we
ended up allocating a new system chunk too and exceeded the threshold
of 2Mb of reserved chunk bytes, which makes do_chunk_alloc() enter the
final part of block group creation again (at
btrfs_create_pending_block_groups()) and attempt to lock again the root
of the chunk tree when it's already write locked by the same task.
Similarly we can deadlock on extent tree nodes/leafs if while we are
running delayed references we end up creating a new metadata block group
in order to allocate a new node/leaf for the extent tree (as part of
a CoW operation or growing the tree), as btrfs_create_pending_block_groups
inserts items into the extent tree as well. In this case we get the
following trace:
[14242.773581] fio D ffff880428ca3418 0 3615 3100 0x00000084
[14242.773588] ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438
[14242.773594] ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460
[14242.773600] ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190
[14242.773606] Call Trace:
[14242.773613] [<ffffffff8175a437>] schedule+0x37/0x80
[14242.773656] [<ffffffffa057ff07>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
[14242.773664] [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
[14242.773692] [<ffffffffa0519c44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
[14242.773720] [<ffffffffa051ef6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
[14242.773750] [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
[14242.773758] [<ffffffff811ef4a2>] ? kmem_cache_alloc+0x1d2/0x200
[14242.773786] [<ffffffffa0520ad1>] btrfs_insert_item+0x71/0xf0 [btrfs]
[14242.773818] [<ffffffffa052f292>] btrfs_create_pending_block_groups+0x102/0x200 [btrfs]
[14242.773850] [<ffffffffa052f96e>] do_chunk_alloc+0x2ae/0x2f0 [btrfs]
[14242.773934] [<ffffffffa0532825>] find_free_extent+0xa55/0xd90 [btrfs]
[14242.773998] [<ffffffffa0532c22>] btrfs_reserve_extent+0xc2/0x1d0 [btrfs]
[14242.774041] [<ffffffffa0532e38>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
[14242.774078] [<ffffffffa051a5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
[14242.774118] [<ffffffffa051abff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
[14242.774155] [<ffffffffa051e8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
[14242.774194] [<ffffffffa0528021>] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs]
[14242.774235] [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
[14242.774274] [<ffffffffa051994a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[14242.774318] [<ffffffffa052c433>] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs]
[14242.774358] [<ffffffffa052f404>] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs]
[14242.774391] [<ffffffffa052f627>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
[14242.774432] [<ffffffffa05be236>] commit_cowonly_roots+0x8d/0x2bd [btrfs]
[14242.774474] [<ffffffffa059d07f>] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs]
[14242.774516] [<ffffffffa05adac3>] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs]
[14242.774558] [<ffffffffa0544c40>] btrfs_commit_transaction+0x590/0xb40 [btrfs]
[14242.774599] [<ffffffffa0589b9d>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
[14242.774642] [<ffffffffa055ac54>] btrfs_sync_file+0x294/0x350 [btrfs]
[14242.774650] [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
[14242.774657] [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
[14242.774663] [<ffffffff8123e35d>] do_fsync+0x3d/0x70
[14242.774669] [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
[14242.774675] [<ffffffff8123e600>] SyS_fsync+0x10/0x20
[14242.774681] [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
Fix this by never recursing into the finalization phase of block group
creation and making sure we never trigger the finalization of block group
creation while running delayed references.
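The idea behind the fix can be illustrated with a minimal user-space
sketch (hypothetical names and fields, not the actual btrfs code): a
per-transaction flag marks that finalization is in progress, and the
chunk allocator refuses to re-enter it while that flag, or the delayed
references flag, is set.
/* Minimal sketch of a re-entrancy guard, assuming hypothetical types;
 * this is not the real btrfs code. */
#include <stdbool.h>

struct transaction {
	bool creating_block_groups;  /* finalization in progress */
	bool running_delayed_refs;   /* delayed refs in progress  */
};

static void create_pending_block_groups(struct transaction *trans)
{
	trans->creating_block_groups = true;
	/* insert block group items; may allocate new tree blocks */
	trans->creating_block_groups = false;
}

static void do_chunk_alloc(struct transaction *trans)
{
	/* allocate the chunk itself ... */

	/* Never recurse into finalization and never trigger it while
	 * delayed references are running; it happens later from a safe
	 * context instead. */
	if (trans->creating_block_groups || trans->running_delayed_refs)
		return;

	create_pending_block_groups(trans);
}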
Reported-by: Josef Bacik <jbacik@fb.com>
Fixes: 00d80e342c ("Btrfs: fix quick exhaustion of the system array in the superblock")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
My previous fix in commit 005efedf2c ("Btrfs: fix read corruption of
compressed and shared extents") was effective only if the compressed
extents cover a file range with a length that is not a multiple of 16
pages. That's because the detection of when we reached a different range
of the file that shares the same compressed extent as the previously
processed range was done at extent_io.c:__do_contiguous_readpages(),
which covers subranges with a length up to 16 pages, because
extent_readpages() groups the pages in clusters no larger than 16 pages.
So fix this by tracking the start of the previously processed file
range's extent map at extent_readpages().
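A rough user-space sketch of that tracking (hypothetical names, not the
actual extent_io.c code): the start of the previously processed extent
is carried across the fixed-size page batches instead of being reset at
each batch boundary, so a shared/compressed extent is still recognized
when the batch boundary splits the range.
/* Sketch only, assuming hypothetical types; not the real btrfs code. */
#include <stdint.h>
#include <stddef.h>

#define BATCH	16	/* pages handled per batch */

struct page_info {
	uint64_t extent_start;	/* start of the extent backing this page */
};

static void read_batch(struct page_info *pages, size_t nr,
		       uint64_t *prev_extent_start)
{
	for (size_t i = 0; i < nr; i++) {
		if (pages[i].extent_start == *prev_extent_start) {
			/* same shared/compressed extent as the range we
			 * just processed: must not treat it as new data */
		}
		*prev_extent_start = pages[i].extent_start;
	}
}

static void read_pages(struct page_info *pages, size_t total)
{
	uint64_t prev_extent_start = UINT64_MAX;	/* survives batches */

	for (size_t off = 0; off < total; off += BATCH) {
		size_t nr = total - off < BATCH ? total - off : BATCH;
		read_batch(pages + off, nr, &prev_extent_start);
	}
}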
The following test case for fstests reproduces the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create our test file with a single extent of 64Kb that is going to
# be compressed no matter which compression algo is used (zlib/lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone the compressed extent into an adjacent file offset.
$CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
echo "File digest before unmount:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
# Remount the fs or clear the page cache to trigger the bug in
# btrfs. Because the extent has an uncompressed length that is a
# multiple of 16 pages, all the pages belonging to the second range
# of the file (64K to 128K), which points to the same extent as the
# first range (0K to 64K), had their contents full of zeroes instead
# of the byte 0xaa. This was a bug exclusively in the read path of
# compressed extents, the correct data was stored on disk, btrfs
# just failed to fill in the pages correctly.
_scratch_remount
echo "File digest after remount:"
# Must match the digest we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Timofey Titovets <nefelim4ag@gmail.com>
When the inode given to did_overwrite_ref() matches the current progress
and has a reference that collides with the reference of another inode
that has the same number as the current progress, we were always telling our
caller that the inode's reference was overwritten, which is incorrect
because the other inode might be a new inode (different generation number)
in which case we must return false from did_overwrite_ref() so that its
callers don't use an orphanized path for the inode (as it will never be
orphanized, instead it will be unlinked and the new inode created later).
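A tiny sketch of the decision (hypothetical signature, not the real
send.c code): once a collision with the current progress inode has been
found, it only counts as an overwrite when the colliding inode really is
the same inode, i.e. the generation numbers match.
/* Sketch with hypothetical parameters; not the real did_overwrite_ref(). */
static int ref_was_overwritten(unsigned long long other_gen,
			       unsigned long long cur_gen)
{
	/* The colliding inode has the same number as the current progress
	 * inode; if the generations differ it is a brand new inode that
	 * will be created later, so the reference was not overwritten and
	 * no orphanized path must be used. */
	return other_gen == cur_gen;
}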
The following test case for fstests reproduces the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single extent of 64K.
mkdir -p $SCRATCH_MNT/foo
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" $SCRATCH_MNT/foo/bar \
| _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/mysnap1
_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \
$SCRATCH_MNT/mysnap2
echo "File digest before being replaced:"
md5sum $SCRATCH_MNT/mysnap1/foo/bar | _filter_scratch
# Remove the file and then create a new one in the same location with
# the same name but with different content. This new file ends up
# getting the same inode number as the previous one, because that inode
# number was the highest inode number used by the snapshot's root and
# therefore when attempting to find a new inode number for the new
# file, we end up reusing the same inode number. This happens because
# currently btrfs uses the highest inode number plus 1 for the
# first inode created once a snapshot's root is loaded (done at
# fs/btrfs/inode-map.c:btrfs_find_free_objectid in the linux kernel
# tree).
# Having these two different files in the snapshots with the same inode
# number (but different generation numbers) caused the btrfs send code
# to emit an incorrect path for the file when issuing an unlink
# operation because it failed to realize they were different files.
rm -f $SCRATCH_MNT/mysnap2/foo/bar
$XFS_IO_PROG -f -c "pwrite -S 0xbb 0 96K" \
$SCRATCH_MNT/mysnap2/foo/bar | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT/mysnap2 \
$SCRATCH_MNT/mysnap2_ro
_run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
_run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 \
$SCRATCH_MNT/mysnap2_ro -f $send_files_dir/2.snap
echo "File digest in the original filesystem after being replaced:"
md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch
# Now recreate the filesystem by receiving both send streams and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/1.snap
_run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/2.snap
echo "File digest in the new filesystem:"
# Must match the digest from the new file.
md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch
status=0
exit
Reported-by: Martin Raiber <martin@urbackup.org>
Fixes: 8b191a6849 ("Btrfs: incremental send, check if orphanized dir inode needs delayed rename")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
When the client goes to return a delegation, it should always update any
nfs4_state currently set up to use that delegation stateid to instead
use the open stateid. It already does do this in some cases,
particularly in the state recovery code, but not currently when the
delegation is voluntarily returned (e.g. in advance of a RENAME). This
causes the client to try to continue using the delegation stateid after
the DELEGRETURN, e.g. in LAYOUTGET.
Set the nfs4_state back to using the open stateid in
nfs4_open_delegation_recall, just before clearing the
NFS_DELEGATED_STATE bit.
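The order of operations can be sketched with made-up types (this is not
the NFS client's actual data structures): copy the open stateid back
into the state before clearing the delegated flag, so nothing keeps
using the delegation stateid after the DELEGRETURN.
/* Hypothetical sketch of the ordering only. */
#include <string.h>

struct stateid_sketch { unsigned char other[12]; unsigned int seqid; };

struct state_sketch {
	struct stateid_sketch stateid;		/* stateid currently in use */
	struct stateid_sketch open_stateid;	/* stateid from OPEN */
	unsigned long flags;
#define SKETCH_DELEGATED	0x1UL
};

static void delegation_recall(struct state_sketch *state)
{
	/* Switch back to the open stateid first ... */
	memcpy(&state->stateid, &state->open_stateid, sizeof(state->stateid));
	/* ... then clear the delegated flag. */
	state->flags &= ~SKETCH_DELEGATED;
}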
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Since commit 5cae02f427 an OPEN_CONFIRM should
have a privileged sequence in the recovery case to allow nograce recovery to
proceed for NFSv4.0.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We need to warn against broken NFSv4.1 servers that try to hand out
delegations in response to NFS4_OPEN_CLAIM_DELEG_CUR_FH.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Currently, we don't test if the state owner is in use before we try to
recover it. The problem is that if the refcount is zero, then the
state owner will be waiting on the lru list for garbage collection.
The expectation in that case is that if you bump the refcount, then
you must also remove the state owner from the lru list. Otherwise
the call to nfs4_put_state_owner will corrupt that list by trying
to add our state owner a second time.
Avoid the whole problem by just skipping state owners that hold no
state.
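A simplified sketch (made-up structure, not the real recovery loop) of
why skipping is the safe choice: owners that hold no state have nothing
to recover, so the refcount is never touched and an owner parked on the
LRU with a zero refcount is left undisturbed.
/* Hypothetical sketch; not the real NFSv4 state recovery code. */
struct owner_sketch {
	int refcount;	/* zero while parked on the LRU list */
	int nr_states;	/* how many opens/locks it still owns  */
};

static int want_recovery(struct owner_sketch *owner)
{
	/* Nothing to recover: never bump the refcount, so we never have
	 * to unlink the owner from the LRU and the later put cannot
	 * re-add it and corrupt the list. */
	if (owner->nr_states == 0)
		return 0;

	owner->refcount++;	/* pin it for the recovery pass */
	return 1;
}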
Reported-by: Andrew W Elble <aweits@rit.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If all other conditions in nfs_can_extend_write() are met, and there
are no locks, then we should be able to assume close-to-open semantics
and the ability to extend our write to cover the whole page.
With this patch, the xfstests generic/074 test completes in 242s instead
of >1400s on my test rig.
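The condition can be sketched roughly like this (hypothetical helper,
not the real nfs_can_extend_write()):
/* Sketch of the decision only; parameter names are made up. */
static int can_extend_write_to_whole_page(int other_checks_passed,
					  int file_has_locks)
{
	if (!other_checks_passed)
		return 0;
	/* With no posix/flock locks on the file, close-to-open semantics
	 * let us extend the write to cover the whole page and skip the
	 * read-modify-write cycle. */
	return !file_has_locks;
}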
Fixes: bd61e0a9c8 ("locks: convert posix locks to file_lock_context")
Cc: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Currently, we are crediting all the calls to nfs_writepages_callback()
(i.e. the nfs_writepages() callback) to nfs_writepage(). Aside from
being inconsistent with the behaviour of the equivalent readpage/readpages
accounting, this also means that we cannot distinguish between bulk writes
and single page writebacks (which confuses the 'nfsiostat -p' tool).
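A sketch of the accounting split (counter names invented for
illustration, not the real NFS iostat code): an invocation from
->writepage is counted separately from pages written back through
->writepages.
/* Hypothetical counters; not the real NFS iostat implementation. */
enum { STAT_WRITEPAGE, STAT_WRITEPAGES, STAT_MAX };

static unsigned long stats[STAT_MAX];

static void writepage(void)		/* single-page writeback entry */
{
	stats[STAT_WRITEPAGE]++;
	/* ... write the page ... */
}

static void writepages_callback(void)	/* called per page by ->writepages */
{
	stats[STAT_WRITEPAGES]++;	/* credited to the bulk counter,
					 * not to writepage */
	/* ... write the page ... */
}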
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The error paths in set_file_size for cifs and smb3 are incorrect.
In the unlikely event that a server did not support setting the file
size via set file info, the code incorrectly fell back to trying
SMBWriteX (note that only the original core SMB Write, used for example
by DOS, can set the file size this way - it does not work for the more
recent SMBWriteX). The idea was that, since the old DOS SMB Write can
set the file size if you write zero bytes at that offset, we could use
it if the server rejects the normal set file info call.
Fortunately the SMBWriteX will never be sent on the wire (except when
the file size is zero) since the length and offset fields were reversed
in the two places in this function that call SMBWriteX, causing the
fallback path to return an error. It is also important to never send
an SMB request on an SMB2/SMB3 session (which theoretically would be
possible, and can cause a brief session drop, although the client
recovers), so this should be fixed. In practice this path does not
happen with modern servers, but the error fallback to SMBWriteX is
clearly wrong.
Remove the calls to SMBWriteX in the error paths of cifs_set_file_size.
Pointed out by PaX/grsecurity team
Signed-off-by: Steve French <steve.french@primarydata.com>
Reported-by: PaX Team <pageexec@freemail.hu>
CC: Emese Revfy <re.emese@gmail.com>
CC: Brad Spengler <spender@grsecurity.net>
CC: Stable <stable@vger.kernel.org>
Commit 46c043ede4 ("mm: take i_mmap_lock in unmap_mapping_range() for
DAX") moved some code in __dax_pmd_fault() that was responsible for
zeroing newly allocated PMD pages. The new location didn't properly set
up 'kaddr', so when this code was run it resulted in a NULL pointer BUG.
Fix this by getting the correct 'kaddr' via bdev_direct_access().
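Roughly the ordering the fix restores, as a sketch with invented names
(not the actual __dax_pmd_fault() code): obtain the kernel mapping for
the newly allocated range through the block device's direct-access path
before dereferencing it to zero the memory.
/* Hypothetical sketch only. */
#include <stddef.h>
#include <string.h>

#define PMD_BYTES	(2UL * 1024 * 1024)

/* Stand-in for the driver's direct-access lookup. */
static void *direct_access(unsigned long sector)
{
	(void)sector;
	return NULL;	/* stub: a real driver returns the pmem mapping */
}

static int zero_new_pmd(unsigned long sector)
{
	void *kaddr = direct_access(sector);	/* must be set up first */

	if (!kaddr)
		return -1;	/* bail out instead of a NULL dereference */

	memset(kaddr, 0, PMD_BYTES);
	return 0;
}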
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To avoid the deadlock described in commit 084b6e7c76 ("btrfs: Fix a
lockdep warning when running xfstest."), we should move the kobject
handling out of the dev_replace lock range.
"It is because btrfs_kobj_{add/rm}_device() will allocate memory with
GFP_KERNEL, which may flush the fs page cache to free space, waiting for
itself to do the commit, causing the deadlock.
To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
dev_replace lock range, which also involves splitting the
btrfs_rm_dev_replace_srcdev() function into remove and free parts.
Now only btrfs_rm_dev_replace_remove_srcdev() is called inside the
dev_replace lock range, while kobj_{add/rm} and
btrfs_rm_dev_replace_free_srcdev() are called outside the lock range."
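The shape of the change, as a small user-space sketch (hypothetical
function names, not the real btrfs ones): only the part that must be
atomic stays under the lock, and the kobject/sysfs work that may
allocate with GFP_KERNEL runs after the lock is dropped.
/* Hypothetical sketch of splitting work around a lock. */
#include <pthread.h>

static pthread_mutex_t dev_replace_lock = PTHREAD_MUTEX_INITIALIZER;

static void remove_srcdev(void)	{ /* unlink the device; no allocations */ }
static void free_srcdev(void)	{ /* kobject add/rm and frees; may allocate */ }

static void finish_replace(void)
{
	pthread_mutex_lock(&dev_replace_lock);
	remove_srcdev();		/* inside the dev_replace lock range  */
	pthread_mutex_unlock(&dev_replace_lock);

	free_srcdev();			/* outside the lock range, so a
					 * GFP_KERNEL-style allocation cannot
					 * deadlock against ourselves */
}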
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[added lockup description]
Signed-off-by: David Sterba <dsterba@suse.com>
Originally the message was not in a helper but ended up there. We should
print error messages from callers instead.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[reworded subject and changelog]
Signed-off-by: David Sterba <dsterba@suse.com>
As a general rule of thumb, there shouldn't be any way that user land
could trigger a kernel operation just by sending wrong arguments. So do
the commit cleanups only after the user input has been verified.
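A trivial sketch of that ordering (helper names are made up): reject bad
input before starting any expensive kernel-side work.
/* Hypothetical sketch; not the real ioctl handler. */
#include <errno.h>

static int argument_is_valid(const char *arg) { return arg && arg[0]; }
static void run_pending_commit(void)          { /* expensive cleanup work */ }
static int do_operation(const char *arg)      { (void)arg; return 0; }

static int handle_ioctl(const char *user_arg)
{
	if (!argument_is_valid(user_arg))
		return -EINVAL;		/* wrong arguments trigger no work */

	run_pending_commit();		/* only after validation */
	return do_operation(user_arg);
}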
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch updates and renames btrfs_scratch_superblocks (which is used
by the device replace thread) with the fixes from the scratch superblock
code section of btrfs_rm_device(). The fixes are:
Scratch all copies of the superblock
Notify the kobject that the superblock has been changed
Update the time on the device
With this, btrfs_rm_device() can use btrfs_scratch_superblocks() instead
of its own scratch code, and the device replace code, which similarly
releases a device back to the system, will get the fixes from the btrfs
device delete path as well.
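A condensed sketch of what the shared helper does (structure and names
invented for illustration, not the real btrfs_scratch_superblocks()):
loop over every superblock copy on the device, invalidate it, signal
that the superblock changed, and update the device's timestamp.
/* Hypothetical sketch only. */
#include <string.h>

#define SB_COPIES	3

struct sb_copy { unsigned char magic[8]; };

struct device_sketch {
	struct sb_copy	sb[SB_COPIES];
	long		mtime;
	int		changed;	/* "kobject notified" stand-in */
};

static void scratch_superblocks(struct device_sketch *dev, long now)
{
	for (int i = 0; i < SB_COPIES; i++)
		memset(dev->sb[i].magic, 0, sizeof(dev->sb[i].magic));

	dev->changed = 1;	/* tell interested parties the sb changed */
	dev->mtime = now;	/* update time on the device */
}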
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[renamed to btrfs_scratch_superblock]
Signed-off-by: David Sterba <dsterba@suse.com>
This factors a chunk of code out of btrfs_read_dev_super() into a new
function, btrfs_read_dev_one_super(), so that the next patch can use it
for scratching superblocks.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[renamed bufhead to bh]
Signed-off-by: David Sterba <dsterba@suse.com>
Use the btrfs-specific error code BTRFS_ERROR_DEV_MISSING_NOT_FOUND
instead of -ENOENT. Also remove the logging when the user specifies
"missing" and we don't find it in the kernel device list. Logging is for
system events, not for user input errors.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'upstream-4.3-rc4' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS fixes from Richard Weinberger:
"This contains three bug fixes for both UBI and UBIFS"
* tag 'upstream-4.3-rc4' of git://git.infradead.org/linux-ubifs:
UBI: return ENOSPC if no enough space available
UBI: Validate data_size
UBIFS: Kill unneeded locking in ubifs_init_security