linux

Author	SHA1	Message	Date
Filipe Manana	f4dfe68710	Btrfs: fix extent_same allowing destination offset beyond i_size When using the same file as the source and destination for a dedup (extent_same ioctl) operation we were allowing it to dedup to a destination offset beyond the file's size, which doesn't make sense and it's not allowed for the case where the source and destination files are not the same file. This made de deduplication operation successful only when the source range corresponded to a hole, a prealloc extent or an extent with all bytes having a value of 0x00. This was also leaving a file hole (between i_size and destination offset) without the corresponding file extent items, which can be reproduced with the following steps for example: $ mkfs.btrfs -f /dev/sdi $ mount /dev/sdi /mnt/sdi $ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar wrote 404990/404990 bytes at offset 304457 395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec) $ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792 Deduping 2 total files (28672, 24576): /mnt/sdi/foobar (929792, 24576): /mnt/sdi/foobar 1 files asked to be deduped i: 0, status: 0, bytes_deduped: 24576 24576 total bytes deduped in this operation $ umount /mnt/sdi $ btrfsck /dev/sdi Checking filesystem on /dev/sdi UUID: 98c528aa-0833-427d-9403-b98032ffbf9d checking extents checking free space cache checking fs roots root 5 inode 257 errors 100, file extent discount Found file extent holes: start: 712704, len: 217088 found 540673 bytes used err is 1 total csum bytes: 400 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 123675 file data blocks allocated: 671744 referenced 671744 btrfs-progs v4.2.3 So fix this by not allowing the destination to go beyond the file's size, just as we do for the same where the source and destination files are not the same. A test for xfstests follows. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-03-01 08:23:33 -08:00
Filipe Manana	2be63d5ce9	Btrfs: fix file loss on log replay after renaming a file and fsync We have two cases where we end up deleting a file at log replay time when we should not. For this to happen the file must have been renamed and a directory inode must have been fsynced/logged. Two examples that exercise these two cases are listed below. Case 1) $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir -p /mnt/a/b $ mkdir /mnt/c $ touch /mnt/a/b/foo $ sync $ mv /mnt/a/b/foo /mnt/c/ # Create file bar just to make sure the fsync on directory a/ does # something and it's not a no-op. $ touch /mnt/a/bar $ xfs_io -c "fsync" /mnt/a < power fail / crash > The next time the filesystem is mounted, the log replay procedure deletes file foo. Case 2) $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/a $ mkdir /mnt/b $ mkdir /mnt/c $ touch /mnt/a/foo $ ln /mnt/a/foo /mnt/b/foo_link $ touch /mnt/b/bar $ sync $ unlink /mnt/b/foo_link $ mv /mnt/b/bar /mnt/c/ $ xfs_io -c "fsync" /mnt/a/foo < power fail / crash > The next time the filesystem is mounted, the log replay procedure deletes file bar. The reason why the files are deleted is because when we log inodes other then the fsync target inode, we ignore their last_unlink_trans value and leave the log without enough information to later replay the rename operations. So we need to look at the last_unlink_trans values and fallback to a transaction commit if they are greater than the id of the last committed transaction. So fix this by looking at the last_unlink_trans values and fallback to transaction commits when needed. Also, when logging other inodes (for case 1 we logged descendants of the fsync target inode while for case 2 we logged ascendants) we need to care about concurrent tasks updating the last_unlink_trans of inodes we are logging (which was already an existing problem in check_parent_dirs_for_sync()). Since we can not acquire their inode mutex (vfs' struct inode ->i_mutex), as that causes deadlocks with other concurrent operations that acquire the i_mutex of 2 inodes (other fsyncs or renames for example), we need to serialize on the log_mutex of the inode we are logging. A task setting a new value for an inode's last_unlink_trans must acquire the inode's log_mutex and it must do this update before doing the actual unlink operation (which is already the case except when deleting a snapshot). Conversely the task logging the inode must first log the inode and then check the inode's last_unlink_trans value while holding its log_mutex, as if its value is not greater then the id of the last committed transaction it means it logged a safe state of the inode's items, while if its value is not smaller then the id of the last committed transaction it means the inode state it has logged might not be safe (the concurrent task might have just updated last_unlink_trans but hasn't done yet the unlink operation) and therefore a transaction commit must be done. Test cases for xfstests follow in separate patches. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-03-01 08:23:29 -08:00
Filipe Manana	1ec9a1ae1e	Btrfs: fix unreplayable log after snapshot delete + parent dir fsync If we delete a snapshot, fsync its parent directory and crash/power fail before the next transaction commit, on the next mount when we attempt to replay the log tree of the root containing the parent directory we will fail and prevent the filesystem from mounting, which is solvable by wiping out the log trees with the btrfs-zero-log tool but very inconvenient as we will lose any data and metadata fsynced before the parent directory was fsynced. For example: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/testdir $ btrfs subvolume snapshot /mnt /mnt/testdir/snap $ btrfs subvolume delete /mnt/testdir/snap $ xfs_io -c "fsync" /mnt/testdir < crash / power failure and reboot > $ mount /dev/sdc /mnt mount: mount(2) failed: No such file or directory And in dmesg/syslog we get the following message and trace: [192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257 [192066.363010] ------------[ cut here ]------------ [192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]() [192066.367250] BTRFS: Transaction aborted (error -2) [192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: G W 4.4.0-rc6-btrfs-next-20+ #1 [192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [192066.380889] 0000000000000000 ffff880143923670 ffffffff81257570 ffff8801439236b8 [192066.382561] ffff8801439236a8 ffffffff8104ec07 ffffffffa039dc2c 00000000fffffffe [192066.384191] ffff8801ed31d000 ffff8801b9fc9c88 ffff8801086875e0 ffff880143923710 [192066.385827] Call Trace: [192066.386373] [<ffffffff81257570>] dump_stack+0x4e/0x79 [192066.387387] [<ffffffff8104ec07>] warn_slowpath_common+0x99/0xb2 [192066.388429] [<ffffffffa039dc2c>] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs] [192066.389236] [<ffffffff8104ec68>] warn_slowpath_fmt+0x48/0x50 [192066.389884] [<ffffffffa039dc2c>] __btrfs_unlink_inode+0x17a/0x354 [btrfs] [192066.390621] [<ffffffff81184b55>] ? iput+0xb0/0x266 [192066.391200] [<ffffffffa039ea25>] btrfs_unlink_inode+0x1c/0x3d [btrfs] [192066.391930] [<ffffffffa03ca623>] check_item_in_log+0x1fe/0x29b [btrfs] [192066.392715] [<ffffffffa03ca827>] replay_dir_deletes+0x167/0x1cf [btrfs] [192066.393510] [<ffffffffa03cccc7>] replay_one_buffer+0x417/0x570 [btrfs] [192066.394241] [<ffffffffa03ca164>] walk_up_log_tree+0x10e/0x1dc [btrfs] [192066.394958] [<ffffffffa03cac72>] walk_log_tree+0xa5/0x190 [btrfs] [192066.395628] [<ffffffffa03ce8b8>] btrfs_recover_log_trees+0x239/0x32c [btrfs] [192066.396790] [<ffffffffa03cc8b0>] ? replay_one_extent+0x50a/0x50a [btrfs] [192066.397891] [<ffffffffa0394041>] open_ctree+0x1d8b/0x2167 [btrfs] [192066.398897] [<ffffffffa03706e1>] btrfs_mount+0x5ef/0x729 [btrfs] [192066.399823] [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf [192066.400739] [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3 [192066.401700] [<ffffffff811714b9>] mount_fs+0x67/0x131 [192066.402482] [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde [192066.403930] [<ffffffffa03702bd>] btrfs_mount+0x1cb/0x729 [btrfs] [192066.404831] [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf [192066.405726] [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3 [192066.406621] [<ffffffff811714b9>] mount_fs+0x67/0x131 [192066.407401] [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde [192066.408247] [<ffffffff8118ae36>] do_mount+0x893/0x9d2 [192066.409047] [<ffffffff8113009b>] ? strndup_user+0x3f/0x8c [192066.409842] [<ffffffff8118b187>] SyS_mount+0x75/0xa1 [192066.410621] [<ffffffff8147e517>] entry_SYSCALL_64_fastpath+0x12/0x6b [192066.411572] ---[ end trace 2de42126c1e0a0f0 ]--- [192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry [192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree) [192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30 [192066.444613] BTRFS: open_ctree failed This happens because when we are replaying the log and processing the directory entry pointing to the snapshot in the subvolume tree, we treat its btrfs_dir_item item as having a location with a key type matching BTRFS_INODE_ITEM_KEY, which is wrong because the type matches BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the object id refers to a root number and not to an inode in the root containing the parent directory. So fix this by triggering a transaction commit if an fsync against the parent directory is requested after deleting a snapshot. This is the simplest approach for a rare use case. Some alternative that avoids the transaction commit would require more code to explicitly delete the snapshot at log replay time (factoring out common code from ioctl.c: btrfs_ioctl_snap_destroy()), special care at fsync time to remove the log tree of the snapshot's root from the log root of the root of tree roots, amongst other steps. A test case for xfstests that triggers the issue follows. seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_dm_target flakey _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create a snapshot at the root of our filesystem (mount point path), delete it, # fsync the mount point path, crash and mount to replay the log. This should # succeed and after the filesystem is mounted the snapshot should not be visible # anymore. _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1 _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT _flakey_drop_and_remount [ -e $SCRATCH_MNT/snap1 ] && \ echo "Snapshot snap1 still exists after log replay" # Similar scenario as above, but this time the snapshot is created inside a # directory and not directly under the root (mount point path). mkdir $SCRATCH_MNT/testdir _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2 _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir _flakey_drop_and_remount [ -e $SCRATCH_MNT/testdir/snap2 ] && \ echo "Snapshot snap2 still exists after log replay" _unmount_flakey echo "Silence is golden" status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-03-01 08:23:25 -08:00
Chris Mason	c05c5ee5ea	Btrfs patchsets for 4.6 -----BEGIN PGP SIGNATURE----- iQIcBAABCgAGBQJW0GSnAAoJEMVl1fnXbVg75qAP/0xbZPJtvTgRMSRnARtFJ28w vCsxqY+AatNJDuEpg2My/vscZvAVXGcTWjnM8NkXMMKN+oags47QN4qD0cuNv2kI JWcz7Ppt3GY6lcQbTj/Ce6N8RPRCNGsU7vxev+sKZ+jjXn+vuc+wKXnyJgaL1qcN XhcP2MccrXTVVJXLbGMFoaJXWWfd2i9uJ2MplmjFP7HQi5zP+5t/dsVaAQbc1dqx 2TqgTJkUEPQqK8geAKom5wdLTmpLSgMWvg1m4lkYpDO89Fi+hFAKeeuJZvNutxVa hA0QLrLyZmr4tbZhM1of35Kl7N1uwCzOd8u6xsxurB12bibz67RbQpK+fazlCjKa wZJvJV+N3gqgCusLHlXYX0YalQxpWRQiKkjzpMy3Pq4K4soLrw20tQOnnBFhLR1y ZwqmZUN33lhFNCIWqLS4BLqDG+Z7Sf2aGhFtspMDjSUJe9gLbIpvH9sW6CexJI2r FnxTaVZ08uY0ky1dvZcRDR6zDDbVUpoQKWmwdZpxoEO1eLKjD01VsMOw5zlAaxdc a5SxKMVt0Gq56oTPgp0MuLHJr20pxx03yr+yl69VM8R1dAG/y61Dq5DwiFNQ8+J6 jrX+eVYGBgTNYw/UGb14UPwVjQFFEs/vouphy6MmOVvNz+YZI6thN1uScB0vw7BV p/oFts5Fo0ipJgaBzGu4 =CRdD -----END PGP SIGNATURE----- Merge tag 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6 Btrfs patchsets for 4.6	2016-03-01 08:13:56 -08:00
David Sterba	f5bc27c71a	Merge branch 'dev/control-ioctl' into for-chris-4.6	2016-02-26 15:38:34 +01:00
David Sterba	fa695b01bc	Merge branch 'misc-4.6' into for-chris-4.6 # Conflicts: # fs/btrfs/file.c	2016-02-26 15:38:34 +01:00
David Sterba	f004fae0cf	Merge branch 'cleanups-4.6' into for-chris-4.6	2016-02-26 15:38:33 +01:00
David Sterba	675d276b32	Merge branch 'foreign/liubo/replace-lockup' into for-chris-4.6	2016-02-26 15:38:32 +01:00
David Sterba	e9ddd77a31	Merge branch 'foreign/josef/space-updates' into for-chris-4.6	2016-02-26 15:38:31 +01:00
David Sterba	ff7db6e05a	Merge branch 'foreign/zhaolei/reada' into for-chris-4.6	2016-02-26 15:38:30 +01:00
David Sterba	23c1a966f2	Merge branch 'foreign/qu/norecovery-v7' into for-chris-4.6	2016-02-26 15:38:30 +01:00
David Sterba	67d605fec1	Merge branch 'dev/rename-keys' into for-chris-4.6	2016-02-26 15:38:29 +01:00
David Sterba	e22b3d1fbe	Merge branch 'dev/gfp-flags' into for-chris-4.6	2016-02-26 15:38:28 +01:00
David Sterba	5f1b5664d9	Merge branch 'chandan/prep-subpage-blocksize' into for-chris-4.6 # Conflicts: # fs/btrfs/file.c	2016-02-26 15:38:28 +01:00
Liu Bo	73beece9ca	Btrfs: fix lockdep deadlock warning due to dev_replace Xfstests btrfs/011 complains about a deadlock warning, [ 1226.649039] ========================================================= [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ] [ 1226.649039] 4.1.0+ #270 Not tainted [ 1226.649039] --------------------------------------------------------- [ 1226.652955] kswapd0/46 just changed the state of lock: [ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past: [ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.} and interrupts could create inverse lock ordering between them. [ 1226.652955] other info that might help us debug this: [ 1226.652955] Chain exists of: &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock [ 1226.652955] Possible interrupt unsafe locking scenario: [ 1226.652955] CPU0 CPU1 [ 1226.652955] ---- ---- [ 1226.652955] lock(&fs_info->dev_replace.lock); [ 1226.652955] local_irq_disable(); [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] lock(&found->groups_sem); [ 1226.652955] <Interrupt> [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] * DEADLOCK * Commit `084b6e7c76` ("btrfs: Fix a lockdep warning when running xfstest.") tried to fix a similar one that has the exactly same warning, but with that, we still run to this. The above lock chain comes from btrfs_commit_transaction ->btrfs_run_delayed_items ... ->__btrfs_update_delayed_inode ... ->__btrfs_cow_block ... ->find_free_extent ->cache_block_group ->load_free_space_cache ->btrfs_readpages ->submit_one_bio ... ->__btrfs_map_block ->btrfs_dev_replace_lock However, with high memory pressure, tasks which hold dev_replace.lock can be interrupted by kswapd and then kswapd is intended to release memory occupied by superblock, inodes and dentries, where we may call evict_inode, and it comes to [ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30 [ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700 delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads to a ABBA deadlock. To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but things are simpler here since we only needs read's spinlock to blocking lock. With this, btrfs/011 no more produces warnings in dmesg. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-23 13:10:10 +01:00
David Sterba	d5131b658c	btrfs: drop unused argument in btrfs_ioctl_get_supported_features Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-23 12:56:35 +01:00
David Sterba	c5868f8362	btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls The control device is accessible when no filesystem is mounted and we may want to query features supported by the module. This is already possible using the sysfs files, this ioctl is for parity and convenience. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-23 12:56:21 +01:00
David Sterba	f7e98a7fff	btrfs: change max_inline default to 2048 The current practical default is ~4k on x86_64 (the logic is more complex, simplified for brevity), the inlined files land in the metadata group and thus consume space that could be needed for the real metadata. The inlining brings some usability surprises: 1) total space consumption measured on various filesystems and btrfs with DUP metadata was quite visible because of the duplicated data within metadata 2) inlined data may exhaust the metadata, which are more precious in case the entire device space is allocated to chunks (ie. balance cannot make the space more compact) 3) performance suffers a bit as the inlined blocks are duplicate and stored far away on the device. Proposed fix: set the default to 2048 This fixes namely 1), the total filesysystem space consumption will be on par with other filesystems. Partially fixes 2), more data are pushed to the data block groups. The characteristics of 3) are based on actual small file size distribution. The change is independent of the metadata blockgroup type (though it's most visible with DUP) or system page size as these parameters are not trival to find out, compared to file size. Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-23 12:55:27 +01:00
David Sterba	11ea474f74	btrfs: remove error message from search ioctl for nonexistent tree Let's remove the error message that appears when the tree_id is not present. This can happen with the quota tree and has been observed in practice. The applications are supposed to handle -ENOENT and we don't need to report that in the system log as it's not a fatal error. Reported-by: Vlastimil Babka <vbabka@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-23 12:54:48 +01:00
Arnd Bergmann	f827ba9a64	btrfs: avoid uninitialized variable warning With CONFIG_SMP and CONFIG_PREEMPT both disabled, gcc decides to partially inline the get_state_failrec() function but cannot figure out that means the failrec pointer is always valid if the function returns success, which causes a harmless warning: fs/btrfs/extent_io.c: In function 'clean_io_failure': fs/btrfs/extent_io.c:2131:4: error: 'failrec' may be used uninitialized in this function [-Werror=maybe-uninitialized] This marks get_state_failrec() and set_state_failrec() both as 'noinline', which avoids the warning in all cases for me, and seems less ugly than adding a fake initialization. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `47dc196ae7` ("btrfs: use proper type for failrec in extent_state") Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-23 12:42:46 +01:00
Linus Torvalds	ce6b71432d	Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "My for-linus-4.5 branch has a btrfs DIO error passing fix. I know how much you love DIO, so I'm going to suggest against reading it. We'll follow up with a patch to drop the error arg from dio_end_io in the next merge window." * 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix direct IO requests not reporting IO error to user space	2016-02-19 13:40:42 -08:00
Kinglong Mee	aa66b0bb08	btrfs: fix memory leak of fs_info in block group cache When starting up linux with btrfs filesystem, I got many memory leak messages by kmemleak as, unreferenced object 0xffff880066882000 (size 4096): comm "modprobe", pid 730, jiffies 4294690024 (age 196.599s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8174d52e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811d09aa>] kmem_cache_alloc_trace+0xea/0x1e0 [<ffffffffa03620fb>] btrfs_alloc_dummy_fs_info+0x6b/0x2a0 [btrfs] [<ffffffffa03624fc>] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs] [<ffffffffa0360aa9>] btrfs_test_free_space_cache+0x39/0xed0 [btrfs] [<ffffffffa03b5a74>] trace_raw_output_xfs_attr_class+0x54/0xe0 [xfs] [<ffffffff81002122>] do_one_initcall+0xb2/0x1f0 [<ffffffff811765aa>] do_init_module+0x5e/0x1e9 [<ffffffff810fec09>] load_module+0x20a9/0x2690 [<ffffffff810ff439>] SyS_finit_module+0xb9/0xf0 [<ffffffff81757daf>] entry_SYSCALL_64_fastpath+0x12/0x76 [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff8800573f8000 (size 10256): comm "modprobe", pid 730, jiffies 4294690185 (age 196.460s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8174d52e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff8119ca6e>] kmalloc_order+0x5e/0x70 [<ffffffff8119caa4>] kmalloc_order_trace+0x24/0x90 [<ffffffffa03620b3>] btrfs_alloc_dummy_fs_info+0x23/0x2a0 [btrfs] [<ffffffffa03624fc>] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs] [<ffffffffa036603d>] run_test+0xfd/0x320 [btrfs] [<ffffffffa0366f34>] btrfs_test_free_space_tree+0x94/0xee [btrfs] [<ffffffffa03b5aab>] trace_raw_output_xfs_attr_class+0x8b/0xe0 [xfs] [<ffffffff81002122>] do_one_initcall+0xb2/0x1f0 [<ffffffff811765aa>] do_init_module+0x5e/0x1e9 [<ffffffff810fec09>] load_module+0x20a9/0x2690 [<ffffffff810ff439>] SyS_finit_module+0xb9/0xf0 [<ffffffff81757daf>] entry_SYSCALL_64_fastpath+0x12/0x76 [<ffffffffffffffff>] 0xffffffffffffffff This patch lets btrfs using fs_info stored in btrfs_root for block group cache directly without allocating a new one. Fixes: `d0bd456074` ("Btrfs: add fragment=* debug mount option") Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 13:28:24 +01:00
Zhao Lei	4da2e26a2a	btrfs: Continue write in case of can_not_nocow btrfs failed in xfstests btrfs/080 with -o nodatacow. Can be reproduced by following script: DEV=/dev/vdg MNT=/mnt/tmp umount $DEV &>/dev/null mkfs.btrfs -f $DEV mount -o nodatacow $DEV $MNT dd if=/dev/zero of=$MNT/test bs=1 count=2048 & btrfs subvolume snapshot -r $MNT $MNT/test_snap & wait -- We can see dd failed on NO_SPACE. Reason: __btrfs_buffered_write should run cow write when no_cow impossible, and current code is designed with above logic. But check_can_nocow() have 2 type of return value(0 and <0) on can_not_no_cow, and current code only continue write on first case, the second case happened in doing subvolume. Fix: Continue write when check_can_nocow() return 0 and <0. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>	2016-02-18 13:18:06 +01:00
Kinglong Mee	5598e9005a	btrfs: drop null testing before destroy functions Cleanup. kmem_cache_destroy has support NULL argument checking, so drop the double null testing before calling it. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 11:46:03 +01:00
Sudip Mukherjee	89771cc98c	btrfs: fix build warning We were getting build warning about: fs/btrfs/extent-tree.c:7021:34: warning: ‘used_bg’ may be used uninitialized in this function It is not a valid warning as used_bg is never used uninitilized since locked is initially false so we can never be in the section where 'used_bg' is used. But gcc is not able to understand that and we can initialize it while declaring to silence the warning. Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 11:46:03 +01:00
David Sterba	47dc196ae7	btrfs: use proper type for failrec in extent_state We use the private member of extent_state to store the failrec and play pointless pointer games. Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 11:46:03 +01:00
Deepa Dinamani	04b285f35e	btrfs: Replace CURRENT_TIME by current_fs_time() CURRENT_TIME macro is not appropriate for filesystems as it doesn't use the right granularity for filesystem timestamps. Use current_fs_time() instead. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: linux-btrfs@vger.kernel.org Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 11:46:03 +01:00
Dave Jones	8f682f6955	btrfs: remove open-coded swap() in backref.c:__merge_refs The kernel provides a swap() that does the same thing as this code. Signed-off-by: Dave Jones <dsj@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 11:45:55 +01:00
Byongho Lee	ac1407ba24	btrfs: remove redundant error check While running btrfs_mksubvol(), d_really_is_positive() is called twice. First in btrfs_mksubvol() and second inside btrfs_may_create(). So I remove the first one. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 11:35:27 +01:00
Byongho Lee	0138b6fe8f	btrfs: simplify expression in btrfs_calc_trans_metadata_size() Simplify expression in btrfs_calc_trans_metadata_size(). Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 11:33:17 +01:00
Josef Bacik	baee879064	Btrfs: check reserved when deciding to background flush We will sometimes start background flushing the various enospc related things (delayed nodes, delalloc, etc) if we are getting close to reserving all of our available space. We don't want to do this however when we are actually using this space as it causes unneeded thrashing. We currently try to do this by checking bytes_used >= thresh, but bytes_used is only part of the equation, we need to use bytes_reserved as well as this represents space that is very likely to become bytes_used in the future. My tracing tool will keep count of the number of times we kick off the async flusher, the following are counts for the entire run of generic/027 No Patch Patch avg: 5385 5009 median: 5500 4916 We skewed lower than the average with my patch and higher than the average with the patch, overall it cuts the flushing from anywhere from 5-10%, which in the case of actual ENOSPC is quite helpful. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 11:29:43 +01:00
Josef Bacik	88d3a5aaf6	Btrfs: add transaction space reservation tracepoints There are a few places where we add to trans->bytes_reserved but don't have the corresponding trace point. With these added my tool no longer sees transaction leaks. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 11:22:41 +01:00
Josef Bacik	dc95f7bfc5	Btrfs: fix truncate_space_check truncate_space_check is using btrfs_csum_bytes_to_leaves() but forgetting to multiply by nodesize so we get an actual byte count. We need a tracepoint here so that we have the matching reserve for the release that will come later. Also add a comment to make clear what the intent of truncate_space_check is. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 11:22:24 +01:00
Josef Bacik	fb4b10e5d5	Btrfs: change how we update the global block rsv I'm writing a tool to visualize the enospc system in order to help debug enospc bugs and I found weird data and ran it down to when we update the global block rsv. We add all of the remaining free space to the block rsv, do a trace event, then remove the extra and do another trace event. This makes my visualization look silly and is unintuitive code as well. Fix this stuff to only add the amount we are missing, or free the amount we are missing. This is less clean to read but more explicit in what it is doing, as well as only emitting events for values that make sense. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 11:21:48 +01:00
Zhao Lei	7aff8cf4a6	btrfs: reada: ignore creating reada_extent for a non-existent device For a non-existent device, old code bypasses adding it in dev's reada queue. And to solve problem of unfinished waitting in raid5/6, commit `5fbc7c59fd` ("Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting") adding an exception for the first stripe, in short, the first stripe will always be processed whether the device exists or not. Actually we have a better way for the above request: just bypass creation of the reada_extent for non-existent device, it will make code simple and effective. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 10:27:23 +01:00
Zhao Lei	4fe7a0e138	btrfs: reada: avoid undone reada extents in btrfs_reada_wait Reada background works is not designed to finish all jobs completely, it will break in following case: 1: When a device reaches workload limit (MAX_IN_FLIGHT) 2: Total reads reach max limit (10000) 3: All devices don't have queued more jobs, often happened in DUP case And if all background works exit with remaining jobs, btrfs_reada_wait() will wait indefinetelly. Above problem is rarely happened in old code, because: 1: Every work queues 2x new works So many works reduced chances of undone jobs. 2: One work will continue 10000 times loop in case of no-jobs It reduced no-thread window time. But after we fixed above case, the "undone reada extents" frequently happened. Fix: Check to ensure we have at least one thread if there are undone jobs in btrfs_reada_wait(). Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 10:27:23 +01:00
Zhao Lei	2fefd5583f	btrfs: reada: limit max works count Reada creates 2 works for each level of tree recursively. In case of a tree having many levels, the number of created works is 2^level_of_tree. Actually we don't need so many works in parallel, this patch limits max works to BTRFS_MAX_MIRRORS * 2. The per-fs works_counter will be also used for btrfs_reada_wait() to check is there are background workers. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 10:27:23 +01:00
Zhao Lei	895a11b868	btrfs: reada: simplify dev->reada_in_flight processing No need to decrease dev->reada_in_flight in __readahead_hook()'s internal and reada_extent_put(). reada_extent_put() have no chance to decrease dev->reada_in_flight in free operation, because reada_extent have additional refcnt when scheduled to a dev. We can put inc and dec operation for dev->reada_in_flight to one place instead to make logic simple and safe, and move useless reada_extent->scheduled_for to a bool flag instead. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 10:27:23 +01:00
Zhao Lei	8afd6841e1	btrfs: reada: Fix a debug code typo Remove one copy of loop to fix the typo of iterate zones. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 10:26:12 +01:00
Zhao Lei	57f16e0826	btrfs: reada: Jump into cleanup in direct way for __readahead_hook() Current code set nritems to 0 to make for_loop useless to bypass it, and set generation's value which is not necessary. Jump into cleanup directly is better choise. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 10:26:12 +01:00
Zhao Lei	02873e4325	btrfs: reada: Use fs_info instead of root in __readahead_hook's argument What __readahead_hook() need exactly is fs_info, no need to convert fs_info to root in caller and convert back in __readahead_hook() Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 10:26:12 +01:00
Zhao Lei	6e39dbe8b9	btrfs: reada: Pass reada_extent into __readahead_hook directly reada_start_machine_dev() already have reada_extent pointer, pass it into __readahead_hook() directly instead of search radix_tree will make code run faster. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 10:26:12 +01:00
Zhao Lei	b257cf5006	btrfs: reada: move reada_extent_put to place after __readahead_hook() We can't release reada_extent earlier than __readahead_hook(), because __readahead_hook() still need to use it, it is necessary to hode a refcnt to avoid it be freed. Actually it is not a problem after my patch named: Avoid many times of empty loop It make reada_extent in above line include at least one reada_extctl, which keeps additional one refcnt for reada_extent. But we still need this patch to make the code in pretty logic. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 10:26:12 +01:00
Zhao Lei	1e7970c0f3	btrfs: reada: Remove level argument in severial functions level is not used in severial functions, remove them from arguments, and remove relative code for get its value. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 10:26:12 +01:00
Zhao Lei	3194502118	btrfs: reada: bypass adding extent when all zone failed When failed adding all dev_zones for a reada_extent, the extent will have no chance to be selected to run, and keep in memory for ever. We should bypass this extent to avoid above case. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 10:26:12 +01:00
Zhao Lei	6a159d2ae4	btrfs: reada: add all reachable mirrors into reada device list If some device is not reachable, we should bypass and continus addingb next, instead of break on bad device. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 10:26:12 +01:00
Zhao Lei	a3f7fde243	btrfs: reada: Move is_need_to_readahead contition earlier Move is_need_to_readahead contition earlier to avoid useless loop to get relative data for readahead. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 10:26:10 +01:00
Zhao Lei	97d5f0e63d	btrfs: reada: Avoid many times of empty loop We can see following loop(10000 times) in trace_log: [ 75.416137] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 75.417413] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 [ 75.418611] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 75.419793] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 [ 75.421016] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 75.422324] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 [ 75.423661] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 75.424882] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 ...(10000 times) [ 124.101672] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 124.102850] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 [ 124.104008] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 124.105121] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 Reason: If more than one user trigger reada in same extent, the first task finished setting of reada data struct and call reada_start_machine() to start, and the second task only add a ref_count but have not add reada_extctl struct completely, the reada_extent can not finished all jobs, and will be selected in __reada_start_machine() for 10000 times(total times in __reada_start_machine()). Fix: For a reada_extent without job, we don't need to run it, just return 0 to let caller break. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-16 13:21:45 +01:00
Zhao Lei	8e9aa51f54	btrfs: reada: Add missed segment checking in reada_find_zone In rechecking zone-in-tree, we still need to check zone include our logical address. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-16 13:21:45 +01:00
Zhao Lei	c37f49c7ef	btrfs: reada: reduce additional fs_info->reada_lock in reada_find_zone We can avoid additional locking-acquirment and one pair of kref_get/put by combine two condition. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-16 13:21:45 +01:00

1 2 3 4 5 ...

5294 Commits