Commit Graph

3091 Commits

Author SHA1 Message Date
Chao Yu
7de36cf3e4 f2fs: fix to recover inode's i_gc_failures during POR
inode.i_gc_failures is used to indicate that skip count of migrating
on blocks of inode, we should guarantee it can be recovered in sudden
power-off case.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:33 -07:00
Chao Yu
19c73a691c f2fs: fix to recover inode's i_flags during POR
Testcase to reproduce this bug:
1. mkfs.f2fs /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. sync
5. chattr +A /mnt/f2fs/file
6. xfs_io -f /mnt/f2fs/file -c "fsync"
7. godown /mnt/f2fs
8. umount /mnt/f2fs
9. mount -t f2fs /dev/sdd /mnt/f2fs
10. lsattr /mnt/f2fs/file

-----------------N- /mnt/f2fs/file

But actually, we expect the corrct result is:

-------A---------N- /mnt/f2fs/file

The reason is we didn't recover inode.i_flags field during mount,
fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:33 -07:00
Chao Yu
f4474aa6e5 f2fs: fix to recover inode's project id during POR
Testcase to reproduce this bug:
1. mkfs.f2fs -O extra_attr -O project_quota /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. sync
5. chattr -p 1 /mnt/f2fs/file
6. xfs_io -f /mnt/f2fs/file -c "fsync"
7. godown /mnt/f2fs
8. umount /mnt/f2fs
9. mount -t f2fs /dev/sdd /mnt/f2fs
10. lsattr -p /mnt/f2fs/file

    0 -----------------N- /mnt/f2fs/file

But actually, we expect the correct result is:

    1 -----------------N- /mnt/f2fs/file

The reason is we didn't recover inode.i_projid field during mount,
fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:33 -07:00
Jaegeuk Kim
0a4daae5ff f2fs: update i_size after DIO completion
This is related to
ee70daaba8 ("xfs: update i_size after unwritten conversion in dio completion")

If we update i_size during dio_write, dio_read can read out stale data, which
breaks xfstests/465.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:34 -07:00
Jaegeuk Kim
d83d0f5ba8 f2fs: report ENOENT correctly in f2fs_rename
This fixes wrong error report in f2fs_rename.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:38:11 -07:00
Chengguang Xu
c6b1867b1d f2fs: fix remount problem of option io_bits
Currently we show mount option "io_bits=%u" as "io_size=%uKB",
it will cause option parsing problem(unrecognized mount option)
in remount.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-25 18:44:56 -07:00
Chao Yu
dc4cd1257c f2fs: fix to recover inode's uid/gid during POR
Step to reproduce this bug:
1. logon as root
2. mount -t f2fs /dev/sdd /mnt;
3. touch /mnt/file;
4. chown system /mnt/file; chgrp system /mnt/file;
5. xfs_io -f /mnt/file -c "fsync";
6. godown /mnt;
7. umount /mnt;
8. mount -t f2fs /dev/sdd /mnt;

After step 8) we will expect file's uid/gid are all system, but during
recovery, these two fields were not been recovered, fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-20 14:50:25 -07:00
Jaegeuk Kim
f84262b086 f2fs: avoid infinite loop in f2fs_alloc_nid
If we have an error in f2fs_build_free_nids, we're able to fall into a loop
to find free nids.

Suggested-by: Chao Yu <chao@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-20 14:49:19 -07:00
Sahitya Tummala
a7d10cf3e4 f2fs: add new idle interval timing for discard and gc paths
This helps to control the frequency of submission of discard and
GC requests independently, based on the need.

Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-19 15:55:14 -07:00
Chao Yu
6f5c2ed0a2 f2fs: split IO error injection according to RW
This patch adds to support injecting error for write IO, this can simulate
IO error like fail_make_request or dm_flakey does.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-12 13:07:32 -07:00
Chao Yu
7c1a000d46 f2fs: add SPDX license identifiers
Remove the verbose license text from f2fs files and replace them with
SPDX tags.  This does not change the license of any of the code.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-12 13:07:10 -07:00
Chengguang Xu
4cb037ec3f f2fs: surround fault_injection related option parsing using CONFIG_F2FS_FAULT_INJECTION
It's a little bit strange when fault_injection related
options fail with -EINVAL which were already disabled
from config, so surround all fault_injection related option
parsing code using CONFIG_F2FS_FAULT_INJECTION. Meanwhile,
slightly change warning message to keep consistency with
option POSIX_ACL and FS_XATTR.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-12 13:05:33 -07:00
Wang Shilong
c8e927579e f2fs: fix setattr project check upon fssetxattr ioctl
Currently, project quota could be changed by fssetxattr
ioctl, and existed permission check inode_owner_or_capable()
is obviously not enough, just think that common users could
change project id of file, that could make users to
break project quota easily.

This patch try to follow same regular of xfs project
quota:

"Project Quota ID state is only allowed to change from
within the init namespace. Enforce that restriction only
if we are trying to change the quota ID state.
Everything else is allowed in user namespaces."

Besides that, check and set project id'state should
be an atomic operation, protect whole operation with
inode lock.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Zhikang Zhang
b430f72636 f2fs: avoid sleeping under spin_lock
In the call trace below, we might sleep in function dput().

So in order to avoid sleeping under spin_lock, we remove f2fs_mark_inode_dirty_sync
from __try_update_largest_extent && __drop_largest_extent.

BUG: sleeping function called from invalid context at fs/dcache.c:796
Call trace:
	dump_backtrace+0x0/0x3f4
	show_stack+0x24/0x30
	dump_stack+0xe0/0x138
	___might_sleep+0x2a8/0x2c8
	__might_sleep+0x78/0x10c
	dput+0x7c/0x750
	block_dump___mark_inode_dirty+0x120/0x17c
	__mark_inode_dirty+0x344/0x11f0
	f2fs_mark_inode_dirty_sync+0x40/0x50
	__insert_extent_tree+0x2e0/0x2f4
	f2fs_update_extent_tree_range+0xcf4/0xde8
	f2fs_update_extent_cache+0x114/0x12c
	f2fs_update_data_blkaddr+0x40/0x50
	write_data_page+0x150/0x314
	do_write_data_page+0x648/0x2318
	__write_data_page+0xdb4/0x1640
	f2fs_write_cache_pages+0x768/0xafc
	__f2fs_write_data_pages+0x590/0x1218
	f2fs_write_data_pages+0x64/0x74
	do_writepages+0x74/0xe4
	__writeback_single_inode+0xdc/0x15f0
	writeback_sb_inodes+0x574/0xc98
	__writeback_inodes_wb+0x190/0x204
	wb_writeback+0x730/0xf14
	wb_check_old_data_flush+0x1bc/0x1c8
	wb_workfn+0x554/0xf74
	process_one_work+0x440/0x118c
	worker_thread+0xac/0x974
	kthread+0x1a0/0x1c8
	ret_from_fork+0x10/0x1c

Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
e1293bdfa0 f2fs: plug readahead IO in readdir()
Add a plug to merge readahead IO in readdir(), expecting it can
reduce bio count before submitting to block layer.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
042be0f849 f2fs: fix to do sanity check with current segment number
https://bugzilla.kernel.org/show_bug.cgi?id=200219

Reproduction way:
- mount image
- run poc code
- umount image

F2FS-fs (loop1): Bitmap was wrongly set, blk:15364
------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/segment.c:2061!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 17686 Comm: umount Tainted: G        W  O      4.18.0-rc2+ #39
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: update_sit_entry+0x459/0x4e0 [f2fs]
Code: e8 1c b5 fd ff 0f 0b 0f 0b 8b 45 e4 c7 44 24 08 9c 7a 6c f8 c7 44 24 04 bc 4a 6c f8 89 44 24 0c 8b 06 89 04 24 e8 f7 b4 fd ff <0f> 0b 8b 45 e4 0f b6 d2 89 54 24 10 c7 44 24 08 60 7a 6c f8 c7 44
EAX: 00000032 EBX: 000000f8 ECX: 00000002 EDX: 00000001
ESI: d7177000 EDI: f520fe68 EBP: d6477c6c ESP: d6477c34
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
CR0: 80050033 CR2: b7fbe000 CR3: 2a99b3c0 CR4: 000406f0
Call Trace:
 f2fs_allocate_data_block+0x124/0x580 [f2fs]
 do_write_page+0x78/0x150 [f2fs]
 f2fs_do_write_node_page+0x25/0xa0 [f2fs]
 __write_node_page+0x2bf/0x550 [f2fs]
 f2fs_sync_node_pages+0x60e/0x6d0 [f2fs]
 ? sync_inode_metadata+0x2f/0x40
 ? f2fs_write_checkpoint+0x28f/0x7d0 [f2fs]
 ? up_write+0x1e/0x80
 f2fs_write_checkpoint+0x2a9/0x7d0 [f2fs]
 ? mark_held_locks+0x5d/0x80
 ? _raw_spin_unlock_irq+0x27/0x50
 kill_f2fs_super+0x68/0x90 [f2fs]
 deactivate_locked_super+0x3d/0x70
 deactivate_super+0x40/0x60
 cleanup_mnt+0x39/0x70
 __cleanup_mnt+0x10/0x20
 task_work_run+0x81/0xa0
 exit_to_usermode_loop+0x59/0xa7
 do_fast_syscall_32+0x1f5/0x22c
 entry_SYSENTER_32+0x53/0x86
EIP: 0xb7f95c51
Code: c1 1e f7 ff ff 89 e5 8b 55 08 85 d2 8b 81 64 cd ff ff 74 02 89 02 5d c3 8b 0c 24 c3 8b 1c 24 c3 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
EAX: 00000000 EBX: 0871ab90 ECX: bfb2cd00 EDX: 00000000
ESI: 00000000 EDI: 0871ab90 EBP: 0871ab90 ESP: bfb2cd7c
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
Modules linked in: f2fs(O) crc32_generic bnep rfcomm bluetooth ecdh_generic snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq pcbc joydev aesni_intel snd_seq_device aes_i586 snd_timer crypto_simd snd cryptd soundcore mac_hid serio_raw video i2c_piix4 parport_pc ppdev lp parport hid_generic psmouse usbhid hid e1000 [last unloaded: f2fs]
---[ end trace d423f83982cfcdc5 ]---

The reason is, different log headers using the same segment, once
one log's next block address is used by another log, it will cause
panic as above.

Main area: 24 segs, 24 secs 24 zones
  - COLD  data: 0, 0, 0
  - WARM  data: 1, 1, 1
  - HOT   data: 20, 20, 20
  - Dir   dnode: 22, 22, 22
  - File   dnode: 22, 22, 22
  - Indir nodes: 21, 21, 21

So this patch adds sanity check to detect such condition to avoid
this issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
4a70e25544 f2fs: fix memory leak of percpu counter in fill_super()
In fill_super -> init_percpu_info, we should destroy percpu counter
in error path, otherwise memory allcoated for percpu counter will
leak.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
0b2103e886 f2fs: fix memory leak of write_io in fill_super()
It needs to release memory allocated for sbi->write_io in error path,
otherwise, it will cause memory leak.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chengguang Xu
313ed62a3d f2fs: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated
as ACL_NOT_CACHED in vfs function inode_init_always() and later
will be updated by calling xxx_init_acl() in specific filesystems.
Howerver, when default_acl and acl are NULL then they keep the value
of ACL_NOT_CACHED, this patch tries to cache NULL for acl/default_acl
in this case.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 13:16:03 -07:00
Chao Yu
1378752b99 f2fs: fix to flush all dirty inodes recovered in readonly fs
generic/417 reported as blow:

------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/inode.c:695!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 21697 Comm: umount Tainted: G        W  O      4.18.0-rc2+ #39
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: f2fs_evict_inode+0x556/0x580 [f2fs]
Call Trace:
 ? _raw_spin_unlock+0x2c/0x50
 evict+0xa8/0x170
 dispose_list+0x34/0x40
 evict_inodes+0x118/0x120
 generic_shutdown_super+0x41/0x100
 ? rcu_read_lock_sched_held+0x97/0xa0
 kill_block_super+0x22/0x50
 kill_f2fs_super+0x6f/0x80 [f2fs]
 deactivate_locked_super+0x3d/0x70
 deactivate_super+0x40/0x60
 cleanup_mnt+0x39/0x70
 __cleanup_mnt+0x10/0x20
 task_work_run+0x81/0xa0
 exit_to_usermode_loop+0x59/0xa7
 do_fast_syscall_32+0x1f5/0x22c
 entry_SYSENTER_32+0x53/0x86
EIP: f2fs_evict_inode+0x556/0x580 [f2fs]

It can simply reproduced with scripts:

Enable quota feature during mkfs.

Testcase1:
1. mkfs.f2fs /dev/zram0
2. mount -t f2fs /dev/zram0 /mnt/f2fs
3. xfs_io -f /mnt/f2fs/file -c "pwrite 0 4k" -c "fsync"
4. godown /mnt/f2fs
5. umount /mnt/f2fs
6. mount -t f2fs -o ro /dev/zram0 /mnt/f2fs
7. umount /mnt/f2fs

Testcase2:
1. mkfs.f2fs /dev/zram0
2. mount -t f2fs /dev/zram0 /mnt/f2fs
3. touch /mnt/f2fs/file
4. create process[pid = x] do:
	a) open /mnt/f2fs/file;
	b) unlink /mnt/f2fs/file
5. godown -f /mnt/f2fs
6. kill process[pid = x]
7. umount /mnt/f2fs
8. mount -t f2fs -o ro /dev/zram0 /mnt/f2fs
9. umount /mnt/f2fs

The reason is: during recovery, i_{c,m}time of inode will be updated, then
the inode can be set dirty w/o being tracked in sbi->inode_list[DIRTY_META]
global list, so later write_checkpoint will not flush such dirty inode into
node page.

Once umount is called, sync_filesystem() in generic_shutdown_super() will
skip syncng dirty inodes due to sb_rdonly check, leaving dirty inodes
there.

To solve this issue, during umount, add remove SB_RDONLY flag in
sb->s_flags, to make sure sync_filesystem() will not be skipped.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 13:16:03 -07:00
Yunlei He
cda9cc595f f2fs: report error if quota off error during umount
Now, we depend on fsck to ensure quota file data is ok,
so we scan whole partition if checkpoint without umount
flag. It's same for quota off error case, which may make
quota file data inconsistent.

generic/019 reports below error:

 __quota_error: 1160 callbacks suppressed
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 VFS: Busy inodes after unmount of zram1. Self-destruct in 5 seconds.  Have a nice day...

If we failed in below path due to fail to write dquot block, we will miss
to release quota inode, fix it.

- f2fs_put_super
 - f2fs_quota_off_umount
  - f2fs_quota_off
   - f2fs_quota_sync   <-- failed
   - dquot_quota_off   <-- missed to call

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 13:16:03 -07:00
Jaegeuk Kim
5ce805869c f2fs: submit bio after shutdown
Sometimes, some merged IOs could get a chance to be submitted, resulting in
system hang in shutdown test. This issues IOs all the time after shutdown.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-07 14:26:43 -07:00
Jaegeuk Kim
0ded69f632 f2fs: avoid wrong decrypted data from disk
1. Create a file in an encrypted directory
2. Do GC & drop caches
3. Read stale data before its bio for metapage was not issued yet

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:41:07 -07:00
Chao Yu
22d7ea1364 Revert "f2fs: use printk_ratelimited for f2fs_msg"
Don't limit printing log, so that we will not miss any key messages.

This reverts commit a36c106dff.

In addition, we use printk_ratelimited to avoid too many log prints.
- error injection
- discard submission failure

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:41:06 -07:00
Sahitya Tummala
abde73c718 f2fs: fix unnecessary periodic wakeup of discard thread when dev is busy
When dev is busy, discard thread wake up timeout can be aligned with the
exact time that it needs to wait for dev to come out of busy. This helps
to avoid unnecessary periodic wakeups and thus save some power.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:40:32 -07:00
Chao Yu
7d20c8abb2 f2fs: fix to avoid NULL pointer dereference on se->discard_map
https://bugzilla.kernel.org/show_bug.cgi?id=200951

These is a NULL pointer dereference issue reported in bugzilla:

Hi,
in the setup there is a SATA SSD connected to a SATA-to-USB bridge.

The disc is "Samsung SSD 850 PRO 256G" which supports TRIM.
There are four partitions:
 sda1: FAT  /boot
 sda2: F2FS /
 sda3: F2FS /home
 sda4: F2FS

The bridge is ASMT1153e which uses the "uas" driver.
There is no TRIM pass-through, so, when mounting it reports:
 mounting with "discard" option, but the device does not support discard

The USB host is USB3.0 and UASP capable. It is the one on RK3399.

Given this everything works fine, except there is no TRIM support.

In order to enable TRIM a new UDEV rule is added [1]:
 /etc/udev/rules.d/10-sata-bridge-trim.rules:
 ACTION=="add|change", ATTRS{idVendor}=="174c", ATTRS{idProduct}=="55aa", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}="unmap"
After reboot any F2FS write hangs forever and dmesg reports:
 Unable to handle kernel NULL pointer dereference

Also tested on a x86_64 system: works fine even with TRIM enabled.
 same disc
 same bridge
 different usb host controller
 different cpu architecture
 not root filesystem

Regards,
  Vicenç.

[1] Post #5 in https://bbs.archlinux.org/viewtopic.php?id=236280

 Unable to handle kernel NULL pointer dereference at virtual address 000000000000003e
 Mem abort info:
   ESR = 0x96000004
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
 user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000626e3122
 [000000000000003e] pgd=0000000000000000
 Internal error: Oops: 96000004 [#1] SMP
 Modules linked in: overlay snd_soc_hdmi_codec rc_cec dw_hdmi_i2s_audio dw_hdmi_cec snd_soc_simple_card snd_soc_simple_card_utils snd_soc_rockchip_i2s rockchip_rga snd_soc_rockchip_pcm rockchipdrm videobuf2_dma_sg v4l2_mem2mem rtc_rk808 videobuf2_memops analogix_dp videobuf2_v4l2 videobuf2_common dw_hdmi dw_wdt cec rc_core videodev drm_kms_helper media drm rockchip_thermal rockchip_saradc realtek drm_panel_orientation_quirks syscopyarea sysfillrect sysimgblt fb_sys_fops dwmac_rk stmmac_platform stmmac pwm_bl squashfs loop crypto_user gpio_keys hid_kensington
 CPU: 5 PID: 957 Comm: nvim Not tainted 4.19.0-rc1-1-ARCH #1
 Hardware name: Sapphire-RK3399 Board (DT)
 pstate: 00000005 (nzcv daif -PAN -UAO)
 pc : update_sit_entry+0x304/0x4b0
 lr : update_sit_entry+0x108/0x4b0
 sp : ffff00000ca13bd0
 x29: ffff00000ca13bd0 x28: 000000000000003e
 x27: 0000000000000020 x26: 0000000000080000
 x25: 0000000000000048 x24: ffff8000ebb85cf8
 x23: 0000000000000253 x22: 00000000ffffffff
 x21: 00000000000535f2 x20: 00000000ffffffdf
 x19: ffff8000eb9e6800 x18: ffff8000eb9e6be8
 x17: 0000000007ce6926 x16: 000000001c83ffa8
 x15: 0000000000000000 x14: ffff8000f602df90
 x13: 0000000000000006 x12: 0000000000000040
 x11: 0000000000000228 x10: 0000000000000000
 x9 : 0000000000000000 x8 : 0000000000000000
 x7 : 00000000000535f2 x6 : ffff8000ebff3440
 x5 : ffff8000ebff3440 x4 : ffff8000ebe3a6c8
 x3 : 00000000ffffffff x2 : 0000000000000020
 x1 : 0000000000000000 x0 : ffff8000eb9e5800
 Process nvim (pid: 957, stack limit = 0x0000000063a78320)
 Call trace:
  update_sit_entry+0x304/0x4b0
  f2fs_invalidate_blocks+0x98/0x140
  truncate_node+0x90/0x400
  f2fs_remove_inode_page+0xe8/0x340
  f2fs_evict_inode+0x2b0/0x408
  evict+0xe0/0x1e0
  iput+0x160/0x260
  do_unlinkat+0x214/0x298
  __arm64_sys_unlinkat+0x3c/0x68
  el0_svc_handler+0x94/0x118
  el0_svc+0x8/0xc
 Code: f9400800 b9488400 36080140 f9400f01 (387c4820)
 ---[ end trace a0f21a307118c477 ]---

The reason is it is possible to enable discard flag on block queue via
UDEV, but during mount, f2fs will initialize se->discard_map only if
this flag is set, once the flag is set after mount, f2fs may dereference
NULL pointer on se->discard_map.

So this patch does below changes to fix this issue:
- initialize and update se->discard_map all the time.
- don't clear DISCARD option if device has no QUEUE_FLAG_DISCARD flag
during mount.
- don't issue small discard on zoned block device.
- introduce some functions to enhance the readability.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Tested-by: Vicente Bergas <vicencb@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:40:31 -07:00
Chengguang Xu
1618e6e297 f2fs: add additional sanity check in f2fs_acl_from_disk()
Add additinal sanity check for irregular case(e.g. corruption).
If size of extended attribution is smaller than size of acl header,
then return -EINVAL.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:40:31 -07:00
Linus Torvalds
fe6f0ed0da f2fs-for-4.19-rc1
In this round, we've tuned f2fs to improve general performance by serializing
 block allocation and enhancing discard flows like fstrim which avoids user IO
 contention. And we've added fsync_mode=nobarrier which gives an option to user
 where it skips issuing cache_flush commands to underlying flash storage. And
 there are many bug fixes related to fuzzed images, revoked atomic writes, quota
 ops, and minor direct IO.
 
 Enhancement:
  - add fsync_mode=nobarrier which bypasses cache_flush command
  - enhance the discarding flow which avoids user IOs and issues in LBA order
  - readahead some encrypted blocks during GC
  - enable in-memory inode checksum to verify the blocks if F2FS_CHECK_FS is set
  - enhance nat_bits behavior
  - set -o discard by default
  - set REQ_RAHEAD to bio in ->readpages
 
 Bug fixes:
  - fix a corner case to corrupt atomic_writes revoking flow
  - revisit i_gc_rwsem to fix race conditions
  - fix some dio behaviors captured by xfstests
  - correct handling errors given by quota-related failures
  - add many sanity check flows to avoid fuzz test failures
  - add more error number propagation to their callers
  - fix several corner cases to continue fault injection w/ shutdown loop
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlt82U4ACgkQQBSofoJI
 UNJTLQ/+PhewnNa5tDfUgWdFnUFz3h9/NcC677l0OplOOUNxA8iSa1xamlKf/nf9
 sB5ey0I7oBF8zQGxfndhHQfi6fpfUcNMr14hm+TS/3+d54xLJmiVShD5fjNSV2vB
 Ur0xoozuQDwYF1e3QKdBQjFqaCf78VheTr3aWxyv22/Sg+PYylZJ2K8rHTB7mGPU
 UG0aRnKrP3FPRjL7Q3m0Vm6b6eZ5uNdNrFfjgn/8yuQQ9V197K8vwSbPAsR5/pOh
 miCQXyM708NgEYJRWkWmC/rDSQdU0/h/mGnJWrBrbceW62QefGOgd2jcVfmthHJa
 ZXpj+BEG5bYpCCxGxF6N+u0e28OKonCwO/uvL8YAd5icN7yXtsKzoF1CCuXxOYf1
 9K5SMylCTSyrs/+LV8CJoT2ya8w0l0w+R/txUYn8UT+4AgqU+chS2kJeXqw9tcHB
 WLFs/rnAyofWCI/8frVBmJY+zA1ZZvTqs/lmVYrtJUkiOcMTq34WICBUAEFKV452
 BM5dcu21bSIkapYispEt4Rr7o4P4HHMQ+N1i2yUZMFCz5T0RyzdybeS5THk2yVzd
 L0kxfYU+zHigNX51ez8+Z7DyDLDBp6jkD0e66x73bUK9TGPH+ZbnAL6gwLikmD3M
 +VxYl5nyW/3bxx1HdfK1Xwd4/wYMNBmdtn5NC50oZ+jpB0h8YCE=
 =zgbL
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've tuned f2fs to improve general performance by
  serializing block allocation and enhancing discard flows like fstrim
  which avoids user IO contention. And we've added fsync_mode=nobarrier
  which gives an option to user where it skips issuing cache_flush
  commands to underlying flash storage. And there are many bug fixes
  related to fuzzed images, revoked atomic writes, quota ops, and minor
  direct IO.

  Enhancements:
   - add fsync_mode=nobarrier which bypasses cache_flush command
   - enhance the discarding flow which avoids user IOs and issues in
     LBA order
   - readahead some encrypted blocks during GC
   - enable in-memory inode checksum to verify the blocks if
     F2FS_CHECK_FS is set
   - enhance nat_bits behavior
   - set -o discard by default
   - set REQ_RAHEAD to bio in ->readpages

  Bug fixes:
   - fix a corner case to corrupt atomic_writes revoking flow
   - revisit i_gc_rwsem to fix race conditions
   - fix some dio behaviors captured by xfstests
   - correct handling errors given by quota-related failures
   - add many sanity check flows to avoid fuzz test failures
   - add more error number propagation to their callers
   - fix several corner cases to continue fault injection w/ shutdown
     loop"

* tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (89 commits)
  f2fs: readahead encrypted block during GC
  f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc
  f2fs: fix performance issue observed with multi-thread sequential read
  f2fs: fix to skip verifying block address for non-regular inode
  f2fs: rework fault injection handling to avoid a warning
  f2fs: support fault_type mount option
  f2fs: fix to return success when trimming meta area
  f2fs: fix use-after-free of dicard command entry
  f2fs: support discard submission error injection
  f2fs: split discard command in prior to block layer
  f2fs: wake up gc thread immediately when gc_urgent is set
  f2fs: fix incorrect range->len in f2fs_trim_fs()
  f2fs: refresh recent accessed nat entry in lru list
  f2fs: fix avoid race between truncate and background GC
  f2fs: avoid race between zero_range and background GC
  f2fs: fix to do sanity check with block address in main area v2
  f2fs: fix to do sanity check with inline flags
  f2fs: fix to reset i_gc_failures correctly
  f2fs: fix invalid memory access
  f2fs: fix to avoid broken of dnode block list
  ...
2018-08-22 13:29:39 -07:00
Chao Yu
6aa58d8ad2 f2fs: readahead encrypted block during GC
During GC, for each encrypted block, we will read block synchronously
into meta page, and then submit it into current cold data log area.

So this block read model with 4k granularity can make poor performance,
like migrating non-encrypted block, let's readahead encrypted block
as well to improve migration performance.

To implement this, we choose meta page that its index is old block
address of the encrypted block, and readahead ciphertext into this
page, later, if readaheaded page is still updated, we will load its
data into target meta page, and submit the write IO.

Note that for OPU, truncation, deletion, we need to invalid meta
page after we invalid old block address, to make sure we won't load
invalid data from target meta page during encrypted block migration.

for ((i = 0; i < 1000; i++))
do {
        xfs_io -f /mnt/f2fs/dir/$i -c "pwrite 0 128k" -c "fsync";
} done

for ((i = 0; i < 1000; i+=2))
do {
        rm /mnt/f2fs/dir/$i;
} done

ret = ioctl(fd, F2FS_IOC_GARBAGE_COLLECT, 0);

Before:
              gc-6549  [001] d..1 214682.212797: block_rq_insert: 8,32 RA 32768 () 786400 + 64 [gc]
              gc-6549  [001] d..1 214682.212802: block_unplug: [gc] 1
              gc-6549  [001] .... 214682.213892: block_bio_queue: 8,32 R 67494144 + 8 [gc]
              gc-6549  [001] .... 214682.213899: block_getrq: 8,32 R 67494144 + 8 [gc]
              gc-6549  [001] .... 214682.213902: block_plug: [gc]
              gc-6549  [001] d..1 214682.213905: block_rq_insert: 8,32 R 4096 () 67494144 + 8 [gc]
              gc-6549  [001] d..1 214682.213908: block_unplug: [gc] 1
              gc-6549  [001] .... 214682.226405: block_bio_queue: 8,32 R 67494152 + 8 [gc]
              gc-6549  [001] .... 214682.226412: block_getrq: 8,32 R 67494152 + 8 [gc]
              gc-6549  [001] .... 214682.226414: block_plug: [gc]
              gc-6549  [001] d..1 214682.226417: block_rq_insert: 8,32 R 4096 () 67494152 + 8 [gc]
              gc-6549  [001] d..1 214682.226420: block_unplug: [gc] 1
              gc-6549  [001] .... 214682.226904: block_bio_queue: 8,32 R 67494160 + 8 [gc]
              gc-6549  [001] .... 214682.226910: block_getrq: 8,32 R 67494160 + 8 [gc]
              gc-6549  [001] .... 214682.226911: block_plug: [gc]
              gc-6549  [001] d..1 214682.226914: block_rq_insert: 8,32 R 4096 () 67494160 + 8 [gc]
              gc-6549  [001] d..1 214682.226916: block_unplug: [gc] 1

After:
              gc-5678  [003] .... 214327.025906: block_bio_queue: 8,32 R 67493824 + 8 [gc]
              gc-5678  [003] .... 214327.025908: block_bio_backmerge: 8,32 R 67493824 + 8 [gc]
              gc-5678  [003] .... 214327.025915: block_bio_queue: 8,32 R 67493832 + 8 [gc]
              gc-5678  [003] .... 214327.025917: block_bio_backmerge: 8,32 R 67493832 + 8 [gc]
              gc-5678  [003] .... 214327.025923: block_bio_queue: 8,32 R 67493840 + 8 [gc]
              gc-5678  [003] .... 214327.025925: block_bio_backmerge: 8,32 R 67493840 + 8 [gc]
              gc-5678  [003] .... 214327.025932: block_bio_queue: 8,32 R 67493848 + 8 [gc]
              gc-5678  [003] .... 214327.025934: block_bio_backmerge: 8,32 R 67493848 + 8 [gc]
              gc-5678  [003] .... 214327.025941: block_bio_queue: 8,32 R 67493856 + 8 [gc]
              gc-5678  [003] .... 214327.025943: block_bio_backmerge: 8,32 R 67493856 + 8 [gc]
              gc-5678  [003] .... 214327.025953: block_bio_queue: 8,32 R 67493864 + 8 [gc]
              gc-5678  [003] .... 214327.025955: block_bio_backmerge: 8,32 R 67493864 + 8 [gc]
              gc-5678  [003] .... 214327.025962: block_bio_queue: 8,32 R 67493872 + 8 [gc]
              gc-5678  [003] .... 214327.025964: block_bio_backmerge: 8,32 R 67493872 + 8 [gc]
              gc-5678  [003] .... 214327.025970: block_bio_queue: 8,32 R 67493880 + 8 [gc]
              gc-5678  [003] .... 214327.025972: block_bio_backmerge: 8,32 R 67493880 + 8 [gc]
              gc-5678  [003] .... 214327.026000: block_bio_queue: 8,32 WS 34123776 + 2048 [gc]
              gc-5678  [003] .... 214327.026019: block_getrq: 8,32 WS 34123776 + 2048 [gc]
              gc-5678  [003] d..1 214327.026021: block_rq_insert: 8,32 R 131072 () 67493632 + 256 [gc]
              gc-5678  [003] d..1 214327.026023: block_unplug: [gc] 1
              gc-5678  [003] d..1 214327.026026: block_rq_issue: 8,32 R 131072 () 67493632 + 256 [gc]
              gc-5678  [003] .... 214327.026046: block_plug: [gc]

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-20 23:13:42 -07:00
Jaegeuk Kim
6f8d445506 f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc
The f2fs_gc() called by f2fs_balance_fs() requires to be called outside of
fi->i_gc_rwsem[WRITE], since f2fs_gc() can try to grab it in a loop.

If it hits the miximum retrials in GC, let's give a chance to release
gc_mutex for a short time in order not to go into live lock in the worst
case.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-20 23:13:42 -07:00
Jaegeuk Kim
853137cef4 f2fs: fix performance issue observed with multi-thread sequential read
This reverts the commit - "b93f771 - f2fs: remove writepages lock"
to fix the drop in sequential read throughput.

Test: ./tiotest -t 32 -d /data/tio_tmp -f 32 -b 524288 -k 1 -k 3 -L
device: UFS

Before -
read throughput: 185 MB/s
total read requests: 85177 (of these ~80000 are 4KB size requests).
total write requests: 2546 (of these ~2208 requests are written in 512KB).

After -
read throughput: 758 MB/s
total read requests: 2417 (of these ~2042 are 512KB reads).
total write requests: 2701 (of these ~2034 requests are written in 512KB).

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-20 23:13:42 -07:00
Jens Axboe
74c8164e1c mpage: mpage_readpages() should submit IO as read-ahead
a_ops->readpages() is only ever used for read-ahead, yet we don't flag
the IO being submitted as such.  Fix that up.  Any file system that uses
mpage_readpages() as its ->readpages() implementation will now get this
right.

Since we're passing in whether the IO is read-ahead or not, we don't
need to pass in the 'gfp' separately, as it is dependent on the IO being
read-ahead.  Kill off that member.

Add some documentation notes on ->readpages() being purely for
read-ahead.

Link: http://lkml.kernel.org/r/20180621010725.17813-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Chao Yu
dda9f4b9ca f2fs: fix to skip verifying block address for non-regular inode
generic/184 1s ... [failed, exit status 1]- output mismatch
    --- tests/generic/184.out	2015-01-11 16:52:27.643681072 +0800
     QA output created by 184 - silence is golden
    +rm: cannot remove '/mnt/f2fs/null': Bad address
    +mknod: '/mnt/f2fs/null': Bad address
    +chmod: cannot access '/mnt/f2fs/null': Bad address
    +./tests/generic/184: line 36: /mnt/f2fs/null: Bad address
    ...

F2FS-fs (zram0): access invalid blkaddr:259
EIP: f2fs_is_valid_blkaddr+0x14b/0x1b0 [f2fs]
 f2fs_iget+0x927/0x1010 [f2fs]
 f2fs_lookup+0x26e/0x630 [f2fs]
 __lookup_slow+0xb3/0x140
 lookup_slow+0x31/0x50
 walk_component+0x185/0x1f0
 path_lookupat+0x51/0x190
 filename_lookup+0x7f/0x140
 user_path_at_empty+0x36/0x40
 vfs_statx+0x61/0xc0
 __do_sys_stat64+0x29/0x40
 sys_stat64+0x13/0x20
 do_fast_syscall_32+0xaa/0x22c
 entry_SYSENTER_32+0x53/0x86

In f2fs_iget(), we will check inode's first block address, if it is valid,
we will set FI_FIRST_BLOCK_WRITTEN flag in inode.

But we should only do this for regular inode, otherwise, like special
inode, i_addr[0] is used for storing device info instead of block address,
it will fail checking flow obviously.

So for non-regular inode, let's skip verifying address and setting flag.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-14 10:29:11 -07:00
Arnd Bergmann
7fa750a163 f2fs: rework fault injection handling to avoid a warning
When CONFIG_F2FS_FAULT_INJECTION is disabled, we get a warning about an
unused label:

fs/f2fs/segment.c: In function '__submit_discard_cmd':
fs/f2fs/segment.c:1059:1: error: label 'submit' defined but not used [-Werror=unused-label]

This could be fixed by adding another #ifdef around it, but the more
reliable way of doing this seems to be to remove the other #ifdefs
where that is easily possible.

By defining time_to_inject() as a trivial stub, most of the checks for
CONFIG_F2FS_FAULT_INJECTION can go away. This also leads to nicer
formatting of the code.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-14 09:49:15 -07:00
Chao Yu
d494500a70 f2fs: support fault_type mount option
Previously, once fault injection is on, by default, all kind of faults
will be injected to f2fs, if we want to trigger single or specified
combined type during the test, we need to configure sysfs entry, it will
be a little inconvenient to integrate sysfs configuring into testsuit,
such as xfstest.

So this patch introduces a new mount option 'fault_type' to assist old
option 'fault_injection', with these two mount options, we can specify
any fault rate/type at mount-time.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
3f16ecd950 f2fs: fix to return success when trimming meta area
generic/251
    --- tests/generic/251.out	2016-05-03 20:20:11.381899000 +0800
     QA output created by 251
     Running the test: done.
    +fstrim: /mnt/scratch_f2fs: FITRIM ioctl failed: Invalid argument
    +fstrim: /mnt/scratch_f2fs: FITRIM ioctl failed: Invalid argument
    +fstrim: /mnt/scratch_f2fs: FITRIM ioctl failed: Invalid argument
    +fstrim: /mnt/scratch_f2fs: FITRIM ioctl failed: Invalid argument
    +fstrim: /mnt/scratch_f2fs: FITRIM ioctl failed: Invalid argument
    ...
Ran: generic/251
Failures: generic/251

The reason is coverage of fstrim locates in meta area, previously we
just return -EINVAL for such case, making generic/251 failed, to fix
this problem, let's relieve restriction to return success with no
block discarded.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
6b9cb1242c f2fs: fix use-after-free of dicard command entry
As Dan Carpenter reported:

The patch 20ee438232: "f2fs: issue small discard by LBA order" from
Jul 8, 2018, leads to the following Smatch warning:

	fs/f2fs/segment.c:1277 __issue_discard_cmd_orderly()
	warn: 'dc' was already freed.

See also:
fs/f2fs/segment.c:2550 __issue_discard_cmd_range() warn: 'dc' was already freed.

In order to fix this issue, let's get error from __submit_discard_cmd(),
and release current discard command after we referenced next one.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
b83dcfe671 f2fs: support discard submission error injection
This patch adds to support discard submission error injection for testing
error handling of __submit_discard_cmd().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
35ec7d5748 f2fs: split discard command in prior to block layer
Some devices has small max_{hw,}discard_sectors, so that in
__blkdev_issue_discard(), one big size discard bio can be split
into multiple small size discard bios, result in heavy load in IO
scheduler and device, which can hang other sync IO for long time.

Now, f2fs is trying to control discard commands more elaboratively,
in order to make less conflict in between discard IO and user IO
to enhance application's performance, so in this patch, we will
split discard bio in f2fs in prior to in block layer to reduce
issuing multiple discard bios in a short time.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Sheng Yong
a690efffd1 f2fs: wake up gc thread immediately when gc_urgent is set
Fixes: 5b0e95398e ("f2fs: introduce sbi->gc_mode to determine the policy")
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
6eae269461 f2fs: fix incorrect range->len in f2fs_trim_fs()
generic/260 reported below error:

     [+] Default length with start set (should succeed)
     [+] Length beyond the end of fs (should succeed)
     [+] Length beyond the end of fs with start set (should succeed)
    +./tests/generic/260: line 94: [: 18446744073709551615: integer expression expected
    +./tests/generic/260: line 104: [: 18446744073709551615: integer expression expected
     Test done
    ...

In f2fs_trim_fs(), if there is no discard being trimmed, we need to correct
range->len before return.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
2296915808 f2fs: refresh recent accessed nat entry in lru list
Introduce nat_list_lock to protect nm_i->nat_entries list, and manage
it as a LRU list, refresh location for therein recent accessed entries
in the list.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
a33c150237 f2fs: fix avoid race between truncate and background GC
Thread A				Background GC
- f2fs_setattr isize to 0
 - truncate_setsize
					- gc_data_segment
					 - f2fs_get_read_data_page page #0
					  - set_page_dirty
					  - set_cold_data
 - f2fs_truncate

- f2fs_setattr isize to 4k
- read 4k <--- hit data in cached page #0

Above race condition can cause read out invalid data in a truncated
page, fix it by i_gc_rwsem[WRITE] lock.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
c7079853c8 f2fs: avoid race between zero_range and background GC
Thread A				Background GC
- f2fs_zero_range
 - truncate_pagecache_range
					- gc_data_segment
					 - get_read_data_page
					  - move_data_page
					   - set_page_dirty
					   - set_cold_data
 - f2fs_do_zero_range
  - dn->data_blkaddr = NEW_ADDR;
  - f2fs_set_data_blkaddr

Actually, we don't need to set dirty & checked flag on the page, since
all valid data in the page should be zeroed by zero_range().

Use i_gc_rwsem[WRITE] to avoid such race condition.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
91291e9998 f2fs: fix to do sanity check with block address in main area v2
This patch adds f2fs_is_valid_blkaddr() in below functions to do sanity
check with block address to avoid pentential panic:
- f2fs_grab_read_bio()
- __written_first_block()

https://bugzilla.kernel.org/show_bug.cgi?id=200465

- Reproduce

- POC (poc.c)
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/mount.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/xattr.h>

    #include <dirent.h>
    #include <errno.h>
    #include <error.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #include <linux/falloc.h>
    #include <linux/loop.h>

    static void activity(char *mpoint) {

      char *xattr;
      int err;

      err = asprintf(&xattr, "%s/foo/bar/xattr", mpoint);

      char buf2[113];
      memset(buf2, 0, sizeof(buf2));
      listxattr(xattr, buf2, sizeof(buf2));

    }

    int main(int argc, char *argv[]) {
      activity(argv[1]);
      return 0;
    }

- kernel message
[  844.718738] F2FS-fs (loop0): Mounted with checkpoint version = 2
[  846.430929] F2FS-fs (loop0): access invalid blkaddr:1024
[  846.431058] WARNING: CPU: 1 PID: 1249 at fs/f2fs/checkpoint.c:154 f2fs_is_valid_blkaddr+0x10f/0x160
[  846.431059] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd input_leds joydev soundcore serio_raw i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear qxl ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops pcbc drm 8139too aesni_intel 8139cp floppy psmouse mii aes_x86_64 crypto_simd pata_acpi cryptd glue_helper
[  846.431310] CPU: 1 PID: 1249 Comm: a.out Not tainted 4.18.0-rc3+ #1
[  846.431312] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  846.431315] RIP: 0010:f2fs_is_valid_blkaddr+0x10f/0x160
[  846.431316] Code: 00 eb ed 31 c0 83 fa 05 75 ae 48 83 ec 08 48 8b 3f 89 f1 48 c7 c2 fc 0b 0f 8b 48 c7 c6 8b d7 09 8b 88 44 24 07 e8 61 8b ff ff <0f> 0b 0f b6 44 24 07 48 83 c4 08 eb 81 4c 8b 47 10 8b 8f 38 04 00
[  846.431347] RSP: 0018:ffff961c414a7bc0 EFLAGS: 00010282
[  846.431349] RAX: 0000000000000000 RBX: ffffc5f787b8ea80 RCX: 0000000000000000
[  846.431350] RDX: 0000000000000000 RSI: ffff89dfffd165d8 RDI: ffff89dfffd165d8
[  846.431351] RBP: ffff961c414a7c20 R08: 0000000000000001 R09: 0000000000000248
[  846.431353] R10: 0000000000000000 R11: 0000000000000248 R12: 0000000000000007
[  846.431369] R13: ffff89dff5492800 R14: ffff89dfae3aa000 R15: ffff89dff4ff88d0
[  846.431372] FS:  00007f882e2fb700(0000) GS:ffff89dfffd00000(0000) knlGS:0000000000000000
[  846.431373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.431374] CR2: 0000000001a88008 CR3: 00000001eb572000 CR4: 00000000000006e0
[  846.431384] Call Trace:
[  846.431426]  f2fs_iget+0x6f4/0xe70
[  846.431430]  ? f2fs_find_entry+0x71/0x90
[  846.431432]  f2fs_lookup+0x1aa/0x390
[  846.431452]  __lookup_slow+0x97/0x150
[  846.431459]  lookup_slow+0x35/0x50
[  846.431462]  walk_component+0x1c6/0x470
[  846.431479]  ? memcg_kmem_charge_memcg+0x70/0x90
[  846.431488]  ? page_add_file_rmap+0x13/0x200
[  846.431491]  path_lookupat+0x76/0x230
[  846.431501]  ? __alloc_pages_nodemask+0xfc/0x280
[  846.431504]  filename_lookup+0xb8/0x1a0
[  846.431534]  ? _cond_resched+0x16/0x40
[  846.431541]  ? kmem_cache_alloc+0x160/0x1d0
[  846.431549]  ? path_listxattr+0x41/0xa0
[  846.431551]  path_listxattr+0x41/0xa0
[  846.431570]  do_syscall_64+0x55/0x100
[  846.431583]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  846.431607] RIP: 0033:0x7f882de1c0d7
[  846.431607] Code: f0 ff ff 73 01 c3 48 8b 0d be dd 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 c2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 dd 2b 00 f7 d8 64 89 01 48
[  846.431639] RSP: 002b:00007ffe8e66c238 EFLAGS: 00000202 ORIG_RAX: 00000000000000c2
[  846.431641] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f882de1c0d7
[  846.431642] RDX: 0000000000000071 RSI: 00007ffe8e66c280 RDI: 0000000001a880c0
[  846.431643] RBP: 00007ffe8e66c300 R08: 0000000001a88010 R09: 0000000000000000
[  846.431645] R10: 00000000000001ab R11: 0000000000000202 R12: 0000000000400550
[  846.431646] R13: 00007ffe8e66c400 R14: 0000000000000000 R15: 0000000000000000
[  846.431648] ---[ end trace abca54df39d14f5c ]---
[  846.431651] F2FS-fs (loop0): invalid blkaddr: 1024, type: 5, run fsck to fix.
[  846.431762] WARNING: CPU: 1 PID: 1249 at fs/f2fs/f2fs.h:2697 f2fs_iget+0xd17/0xe70
[  846.431763] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd input_leds joydev soundcore serio_raw i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear qxl ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops pcbc drm 8139too aesni_intel 8139cp floppy psmouse mii aes_x86_64 crypto_simd pata_acpi cryptd glue_helper
[  846.431797] CPU: 1 PID: 1249 Comm: a.out Tainted: G        W         4.18.0-rc3+ #1
[  846.431798] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  846.431800] RIP: 0010:f2fs_iget+0xd17/0xe70
[  846.431801] Code: ff ff 48 63 d8 e9 e1 f6 ff ff 48 8b 45 c8 41 b8 05 00 00 00 48 c7 c2 d8 e8 0e 8b 48 c7 c6 1d b0 0a 8b 48 8b 38 e8 f9 b4 00 00 <0f> 0b 48 8b 45 c8 f0 80 48 48 04 e9 d8 f9 ff ff 0f 0b 48 8b 43 18
[  846.431832] RSP: 0018:ffff961c414a7bd0 EFLAGS: 00010282
[  846.431834] RAX: 0000000000000000 RBX: ffffc5f787b8ea80 RCX: 0000000000000006
[  846.431835] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff89dfffd165d0
[  846.431836] RBP: ffff961c414a7c20 R08: 0000000000000000 R09: 0000000000000273
[  846.431837] R10: 0000000000000000 R11: ffff89dfad50ca60 R12: 0000000000000007
[  846.431838] R13: ffff89dff5492800 R14: ffff89dfae3aa000 R15: ffff89dff4ff88d0
[  846.431840] FS:  00007f882e2fb700(0000) GS:ffff89dfffd00000(0000) knlGS:0000000000000000
[  846.431841] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.431842] CR2: 0000000001a88008 CR3: 00000001eb572000 CR4: 00000000000006e0
[  846.431846] Call Trace:
[  846.431850]  ? f2fs_find_entry+0x71/0x90
[  846.431853]  f2fs_lookup+0x1aa/0x390
[  846.431856]  __lookup_slow+0x97/0x150
[  846.431858]  lookup_slow+0x35/0x50
[  846.431874]  walk_component+0x1c6/0x470
[  846.431878]  ? memcg_kmem_charge_memcg+0x70/0x90
[  846.431880]  ? page_add_file_rmap+0x13/0x200
[  846.431882]  path_lookupat+0x76/0x230
[  846.431884]  ? __alloc_pages_nodemask+0xfc/0x280
[  846.431886]  filename_lookup+0xb8/0x1a0
[  846.431890]  ? _cond_resched+0x16/0x40
[  846.431891]  ? kmem_cache_alloc+0x160/0x1d0
[  846.431894]  ? path_listxattr+0x41/0xa0
[  846.431896]  path_listxattr+0x41/0xa0
[  846.431898]  do_syscall_64+0x55/0x100
[  846.431901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  846.431902] RIP: 0033:0x7f882de1c0d7
[  846.431903] Code: f0 ff ff 73 01 c3 48 8b 0d be dd 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 c2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 dd 2b 00 f7 d8 64 89 01 48
[  846.431934] RSP: 002b:00007ffe8e66c238 EFLAGS: 00000202 ORIG_RAX: 00000000000000c2
[  846.431936] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f882de1c0d7
[  846.431937] RDX: 0000000000000071 RSI: 00007ffe8e66c280 RDI: 0000000001a880c0
[  846.431939] RBP: 00007ffe8e66c300 R08: 0000000001a88010 R09: 0000000000000000
[  846.431940] R10: 00000000000001ab R11: 0000000000000202 R12: 0000000000400550
[  846.431941] R13: 00007ffe8e66c400 R14: 0000000000000000 R15: 0000000000000000
[  846.431943] ---[ end trace abca54df39d14f5d ]---
[  846.432033] F2FS-fs (loop0): access invalid blkaddr:1024
[  846.432051] WARNING: CPU: 1 PID: 1249 at fs/f2fs/checkpoint.c:154 f2fs_is_valid_blkaddr+0x10f/0x160
[  846.432051] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd input_leds joydev soundcore serio_raw i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear qxl ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops pcbc drm 8139too aesni_intel 8139cp floppy psmouse mii aes_x86_64 crypto_simd pata_acpi cryptd glue_helper
[  846.432085] CPU: 1 PID: 1249 Comm: a.out Tainted: G        W         4.18.0-rc3+ #1
[  846.432086] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  846.432089] RIP: 0010:f2fs_is_valid_blkaddr+0x10f/0x160
[  846.432089] Code: 00 eb ed 31 c0 83 fa 05 75 ae 48 83 ec 08 48 8b 3f 89 f1 48 c7 c2 fc 0b 0f 8b 48 c7 c6 8b d7 09 8b 88 44 24 07 e8 61 8b ff ff <0f> 0b 0f b6 44 24 07 48 83 c4 08 eb 81 4c 8b 47 10 8b 8f 38 04 00
[  846.432120] RSP: 0018:ffff961c414a7900 EFLAGS: 00010286
[  846.432122] RAX: 0000000000000000 RBX: 0000000000000400 RCX: 0000000000000006
[  846.432123] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff89dfffd165d0
[  846.432124] RBP: ffff89dff5492800 R08: 0000000000000001 R09: 000000000000029d
[  846.432125] R10: ffff961c414a7820 R11: 000000000000029d R12: 0000000000000400
[  846.432126] R13: 0000000000000000 R14: ffff89dff4ff88d0 R15: 0000000000000000
[  846.432128] FS:  00007f882e2fb700(0000) GS:ffff89dfffd00000(0000) knlGS:0000000000000000
[  846.432130] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.432131] CR2: 0000000001a88008 CR3: 00000001eb572000 CR4: 00000000000006e0
[  846.432135] Call Trace:
[  846.432151]  f2fs_wait_on_block_writeback+0x20/0x110
[  846.432158]  f2fs_grab_read_bio+0xbc/0xe0
[  846.432161]  f2fs_submit_page_read+0x21/0x280
[  846.432163]  f2fs_get_read_data_page+0xb7/0x3c0
[  846.432165]  f2fs_get_lock_data_page+0x29/0x1e0
[  846.432167]  f2fs_get_new_data_page+0x148/0x550
[  846.432170]  f2fs_add_regular_entry+0x1d2/0x550
[  846.432178]  ? __switch_to+0x12f/0x460
[  846.432181]  f2fs_add_dentry+0x6a/0xd0
[  846.432184]  f2fs_do_add_link+0xe9/0x140
[  846.432186]  __recover_dot_dentries+0x260/0x280
[  846.432189]  f2fs_lookup+0x343/0x390
[  846.432193]  __lookup_slow+0x97/0x150
[  846.432195]  lookup_slow+0x35/0x50
[  846.432208]  walk_component+0x1c6/0x470
[  846.432212]  ? memcg_kmem_charge_memcg+0x70/0x90
[  846.432215]  ? page_add_file_rmap+0x13/0x200
[  846.432217]  path_lookupat+0x76/0x230
[  846.432219]  ? __alloc_pages_nodemask+0xfc/0x280
[  846.432221]  filename_lookup+0xb8/0x1a0
[  846.432224]  ? _cond_resched+0x16/0x40
[  846.432226]  ? kmem_cache_alloc+0x160/0x1d0
[  846.432228]  ? path_listxattr+0x41/0xa0
[  846.432230]  path_listxattr+0x41/0xa0
[  846.432233]  do_syscall_64+0x55/0x100
[  846.432235]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  846.432237] RIP: 0033:0x7f882de1c0d7
[  846.432237] Code: f0 ff ff 73 01 c3 48 8b 0d be dd 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 c2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 dd 2b 00 f7 d8 64 89 01 48
[  846.432269] RSP: 002b:00007ffe8e66c238 EFLAGS: 00000202 ORIG_RAX: 00000000000000c2
[  846.432271] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f882de1c0d7
[  846.432272] RDX: 0000000000000071 RSI: 00007ffe8e66c280 RDI: 0000000001a880c0
[  846.432273] RBP: 00007ffe8e66c300 R08: 0000000001a88010 R09: 0000000000000000
[  846.432274] R10: 00000000000001ab R11: 0000000000000202 R12: 0000000000400550
[  846.432275] R13: 00007ffe8e66c400 R14: 0000000000000000 R15: 0000000000000000
[  846.432277] ---[ end trace abca54df39d14f5e ]---
[  846.432279] F2FS-fs (loop0): invalid blkaddr: 1024, type: 5, run fsck to fix.
[  846.432376] WARNING: CPU: 1 PID: 1249 at fs/f2fs/f2fs.h:2697 f2fs_wait_on_block_writeback+0xb1/0x110
[  846.432376] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd input_leds joydev soundcore serio_raw i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear qxl ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops pcbc drm 8139too aesni_intel 8139cp floppy psmouse mii aes_x86_64 crypto_simd pata_acpi cryptd glue_helper
[  846.432410] CPU: 1 PID: 1249 Comm: a.out Tainted: G        W         4.18.0-rc3+ #1
[  846.432411] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  846.432413] RIP: 0010:f2fs_wait_on_block_writeback+0xb1/0x110
[  846.432414] Code: 66 90 f0 ff 4b 34 74 59 5b 5d c3 48 8b 7d 00 41 b8 05 00 00 00 89 d9 48 c7 c2 d8 e8 0e 8b 48 c7 c6 1d b0 0a 8b e8 df bc fd ff <0f> 0b f0 80 4d 48 04 e9 67 ff ff ff 48 8b 03 48 c1 e8 37 83 e0 07
[  846.432445] RSP: 0018:ffff961c414a7910 EFLAGS: 00010286
[  846.432447] RAX: 0000000000000000 RBX: 0000000000000400 RCX: 0000000000000006
[  846.432448] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff89dfffd165d0
[  846.432449] RBP: ffff89dff5492800 R08: 0000000000000000 R09: 00000000000002d1
[  846.432450] R10: ffff961c414a7820 R11: ffff89dfad50cf80 R12: 0000000000000400
[  846.432451] R13: 0000000000000000 R14: ffff89dff4ff88d0 R15: 0000000000000000
[  846.432453] FS:  00007f882e2fb700(0000) GS:ffff89dfffd00000(0000) knlGS:0000000000000000
[  846.432454] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.432455] CR2: 0000000001a88008 CR3: 00000001eb572000 CR4: 00000000000006e0
[  846.432459] Call Trace:
[  846.432463]  f2fs_grab_read_bio+0xbc/0xe0
[  846.432464]  f2fs_submit_page_read+0x21/0x280
[  846.432466]  f2fs_get_read_data_page+0xb7/0x3c0
[  846.432468]  f2fs_get_lock_data_page+0x29/0x1e0
[  846.432470]  f2fs_get_new_data_page+0x148/0x550
[  846.432473]  f2fs_add_regular_entry+0x1d2/0x550
[  846.432475]  ? __switch_to+0x12f/0x460
[  846.432477]  f2fs_add_dentry+0x6a/0xd0
[  846.432480]  f2fs_do_add_link+0xe9/0x140
[  846.432483]  __recover_dot_dentries+0x260/0x280
[  846.432485]  f2fs_lookup+0x343/0x390
[  846.432488]  __lookup_slow+0x97/0x150
[  846.432490]  lookup_slow+0x35/0x50
[  846.432505]  walk_component+0x1c6/0x470
[  846.432509]  ? memcg_kmem_charge_memcg+0x70/0x90
[  846.432511]  ? page_add_file_rmap+0x13/0x200
[  846.432513]  path_lookupat+0x76/0x230
[  846.432515]  ? __alloc_pages_nodemask+0xfc/0x280
[  846.432517]  filename_lookup+0xb8/0x1a0
[  846.432520]  ? _cond_resched+0x16/0x40
[  846.432522]  ? kmem_cache_alloc+0x160/0x1d0
[  846.432525]  ? path_listxattr+0x41/0xa0
[  846.432526]  path_listxattr+0x41/0xa0
[  846.432529]  do_syscall_64+0x55/0x100
[  846.432531]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  846.432533] RIP: 0033:0x7f882de1c0d7
[  846.432533] Code: f0 ff ff 73 01 c3 48 8b 0d be dd 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 c2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 dd 2b 00 f7 d8 64 89 01 48
[  846.432565] RSP: 002b:00007ffe8e66c238 EFLAGS: 00000202 ORIG_RAX: 00000000000000c2
[  846.432567] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f882de1c0d7
[  846.432568] RDX: 0000000000000071 RSI: 00007ffe8e66c280 RDI: 0000000001a880c0
[  846.432569] RBP: 00007ffe8e66c300 R08: 0000000001a88010 R09: 0000000000000000
[  846.432570] R10: 00000000000001ab R11: 0000000000000202 R12: 0000000000400550
[  846.432571] R13: 00007ffe8e66c400 R14: 0000000000000000 R15: 0000000000000000
[  846.432573] ---[ end trace abca54df39d14f5f ]---
[  846.434280] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  846.434424] PGD 80000001ebd3a067 P4D 80000001ebd3a067 PUD 1eb1ae067 PMD 0
[  846.434551] Oops: 0000 [#1] SMP PTI
[  846.434697] CPU: 0 PID: 44 Comm: kworker/u5:0 Tainted: G        W         4.18.0-rc3+ #1
[  846.434805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  846.435000] Workqueue: fscrypt_read_queue decrypt_work
[  846.435174] RIP: 0010:fscrypt_do_page_crypto+0x6e/0x2d0
[  846.435351] Code: 00 65 48 8b 04 25 28 00 00 00 48 89 84 24 88 00 00 00 31 c0 e8 43 c2 e0 ff 49 8b 86 48 02 00 00 85 ed c7 44 24 70 00 00 00 00 <48> 8b 58 08 0f 84 14 02 00 00 48 8b 78 10 48 8b 0c 24 48 c7 84 24
[  846.435696] RSP: 0018:ffff961c40f9bd60 EFLAGS: 00010206
[  846.435870] RAX: 0000000000000000 RBX: ffffc5f787719b80 RCX: ffffc5f787719b80
[  846.436051] RDX: ffffffff8b9f4b88 RSI: ffffffff8b0ae622 RDI: ffff961c40f9bdb8
[  846.436261] RBP: 0000000000001000 R08: ffffc5f787719b80 R09: 0000000000001000
[  846.436433] R10: 0000000000000018 R11: fefefefefefefeff R12: ffffc5f787719b80
[  846.436562] R13: ffffc5f787719b80 R14: ffff89dff4ff88d0 R15: 0ffff89dfaddee60
[  846.436658] FS:  0000000000000000(0000) GS:ffff89dfffc00000(0000) knlGS:0000000000000000
[  846.436758] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.436898] CR2: 0000000000000008 CR3: 00000001eddd0000 CR4: 00000000000006f0
[  846.437001] Call Trace:
[  846.437181]  ? check_preempt_wakeup+0xf2/0x230
[  846.437276]  ? check_preempt_curr+0x7c/0x90
[  846.437370]  fscrypt_decrypt_page+0x48/0x4d
[  846.437466]  __fscrypt_decrypt_bio+0x5b/0x90
[  846.437542]  decrypt_work+0x12/0x20
[  846.437651]  process_one_work+0x15e/0x3d0
[  846.437740]  worker_thread+0x4c/0x440
[  846.437848]  kthread+0xf8/0x130
[  846.437938]  ? rescuer_thread+0x350/0x350
[  846.438022]  ? kthread_associate_blkcg+0x90/0x90
[  846.438117]  ret_from_fork+0x35/0x40
[  846.438201] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd input_leds joydev soundcore serio_raw i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear qxl ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops pcbc drm 8139too aesni_intel 8139cp floppy psmouse mii aes_x86_64 crypto_simd pata_acpi cryptd glue_helper
[  846.438653] CR2: 0000000000000008
[  846.438713] ---[ end trace abca54df39d14f60 ]---
[  846.438796] RIP: 0010:fscrypt_do_page_crypto+0x6e/0x2d0
[  846.438844] Code: 00 65 48 8b 04 25 28 00 00 00 48 89 84 24 88 00 00 00 31 c0 e8 43 c2 e0 ff 49 8b 86 48 02 00 00 85 ed c7 44 24 70 00 00 00 00 <48> 8b 58 08 0f 84 14 02 00 00 48 8b 78 10 48 8b 0c 24 48 c7 84 24
[  846.439084] RSP: 0018:ffff961c40f9bd60 EFLAGS: 00010206
[  846.439176] RAX: 0000000000000000 RBX: ffffc5f787719b80 RCX: ffffc5f787719b80
[  846.440927] RDX: ffffffff8b9f4b88 RSI: ffffffff8b0ae622 RDI: ffff961c40f9bdb8
[  846.442083] RBP: 0000000000001000 R08: ffffc5f787719b80 R09: 0000000000001000
[  846.443284] R10: 0000000000000018 R11: fefefefefefefeff R12: ffffc5f787719b80
[  846.444448] R13: ffffc5f787719b80 R14: ffff89dff4ff88d0 R15: 0ffff89dfaddee60
[  846.445558] FS:  0000000000000000(0000) GS:ffff89dfffc00000(0000) knlGS:0000000000000000
[  846.446687] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.447796] CR2: 0000000000000008 CR3: 00000001eddd0000 CR4: 00000000000006f0

- Location
https://elixir.bootlin.com/linux/v4.18-rc4/source/fs/crypto/crypto.c#L149
	struct crypto_skcipher *tfm = ci->ci_ctfm;
Here ci can be NULL

Note that this issue maybe require CONFIG_F2FS_FS_ENCRYPTION=y to reproduce.

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
bcbfbd604d f2fs: fix to do sanity check with inline flags
https://bugzilla.kernel.org/show_bug.cgi?id=200221

- Overview
BUG() in clear_inode() when mounting and un-mounting a corrupted f2fs image

- Reproduce

- Kernel message
[  538.601448] F2FS-fs (loop0): Invalid segment/section count (31, 24 x 1376257)
[  538.601458] F2FS-fs (loop0): Can't find valid F2FS filesystem in 2th superblock
[  538.724091] F2FS-fs (loop0): Try to recover 2th superblock, ret: 0
[  538.724102] F2FS-fs (loop0): Mounted with checkpoint version = 2
[  540.970834] ------------[ cut here ]------------
[  540.970838] kernel BUG at fs/inode.c:512!
[  540.971750] invalid opcode: 0000 [#1] SMP KASAN PTI
[  540.972755] CPU: 1 PID: 1305 Comm: umount Not tainted 4.18.0-rc1+ #4
[  540.974034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  540.982913] RIP: 0010:clear_inode+0xc0/0xd0
[  540.983774] Code: 8d a3 30 01 00 00 4c 89 e7 e8 1c ec f8 ff 48 8b 83 30 01 00 00 49 39 c4 75 1a 48 c7 83 a0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 40 00 66 66 66 66 90 55
[  540.987570] RSP: 0018:ffff8801e34a7b70 EFLAGS: 00010002
[  540.988636] RAX: 0000000000000000 RBX: ffff8801e9b744e8 RCX: ffffffffb840eb3a
[  540.990063] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: ffff8801e9b746b8
[  540.991499] RBP: ffff8801e34a7b80 R08: ffffed003d36e8ce R09: ffffed003d36e8ce
[  540.992923] R10: 0000000000000001 R11: ffffed003d36e8cd R12: ffff8801e9b74668
[  540.994360] R13: ffff8801e9b74760 R14: ffff8801e9b74528 R15: ffff8801e9b74530
[  540.995786] FS:  00007f4662bdf840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  540.997403] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  540.998571] CR2: 000000000175c568 CR3: 00000001dcfe6000 CR4: 00000000000006e0
[  541.000015] Call Trace:
[  541.000554]  f2fs_evict_inode+0x253/0x630
[  541.001381]  evict+0x16f/0x290
[  541.002015]  iput+0x280/0x300
[  541.002654]  dentry_unlink_inode+0x165/0x1e0
[  541.003528]  __dentry_kill+0x16a/0x260
[  541.004300]  dentry_kill+0x70/0x250
[  541.005018]  dput+0x154/0x1d0
[  541.005635]  do_one_tree+0x34/0x40
[  541.006354]  shrink_dcache_for_umount+0x3f/0xa0
[  541.007285]  generic_shutdown_super+0x43/0x1c0
[  541.008192]  kill_block_super+0x52/0x80
[  541.008978]  kill_f2fs_super+0x62/0x70
[  541.009750]  deactivate_locked_super+0x6f/0xa0
[  541.010664]  deactivate_super+0x5e/0x80
[  541.011450]  cleanup_mnt+0x61/0xa0
[  541.012151]  __cleanup_mnt+0x12/0x20
[  541.012893]  task_work_run+0xc8/0xf0
[  541.013635]  exit_to_usermode_loop+0x125/0x130
[  541.014555]  do_syscall_64+0x138/0x170
[  541.015340]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  541.016375] RIP: 0033:0x7f46624bf487
[  541.017104] Code: 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 c9 2b 00 f7 d8 64 89 01 48
[  541.020923] RSP: 002b:00007fff5e12e9a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  541.022452] RAX: 0000000000000000 RBX: 0000000001753030 RCX: 00007f46624bf487
[  541.023885] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000000000175a1e0
[  541.025318] RBP: 000000000175a1e0 R08: 0000000000000000 R09: 0000000000000014
[  541.026755] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f46629c883c
[  541.028186] R13: 0000000000000000 R14: 0000000001753210 R15: 00007fff5e12ec30
[  541.029626] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  541.039445] ---[ end trace 4ce02f25ff7d3df5 ]---
[  541.040392] RIP: 0010:clear_inode+0xc0/0xd0
[  541.041240] Code: 8d a3 30 01 00 00 4c 89 e7 e8 1c ec f8 ff 48 8b 83 30 01 00 00 49 39 c4 75 1a 48 c7 83 a0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 40 00 66 66 66 66 90 55
[  541.045042] RSP: 0018:ffff8801e34a7b70 EFLAGS: 00010002
[  541.046099] RAX: 0000000000000000 RBX: ffff8801e9b744e8 RCX: ffffffffb840eb3a
[  541.047537] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: ffff8801e9b746b8
[  541.048965] RBP: ffff8801e34a7b80 R08: ffffed003d36e8ce R09: ffffed003d36e8ce
[  541.050402] R10: 0000000000000001 R11: ffffed003d36e8cd R12: ffff8801e9b74668
[  541.051832] R13: ffff8801e9b74760 R14: ffff8801e9b74528 R15: ffff8801e9b74530
[  541.053263] FS:  00007f4662bdf840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  541.054891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  541.056039] CR2: 000000000175c568 CR3: 00000001dcfe6000 CR4: 00000000000006e0
[  541.058506] ==================================================================
[  541.059991] BUG: KASAN: stack-out-of-bounds in update_stack_state+0x38c/0x3e0
[  541.061513] Read of size 8 at addr ffff8801e34a7970 by task umount/1305

[  541.063302] CPU: 1 PID: 1305 Comm: umount Tainted: G      D           4.18.0-rc1+ #4
[  541.064838] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  541.066778] Call Trace:
[  541.067294]  dump_stack+0x7b/0xb5
[  541.067986]  print_address_description+0x70/0x290
[  541.068941]  kasan_report+0x291/0x390
[  541.069692]  ? update_stack_state+0x38c/0x3e0
[  541.070598]  __asan_load8+0x54/0x90
[  541.071315]  update_stack_state+0x38c/0x3e0
[  541.072172]  ? __read_once_size_nocheck.constprop.7+0x20/0x20
[  541.073340]  ? vprintk_func+0x27/0x60
[  541.074096]  ? printk+0xa3/0xd3
[  541.074762]  ? __save_stack_trace+0x5e/0x100
[  541.075634]  unwind_next_frame.part.5+0x18e/0x490
[  541.076594]  ? unwind_dump+0x290/0x290
[  541.077368]  ? __show_regs+0x2c4/0x330
[  541.078142]  __unwind_start+0x106/0x190
[  541.085422]  __save_stack_trace+0x5e/0x100
[  541.086268]  ? __save_stack_trace+0x5e/0x100
[  541.087161]  ? unlink_anon_vmas+0xba/0x2c0
[  541.087997]  save_stack_trace+0x1f/0x30
[  541.088782]  save_stack+0x46/0xd0
[  541.089475]  ? __alloc_pages_slowpath+0x1420/0x1420
[  541.090477]  ? flush_tlb_mm_range+0x15e/0x220
[  541.091364]  ? __dec_node_state+0x24/0xb0
[  541.092180]  ? lock_page_memcg+0x85/0xf0
[  541.092979]  ? unlock_page_memcg+0x16/0x80
[  541.093812]  ? page_remove_rmap+0x198/0x520
[  541.094674]  ? mark_page_accessed+0x133/0x200
[  541.095559]  ? _cond_resched+0x1a/0x50
[  541.096326]  ? unmap_page_range+0xcd4/0xe50
[  541.097179]  ? rb_next+0x58/0x80
[  541.097845]  ? rb_next+0x58/0x80
[  541.098518]  __kasan_slab_free+0x13c/0x1a0
[  541.099352]  ? unlink_anon_vmas+0xba/0x2c0
[  541.100184]  kasan_slab_free+0xe/0x10
[  541.100934]  kmem_cache_free+0x89/0x1e0
[  541.101724]  unlink_anon_vmas+0xba/0x2c0
[  541.102534]  free_pgtables+0x101/0x1b0
[  541.103299]  exit_mmap+0x146/0x2a0
[  541.103996]  ? __ia32_sys_munmap+0x50/0x50
[  541.104829]  ? kasan_check_read+0x11/0x20
[  541.105649]  ? mm_update_next_owner+0x322/0x380
[  541.106578]  mmput+0x8b/0x1d0
[  541.107191]  do_exit+0x43a/0x1390
[  541.107876]  ? mm_update_next_owner+0x380/0x380
[  541.108791]  ? deactivate_super+0x5e/0x80
[  541.109610]  ? cleanup_mnt+0x61/0xa0
[  541.110351]  ? __cleanup_mnt+0x12/0x20
[  541.111115]  ? task_work_run+0xc8/0xf0
[  541.111879]  ? exit_to_usermode_loop+0x125/0x130
[  541.112817]  rewind_stack_do_exit+0x17/0x20
[  541.113666] RIP: 0033:0x7f46624bf487
[  541.114404] Code: Bad RIP value.
[  541.115094] RSP: 002b:00007fff5e12e9a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  541.116605] RAX: 0000000000000000 RBX: 0000000001753030 RCX: 00007f46624bf487
[  541.118034] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000000000175a1e0
[  541.119472] RBP: 000000000175a1e0 R08: 0000000000000000 R09: 0000000000000014
[  541.120890] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f46629c883c
[  541.122321] R13: 0000000000000000 R14: 0000000001753210 R15: 00007fff5e12ec30

[  541.124061] The buggy address belongs to the page:
[  541.125042] page:ffffea00078d29c0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[  541.126651] flags: 0x2ffff0000000000()
[  541.127418] raw: 02ffff0000000000 dead000000000100 dead000000000200 0000000000000000
[  541.128963] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[  541.130516] page dumped because: kasan: bad access detected

[  541.131954] Memory state around the buggy address:
[  541.132924]  ffff8801e34a7800: 00 f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00 00
[  541.134378]  ffff8801e34a7880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  541.135814] >ffff8801e34a7900: 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1
[  541.137253]                                                              ^
[  541.138637]  ffff8801e34a7980: f1 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  541.140075]  ffff8801e34a7a00: 00 00 00 00 00 00 00 00 f3 00 00 00 00 00 00 00
[  541.141509] ==================================================================

- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/inode.c#L512
	BUG_ON(inode->i_data.nrpages);

The root cause is root directory inode is corrupted, it has both
inline_data and inline_dentry flag, and its nlink is zero, so in
->evict(), after dropping all page cache, it grabs page #0 for inline
data truncation, result in panic in later clear_inode() where we will
check inode->i_data.nrpages value.

This patch adds inline flags check in sanity_check_inode, in addition,
do sanity check with root inode's nlink.

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-13 10:48:17 -07:00
Chao Yu
3093336481 f2fs: fix to reset i_gc_failures correctly
Let's reset i_gc_failures to zero when we unset pinned state for file.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:39:26 -07:00
Chao Yu
d3f07c049d f2fs: fix invalid memory access
syzbot found the following crash on:

HEAD commit:    d9bd94c0bcaa Add linux-next specific files for 20180801
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1001189c400000
kernel config:  https://syzkaller.appspot.com/x/.config?x=cc8964ea4d04518c
dashboard link: https://syzkaller.appspot.com/bug?extid=c966a82db0b14aa37e81
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+c966a82db0b14aa37e81@syzkaller.appspotmail.com

loop7: rw=12288, want=8200, limit=20
netlink: 65342 bytes leftover after parsing attributes in process `syz-executor4'.
openvswitch: netlink: Message has 8 unknown bytes.
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
CPU: 1 PID: 7615 Comm: syz-executor7 Not tainted 4.18.0-rc7-next-20180801+ #29
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:188 [inline]
RIP: 0010:compound_head include/linux/page-flags.h:142 [inline]
RIP: 0010:PageLocked include/linux/page-flags.h:272 [inline]
RIP: 0010:f2fs_put_page fs/f2fs/f2fs.h:2011 [inline]
RIP: 0010:validate_checkpoint+0x66d/0xec0 fs/f2fs/checkpoint.c:835
Code: e8 58 05 7f fe 4c 8d 6b 80 4d 8d 74 24 08 48 b8 00 00 00 00 00 fc ff df 4c 89 ea 48 c1 ea 03 c6 04 02 00 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 f4 06 00 00 4c 89 ea 4d 8b 7c 24 08 48 b8 00 00
RSP: 0018:ffff8801937cebe8 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffff8801937cef30 RCX: ffffc90006035000
RDX: 0000000000000000 RSI: ffffffff82fd9658 RDI: 0000000000000005
RBP: ffff8801937cef58 R08: ffff8801ab254700 R09: fffff94000d9e026
R10: fffff94000d9e026 R11: ffffea0006cf0137 R12: fffffffffffffffb
R13: ffff8801937ceeb0 R14: 0000000000000003 R15: ffff880193419b40
FS:  00007f36a61d5700(0000) GS:ffff8801db100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc04ff93000 CR3: 00000001d0562000 CR4: 00000000001426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 f2fs_get_valid_checkpoint+0x436/0x1ec0 fs/f2fs/checkpoint.c:860
 f2fs_fill_super+0x2d42/0x8110 fs/f2fs/super.c:2883
 mount_bdev+0x314/0x3e0 fs/super.c:1344
 f2fs_mount+0x3c/0x50 fs/f2fs/super.c:3133
 legacy_get_tree+0x131/0x460 fs/fs_context.c:729
 vfs_get_tree+0x1cb/0x5c0 fs/super.c:1743
 do_new_mount fs/namespace.c:2603 [inline]
 do_mount+0x6f2/0x1e20 fs/namespace.c:2927
 ksys_mount+0x12d/0x140 fs/namespace.c:3143
 __do_sys_mount fs/namespace.c:3157 [inline]
 __se_sys_mount fs/namespace.c:3154 [inline]
 __x64_sys_mount+0xbe/0x150 fs/namespace.c:3154
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x45943a
Code: b8 a6 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 bd 8a fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 9a 8a fb ff c3 66 0f 1f 84 00 00 00 00 00
RSP: 002b:00007f36a61d4a88 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f36a61d4b30 RCX: 000000000045943a
RDX: 00007f36a61d4ad0 RSI: 0000000020000100 RDI: 00007f36a61d4af0
RBP: 0000000020000100 R08: 00007f36a61d4b30 R09: 00007f36a61d4ad0
R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000013
R13: 0000000000000000 R14: 00000000004c8ea0 R15: 0000000000000000
Modules linked in:
Dumping ftrace buffer:
   (ftrace buffer empty)
---[ end trace bd8550c129352286 ]---
RIP: 0010:__read_once_size include/linux/compiler.h:188 [inline]
RIP: 0010:compound_head include/linux/page-flags.h:142 [inline]
RIP: 0010:PageLocked include/linux/page-flags.h:272 [inline]
RIP: 0010:f2fs_put_page fs/f2fs/f2fs.h:2011 [inline]
RIP: 0010:validate_checkpoint+0x66d/0xec0 fs/f2fs/checkpoint.c:835
Code: e8 58 05 7f fe 4c 8d 6b 80 4d 8d 74 24 08 48 b8 00 00 00 00 00 fc ff df 4c 89 ea 48 c1 ea 03 c6 04 02 00 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 f4 06 00 00 4c 89 ea 4d 8b 7c 24 08 48 b8 00 00
RSP: 0018:ffff8801937cebe8 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffff8801937cef30 RCX: ffffc90006035000
RDX: 0000000000000000 RSI: ffffffff82fd9658 RDI: 0000000000000005
netlink: 65342 bytes leftover after parsing attributes in process `syz-executor4'.
RBP: ffff8801937cef58 R08: ffff8801ab254700 R09: fffff94000d9e026
openvswitch: netlink: Message has 8 unknown bytes.
R10: fffff94000d9e026 R11: ffffea0006cf0137 R12: fffffffffffffffb
R13: ffff8801937ceeb0 R14: 0000000000000003 R15: ffff880193419b40
FS:  00007f36a61d5700(0000) GS:ffff8801db100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc04ff93000 CR3: 00000001d0562000 CR4: 00000000001426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

In validate_checkpoint(), if we failed to call get_checkpoint_version(), we
will pass returned invalid page pointer into f2fs_put_page, cause accessing
invalid memory, this patch tries to handle error path correctly to fix this
issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:19:05 -07:00
Chao Yu
50fa53eccf f2fs: fix to avoid broken of dnode block list
f2fs recovery flow is relying on dnode block link list, it means fsynced
file recovery depends on previous dnode's persistence in the list, so
during fsync() we should wait on all regular inode's dnode writebacked
before issuing flush.

By this way, we can avoid dnode block list being broken by out-of-order
IO submission due to IO scheduler or driver.

Sheng Yong helps to do the test with this patch:

Target:/data (f2fs, -)
64MB / 32768KB / 4KB / 8

1 / PERSIST / Index

Base:
	SEQ-RD(MB/s)	SEQ-WR(MB/s)	RND-RD(IOPS)	RND-WR(IOPS)	Insert(TPS)	Update(TPS)	Delete(TPS)
1	867.82		204.15		41440.03	41370.54	680.8		1025.94		1031.08
2	871.87		205.87		41370.3		40275.2		791.14		1065.84		1101.7
3	866.52		205.69		41795.67	40596.16	694.69		1037.16		1031.48
Avg	868.7366667	205.2366667	41535.33333	40747.3		722.21		1042.98		1054.753333

After:
	SEQ-RD(MB/s)	SEQ-WR(MB/s)	RND-RD(IOPS)	RND-WR(IOPS)	Insert(TPS)	Update(TPS)	Delete(TPS)
1	798.81		202.5		41143		40613.87	602.71		838.08		913.83
2	805.79		206.47		40297.2		41291.46	604.44		840.75		924.27
3	814.83		206.17		41209.57	40453.62	602.85		834.66		927.91
Avg	806.4766667	205.0466667	40883.25667	40786.31667	603.3333333	837.83		922.0033333

Patched/Original:
	0.928332713	0.999074239	0.984300676	1.000957528	0.835398753	0.803303994	0.874141189

It looks like atomic write will suffer performance regression.

I suspect that the criminal is that we forcing to wait all dnode being in
storage cache before we issue PREFLUSH+FUA.

BTW, will commit ("f2fs: don't need to wait for node writes for atomic write")
cause the problem: we will lose data of last transaction after SPO, even if
atomic write return no error:

- atomic_open();
- write() P1, P2, P3;
- atomic_commit();
 - writeback data: P1, P2, P3;
 - writeback node: N1, N2, N3;  <--- If N1, N2 is not writebacked, N3 with fsync_mark is
writebacked, In SPOR, we won't find N3 since node chain is broken, turns out that losing
last transaction.
 - preflush + fua;
- power-cut

If we don't wait dnode writeback for atomic_write:

	SEQ-RD(MB/s)	SEQ-WR(MB/s)	RND-RD(IOPS)	RND-WR(IOPS)	Insert(TPS)	Update(TPS)	Delete(TPS)
1	779.91		206.03		41621.5		40333.16	716.9		1038.21		1034.85
2	848.51		204.35		40082.44	39486.17	791.83		1119.96		1083.77
3	772.12		206.27		41335.25	41599.65	723.29		1055.07		971.92
Avg	800.18		205.55		41013.06333	40472.99333	744.0066667	1071.08		1030.18

Patched/Original:
	0.92108464	1.001526693	0.987425886	0.993268102	1.030180511	1.026942031	0.976702294

SQLite's performance recovers.

Jaegeuk:
"Practically, I don't see db corruption becase of this. We can excuse to lose
the last transaction."

Finally, we decide to keep original implementation of atomic write interface
sematics that we don't wait all dnode writeback before preflush+fua submission.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:19:05 -07:00
Gustavo A. R. Silva
6e45f2a59f f2fs: use true and false for boolean values
Return statements in functions returning bool should use true or false
instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:19:05 -07:00
Chao Yu
e494c2f995 f2fs: fix to do sanity check with cp_pack_start_sum
After fuzzing, cp_pack_start_sum could be corrupted, so current log's
summary info should be wrong due to loading incorrect summary block.
Then, if segment's type in current log is exceeded NR_CURSEG_TYPE, it
can lead accessing invalid dirty_i->dirty_segmap bitmap finally.

Add sanity check for cp_pack_start_sum to fix this issue.

https://bugzilla.kernel.org/show_bug.cgi?id=200419

- Reproduce

- Kernel message (f2fs-dev w/ KASAN)
[ 3117.578432] F2FS-fs (loop0): Invalid log blocks per segment (8)

[ 3117.578445] F2FS-fs (loop0): Can't find valid F2FS filesystem in 2th superblock
[ 3117.581364] F2FS-fs (loop0): invalid crc_offset: 30716
[ 3117.583564] WARNING: CPU: 1 PID: 1225 at fs/f2fs/checkpoint.c:90 __get_meta_page+0x448/0x4b0
[ 3117.583570] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 3117.584014] CPU: 1 PID: 1225 Comm: mount Not tainted 4.17.0+ #1
[ 3117.584017] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.584022] RIP: 0010:__get_meta_page+0x448/0x4b0
[ 3117.584023] Code: 00 49 8d bc 24 84 00 00 00 e8 74 54 da ff 41 83 8c 24 84 00 00 00 08 4c 89 f6 4c 89 ef e8 c0 d9 95 00 48 89 ef e8 18 e3 00 00 <0f> 0b f0 80 4d 48 04 e9 0f fe ff ff 0f 0b 48 89 c7 48 89 04 24 e8
[ 3117.584072] RSP: 0018:ffff88018eb678c0 EFLAGS: 00010286
[ 3117.584082] RAX: ffff88018f0a6a78 RBX: ffffea0007a46600 RCX: ffffffff9314d1b2
[ 3117.584085] RDX: ffffffff00000001 RSI: 0000000000000000 RDI: ffff88018f0a6a98
[ 3117.584087] RBP: ffff88018ebe9980 R08: 0000000000000002 R09: 0000000000000001
[ 3117.584090] R10: 0000000000000001 R11: ffffed00326e4450 R12: ffff880193722200
[ 3117.584092] R13: ffff88018ebe9afc R14: 0000000000000206 R15: ffff88018eb67900
[ 3117.584096] FS:  00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3117.584098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3117.584101] CR2: 00000000016f21b8 CR3: 0000000191c22000 CR4: 00000000000006e0
[ 3117.584112] Call Trace:
[ 3117.584121]  ? f2fs_set_meta_page_dirty+0x150/0x150
[ 3117.584127]  ? f2fs_build_segment_manager+0xbf9/0x3190
[ 3117.584133]  ? f2fs_npages_for_summary_flush+0x75/0x120
[ 3117.584145]  f2fs_build_segment_manager+0xda8/0x3190
[ 3117.584151]  ? f2fs_get_valid_checkpoint+0x298/0xa00
[ 3117.584156]  ? f2fs_flush_sit_entries+0x10e0/0x10e0
[ 3117.584184]  ? map_id_range_down+0x17c/0x1b0
[ 3117.584188]  ? __put_user_ns+0x30/0x30
[ 3117.584206]  ? find_next_bit+0x53/0x90
[ 3117.584237]  ? cpumask_next+0x16/0x20
[ 3117.584249]  f2fs_fill_super+0x1948/0x2b40
[ 3117.584258]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584279]  ? sget_userns+0x65e/0x690
[ 3117.584296]  ? set_blocksize+0x88/0x130
[ 3117.584302]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584305]  mount_bdev+0x1c0/0x200
[ 3117.584310]  mount_fs+0x5c/0x190
[ 3117.584320]  vfs_kern_mount+0x64/0x190
[ 3117.584330]  do_mount+0x2e4/0x1450
[ 3117.584343]  ? lockref_put_return+0x130/0x130
[ 3117.584347]  ? copy_mount_string+0x20/0x20
[ 3117.584357]  ? kasan_unpoison_shadow+0x31/0x40
[ 3117.584362]  ? kasan_kmalloc+0xa6/0xd0
[ 3117.584373]  ? memcg_kmem_put_cache+0x16/0x90
[ 3117.584377]  ? __kmalloc_track_caller+0x196/0x210
[ 3117.584383]  ? _copy_from_user+0x61/0x90
[ 3117.584396]  ? memdup_user+0x3e/0x60
[ 3117.584401]  ksys_mount+0x7e/0xd0
[ 3117.584405]  __x64_sys_mount+0x62/0x70
[ 3117.584427]  do_syscall_64+0x73/0x160
[ 3117.584440]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.584455] RIP: 0033:0x7f5693f14b9a
[ 3117.584456] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.584505] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.584510] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.584512] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.584514] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.584516] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.584519] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.584523] ---[ end trace a8e0d899985faf31 ]---
[ 3117.685663] F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=2, run fsck to fix.
[ 3117.685673] F2FS-fs (loop0): recover_data: ino = 2 (i_size: recover) recovered = 1, err = 0
[ 3117.685707] ==================================================================
[ 3117.685955] BUG: KASAN: slab-out-of-bounds in __remove_dirty_segment+0xdd/0x1e0
[ 3117.686175] Read of size 8 at addr ffff88018f0a63d0 by task mount/1225

[ 3117.686477] CPU: 0 PID: 1225 Comm: mount Tainted: G        W         4.17.0+ #1
[ 3117.686481] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.686483] Call Trace:
[ 3117.686494]  dump_stack+0x71/0xab
[ 3117.686512]  print_address_description+0x6b/0x290
[ 3117.686517]  kasan_report+0x28e/0x390
[ 3117.686522]  ? __remove_dirty_segment+0xdd/0x1e0
[ 3117.686527]  __remove_dirty_segment+0xdd/0x1e0
[ 3117.686532]  locate_dirty_segment+0x189/0x190
[ 3117.686538]  f2fs_allocate_new_segments+0xa9/0xe0
[ 3117.686543]  recover_data+0x703/0x2c20
[ 3117.686547]  ? f2fs_recover_fsync_data+0x48f/0xd50
[ 3117.686553]  ? ksys_mount+0x7e/0xd0
[ 3117.686564]  ? policy_nodemask+0x1a/0x90
[ 3117.686567]  ? policy_node+0x56/0x70
[ 3117.686571]  ? add_fsync_inode+0xf0/0xf0
[ 3117.686592]  ? blk_finish_plug+0x44/0x60
[ 3117.686597]  ? f2fs_ra_meta_pages+0x38b/0x5e0
[ 3117.686602]  ? find_inode_fast+0xac/0xc0
[ 3117.686606]  ? f2fs_is_valid_blkaddr+0x320/0x320
[ 3117.686618]  ? __radix_tree_lookup+0x150/0x150
[ 3117.686633]  ? dqget+0x670/0x670
[ 3117.686648]  ? pagecache_get_page+0x29/0x410
[ 3117.686656]  ? kmem_cache_alloc+0x176/0x1e0
[ 3117.686660]  ? f2fs_is_valid_blkaddr+0x11d/0x320
[ 3117.686664]  f2fs_recover_fsync_data+0xc23/0xd50
[ 3117.686670]  ? f2fs_space_for_roll_forward+0x60/0x60
[ 3117.686674]  ? rb_insert_color+0x323/0x3d0
[ 3117.686678]  ? f2fs_recover_orphan_inodes+0xa5/0x700
[ 3117.686683]  ? proc_register+0x153/0x1d0
[ 3117.686686]  ? f2fs_remove_orphan_inode+0x10/0x10
[ 3117.686695]  ? f2fs_attr_store+0x50/0x50
[ 3117.686700]  ? proc_create_single_data+0x52/0x60
[ 3117.686707]  f2fs_fill_super+0x1d06/0x2b40
[ 3117.686728]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.686735]  ? sget_userns+0x65e/0x690
[ 3117.686740]  ? set_blocksize+0x88/0x130
[ 3117.686745]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.686748]  mount_bdev+0x1c0/0x200
[ 3117.686753]  mount_fs+0x5c/0x190
[ 3117.686758]  vfs_kern_mount+0x64/0x190
[ 3117.686762]  do_mount+0x2e4/0x1450
[ 3117.686769]  ? lockref_put_return+0x130/0x130
[ 3117.686773]  ? copy_mount_string+0x20/0x20
[ 3117.686777]  ? kasan_unpoison_shadow+0x31/0x40
[ 3117.686780]  ? kasan_kmalloc+0xa6/0xd0
[ 3117.686786]  ? memcg_kmem_put_cache+0x16/0x90
[ 3117.686790]  ? __kmalloc_track_caller+0x196/0x210
[ 3117.686795]  ? _copy_from_user+0x61/0x90
[ 3117.686801]  ? memdup_user+0x3e/0x60
[ 3117.686804]  ksys_mount+0x7e/0xd0
[ 3117.686809]  __x64_sys_mount+0x62/0x70
[ 3117.686816]  do_syscall_64+0x73/0x160
[ 3117.686824]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.686829] RIP: 0033:0x7f5693f14b9a
[ 3117.686830] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.686887] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.686892] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.686894] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.686896] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.686899] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.686901] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003

[ 3117.687005] Allocated by task 1225:
[ 3117.687152]  kasan_kmalloc+0xa6/0xd0
[ 3117.687157]  kmem_cache_alloc_trace+0xfd/0x200
[ 3117.687161]  f2fs_build_segment_manager+0x2d09/0x3190
[ 3117.687165]  f2fs_fill_super+0x1948/0x2b40
[ 3117.687168]  mount_bdev+0x1c0/0x200
[ 3117.687171]  mount_fs+0x5c/0x190
[ 3117.687174]  vfs_kern_mount+0x64/0x190
[ 3117.687177]  do_mount+0x2e4/0x1450
[ 3117.687180]  ksys_mount+0x7e/0xd0
[ 3117.687182]  __x64_sys_mount+0x62/0x70
[ 3117.687186]  do_syscall_64+0x73/0x160
[ 3117.687190]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 3117.687285] Freed by task 19:
[ 3117.687412]  __kasan_slab_free+0x137/0x190
[ 3117.687416]  kfree+0x8b/0x1b0
[ 3117.687460]  ttm_bo_man_put_node+0x61/0x80 [ttm]
[ 3117.687476]  ttm_bo_cleanup_refs+0x15f/0x250 [ttm]
[ 3117.687492]  ttm_bo_delayed_delete+0x2f0/0x300 [ttm]
[ 3117.687507]  ttm_bo_delayed_workqueue+0x17/0x50 [ttm]
[ 3117.687528]  process_one_work+0x2f9/0x740
[ 3117.687531]  worker_thread+0x78/0x6b0
[ 3117.687541]  kthread+0x177/0x1c0
[ 3117.687545]  ret_from_fork+0x35/0x40

[ 3117.687638] The buggy address belongs to the object at ffff88018f0a6300
                which belongs to the cache kmalloc-192 of size 192
[ 3117.688014] The buggy address is located 16 bytes to the right of
                192-byte region [ffff88018f0a6300, ffff88018f0a63c0)
[ 3117.688382] The buggy address belongs to the page:
[ 3117.688554] page:ffffea00063c2980 count:1 mapcount:0 mapping:ffff8801f3403180 index:0x0
[ 3117.688788] flags: 0x17fff8000000100(slab)
[ 3117.688944] raw: 017fff8000000100 ffffea00063c2840 0000000e0000000e ffff8801f3403180
[ 3117.689166] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
[ 3117.689386] page dumped because: kasan: bad access detected

[ 3117.689653] Memory state around the buggy address:
[ 3117.689816]  ffff88018f0a6280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 3117.690027]  ffff88018f0a6300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 3117.690239] >ffff88018f0a6380: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3117.690448]                                                  ^
[ 3117.690644]  ffff88018f0a6400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 3117.690868]  ffff88018f0a6480: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3117.691077] ==================================================================
[ 3117.691290] Disabling lock debugging due to kernel taint
[ 3117.693893] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 3117.694120] PGD 80000001f01bc067 P4D 80000001f01bc067 PUD 1d9638067 PMD 0
[ 3117.694338] Oops: 0002 [#1] SMP KASAN PTI
[ 3117.694490] CPU: 1 PID: 1225 Comm: mount Tainted: G    B   W         4.17.0+ #1
[ 3117.694703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.695073] RIP: 0010:__remove_dirty_segment+0xe2/0x1e0
[ 3117.695246] Code: c4 48 89 c7 e8 cf bb d7 ff 45 0f b6 24 24 41 83 e4 3f 44 88 64 24 07 41 83 e4 3f 4a 8d 7c e3 08 e8 b3 bc d7 ff 4a 8b 4c e3 08 <f0> 4c 0f b3 29 0f 82 94 00 00 00 48 8d bd 20 04 00 00 e8 97 bb d7
[ 3117.695793] RSP: 0018:ffff88018eb67638 EFLAGS: 00010292
[ 3117.695969] RAX: 0000000000000000 RBX: ffff88018f0a6300 RCX: 0000000000000000
[ 3117.696182] RDX: 0000000000000000 RSI: 0000000000000297 RDI: 0000000000000297
[ 3117.696391] RBP: ffff88018ebe9980 R08: ffffed003e743ebb R09: ffffed003e743ebb
[ 3117.696604] R10: 0000000000000001 R11: ffffed003e743eba R12: 0000000000000019
[ 3117.696813] R13: 0000000000000014 R14: 0000000000000320 R15: ffff88018ebe99e0
[ 3117.697032] FS:  00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3117.697280] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3117.702357] CR2: 00007fe89bb1a000 CR3: 0000000191c22000 CR4: 00000000000006e0
[ 3117.707235] Call Trace:
[ 3117.712077]  locate_dirty_segment+0x189/0x190
[ 3117.716891]  f2fs_allocate_new_segments+0xa9/0xe0
[ 3117.721617]  recover_data+0x703/0x2c20
[ 3117.726316]  ? f2fs_recover_fsync_data+0x48f/0xd50
[ 3117.730957]  ? ksys_mount+0x7e/0xd0
[ 3117.735573]  ? policy_nodemask+0x1a/0x90
[ 3117.740198]  ? policy_node+0x56/0x70
[ 3117.744829]  ? add_fsync_inode+0xf0/0xf0
[ 3117.749487]  ? blk_finish_plug+0x44/0x60
[ 3117.754152]  ? f2fs_ra_meta_pages+0x38b/0x5e0
[ 3117.758831]  ? find_inode_fast+0xac/0xc0
[ 3117.763448]  ? f2fs_is_valid_blkaddr+0x320/0x320
[ 3117.768046]  ? __radix_tree_lookup+0x150/0x150
[ 3117.772603]  ? dqget+0x670/0x670
[ 3117.777159]  ? pagecache_get_page+0x29/0x410
[ 3117.781648]  ? kmem_cache_alloc+0x176/0x1e0
[ 3117.786067]  ? f2fs_is_valid_blkaddr+0x11d/0x320
[ 3117.790476]  f2fs_recover_fsync_data+0xc23/0xd50
[ 3117.794790]  ? f2fs_space_for_roll_forward+0x60/0x60
[ 3117.799086]  ? rb_insert_color+0x323/0x3d0
[ 3117.803304]  ? f2fs_recover_orphan_inodes+0xa5/0x700
[ 3117.807563]  ? proc_register+0x153/0x1d0
[ 3117.811766]  ? f2fs_remove_orphan_inode+0x10/0x10
[ 3117.815947]  ? f2fs_attr_store+0x50/0x50
[ 3117.820087]  ? proc_create_single_data+0x52/0x60
[ 3117.824262]  f2fs_fill_super+0x1d06/0x2b40
[ 3117.828367]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.832432]  ? sget_userns+0x65e/0x690
[ 3117.836500]  ? set_blocksize+0x88/0x130
[ 3117.840501]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.844420]  mount_bdev+0x1c0/0x200
[ 3117.848275]  mount_fs+0x5c/0x190
[ 3117.852053]  vfs_kern_mount+0x64/0x190
[ 3117.855810]  do_mount+0x2e4/0x1450
[ 3117.859441]  ? lockref_put_return+0x130/0x130
[ 3117.862996]  ? copy_mount_string+0x20/0x20
[ 3117.866417]  ? kasan_unpoison_shadow+0x31/0x40
[ 3117.869719]  ? kasan_kmalloc+0xa6/0xd0
[ 3117.872948]  ? memcg_kmem_put_cache+0x16/0x90
[ 3117.876121]  ? __kmalloc_track_caller+0x196/0x210
[ 3117.879333]  ? _copy_from_user+0x61/0x90
[ 3117.882467]  ? memdup_user+0x3e/0x60
[ 3117.885604]  ksys_mount+0x7e/0xd0
[ 3117.888700]  __x64_sys_mount+0x62/0x70
[ 3117.891742]  do_syscall_64+0x73/0x160
[ 3117.894692]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.897669] RIP: 0033:0x7f5693f14b9a
[ 3117.900563] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.906922] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.910159] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.913469] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.916764] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.920071] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.923393] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.926680] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 3117.949979] CR2: 0000000000000000
[ 3117.954283] ---[ end trace a8e0d899985faf32 ]---
[ 3117.958575] RIP: 0010:__remove_dirty_segment+0xe2/0x1e0
[ 3117.962810] Code: c4 48 89 c7 e8 cf bb d7 ff 45 0f b6 24 24 41 83 e4 3f 44 88 64 24 07 41 83 e4 3f 4a 8d 7c e3 08 e8 b3 bc d7 ff 4a 8b 4c e3 08 <f0> 4c 0f b3 29 0f 82 94 00 00 00 48 8d bd 20 04 00 00 e8 97 bb d7
[ 3117.971789] RSP: 0018:ffff88018eb67638 EFLAGS: 00010292
[ 3117.976333] RAX: 0000000000000000 RBX: ffff88018f0a6300 RCX: 0000000000000000
[ 3117.980926] RDX: 0000000000000000 RSI: 0000000000000297 RDI: 0000000000000297
[ 3117.985497] RBP: ffff88018ebe9980 R08: ffffed003e743ebb R09: ffffed003e743ebb
[ 3117.990098] R10: 0000000000000001 R11: ffffed003e743eba R12: 0000000000000019
[ 3117.994761] R13: 0000000000000014 R14: 0000000000000320 R15: ffff88018ebe99e0
[ 3117.999392] FS:  00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3118.004096] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3118.008816] CR2: 00007fe89bb1a000 CR3: 0000000191c22000 CR4: 00000000000006e0

- Location
https://elixir.bootlin.com/linux/v4.18-rc3/source/fs/f2fs/segment.c#L775
		if (test_and_clear_bit(segno, dirty_i->dirty_segmap[t]))
			dirty_i->nr_dirty[t]--;
Here dirty_i->dirty_segmap[t] can be NULL which leads to crash in test_and_clear_bit()

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:19:05 -07:00
Jaegeuk Kim
8d714f8aa3 f2fs: avoid f2fs_bug_on() in cp_error case
There is a subtle race condition to invoke f2fs_bug_on() in shutdown tests. I've
confirmed that the last checkpoint is preserved in consistent state, so it'd be
fine to just return error at this moment.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:19:05 -07:00
Chao Yu
66110abc4c f2fs: fix to clear PG_checked flag in set_page_dirty()
PG_checked flag will be set on data page during GC, later, we can
recognize such page by the flag and migrate page to cold segment.

But previously, we don't clear this flag when invalidating data page,
after page redirtying, we will write it into wrong log.

Let's clear PG_checked flag in set_page_dirty() to avoid this.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-10 16:19:05 -07:00
Chao Yu
82cf4f132e f2fs: fix to active page in lru list for read path
If config CONFIG_F2FS_FAULT_INJECTION is on, for both read or write path
we will call find_lock_page() to get the page, but for read path, it
missed to passing FGP_ACCESSED to allocator to active the page in LRU
list, result in being reclaimed in advance incorrectly, fix it.

Reported-by: Xianrong Zhou <zhouxianrong@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
18767e6263 f2fs: don't keep meta pages used for block migration
For migration of encrypted inode's block, we load data of encrypted block
into meta inode's page cache, after checkpoint, those all intermediate
pages should be clean, and no one will read them again, so let's just
release them for more memory.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
4ddc1b28aa f2fs: fix to restrict mount condition when without CONFIG_QUOTA
Like quota_ino feature, we need to reject mounting RDWR with image
which enables project_quota feature when there is no CONFIG_QUOTA
be set in kernel.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Sheng Yong
00960c2cd8 f2fs: quota: do not mount as RDWR without QUOTA if quota feature enabled
If quota feature is enabled, quota is on by default. However, if
CONFIG_QUOTA is not built in kernel, dquot entries will not get updated,
which leads to quota inconsistency.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Sheng Yong
76cf05d79c f2fs: quota: fix incorrect comments
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Sheng Yong
955ac6e523 f2fs: quota: decrease the lock granularity of statfs_project
According to fs/quota/dquot.c, `dq_data_lock' protects mem_dqinfo
structures and modifications of dquot pointers in the inode, and
`dquot->dq_dqb_lock' protects data from dq_dqb.

We should use dquot->dq_dqb_lock in statfs_project instead of
dq_dat_lock.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlong Song
970e348d98 f2fs: add proc entry to show victim_secmap bitmap
This patch adds a new proc entry to show victim_secmap information in
more detail, which is very helpful to know the get_victim candidate
status clearly, and helpful to debug problems (e.g., some sections can
not gc all of its blocks, since some blocks belong to atomic file,
leaving victim_secmap with section bit setting, in extrem case, this
will lead all bytes of victim_secmap setting with 0xff).

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
fd8c8caf7e f2fs: let checkpoint flush dnode page of regular
Fsyncer will wait on all dnode pages of regular writeback before flushing,
if there are async dnode pages blocked by IO scheduler, it may decrease
fsync's performance.

In this patch, we choose to let f2fs_balance_fs_bg() to trigger checkpoint
to flush these dnode pages of regular, so async IO of dnode page can be
elimitnated, making fsyncer only need to wait for sync IO.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlong Song
ad6672bbc5 f2fs: issue discard align to section in LFS mode
For the case when sbi->segs_per_sec > 1 with lfs mode, take
section:segment = 5 for example, if the section prefree_map is
...previous section | current section (1 1 0 1 1) | next section...,
then the start = x, end = x + 1, after start = start_segno +
sbi->segs_per_sec, start = x + 5, then it will skip x + 3 and x + 4, but
their bitmap is still set, which will cause duplicated
f2fs_issue_discard of this same section in the next write_checkpoint:

round 1: section bitmap : 1 1 1 1 1, all valid, prefree_map: 0 0 0 0 0
then rm data block NO.2, block NO.2 becomes invalid, prefree_map: 0 0 1 0 0
write_checkpoint: section bitmap: 1 1 0 1 1, prefree_map: 0 0 0 0 0,
prefree of NO.2 is cleared, and no discard issued

round 2: rm data block NO.0, NO.1, NO.3, NO.4
all invalid, but prefree bit of NO.2 is set and cleared in round 1, then
prefree_map: 1 1 0 1 1
write_checkpoint: section bitmap: 0 0 0 0 0, prefree_map: 0 0 0 1 1, no
valid blocks of this section, so discard issued, but this time prefree
bit of NO.3 and NO.4 is skipped due to start = start_segno + sbi->segs_per_sec;

round 3:
write_checkpoint: section bitmap: 0 0 0 0 0, prefree_map: 0 0 0 1 1 ->
0 0 0 0 0, no valid blocks of this section, so discard issued,
this time prefree bit of NO.3 and NO.4 is cleared, but the discard of
this section is sent again...

To fix this problem, we can align the start and end value to section
boundary for fstrim and real-time discard operation, and decide to issue
discard only when the whole section is invalid, which can issue discard
aligned to section size as much as possible and avoid redundant discard.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Jaegeuk Kim
455e3a5887 f2fs: don't allow any writes on aborted atomic writes
In order to prevent abusing atomic writes by abnormal users, we've added a
threshold, 20% over memory footprint, which disallows further atomic writes.
Previously, however, SQLite doesn't know the files became normal, so that
it could write stale data and commit on revoked normal database file.

Once f2fs detects such the abnormal behavior, this patch tries to avoid further
writes in write_begin().

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
797c1cb56b f2fs: restrict setting up inode.i_advise
In order to give advise to f2fs to recognize hot/cold file, it is possible
that we can set specific bit in inode.i_advise through setxattr(), but
there are several bits which are used internally, such as encrypt_bit,
keep_size_bit, they should never be changed through setxattr().

So that this patch 1) adds FADVISE_MODIFIABLE_BITS to filter modifiable
bits user given, 2) supports to clear {hot,cold}_file bits.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlei He
e6b0b159cf f2fs: fix wrong kernel message when recover fsync data on ro fs
This patch fix wrong message info for recover fsync data
on readonly fs.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
059c0648c6 f2fs: clean up ioctl interface naming
Romve redundant prefix 'f2fs_' in the middle of f2fs_ioc_f2fs_write_checkpoint().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
2079f115e7 f2fs: clean up with f2fs_is_{atomic,volatile}_file()
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
5b72d5e0df f2fs: clean up with f2fs_encrypted_inode()
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
80551d1773 f2fs: clean up with get_current_nat_page
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
6122003a1a f2fs: kill EXT_TREE_VEC_SIZE
Since commit 201ef5e080 ("f2fs: improve shrink performance of extent nodes"),
there is no user of EXT_TREE_VEC_SIZE, just kill it for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Hyunchul Lee
5d3ce4f701 f2fs: avoid duplicated permission check for "trusted." xattrs
Because xattr_permission already checks CAP_SYS_ADMIN
capability, we don't need to check it.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
7735730d39 f2fs: fix to propagate error from __get_meta_page()
If caller of __get_meta_page() can handle error, let's propagate error
from __get_meta_page().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
18dd6470c2 f2fs: fix to do sanity check with i_extra_isize
If inode.i_extra_isize was fuzzed to an abnormal value, when
calculating inline data size, the result will overflow, result
in accessing invalid memory area when operating inline data.

Let's do sanity check with i_extra_isize during inode loading
for fixing.

https://bugzilla.kernel.org/show_bug.cgi?id=200421

- Reproduce

- POC (poc.c)
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/mount.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/xattr.h>

    #include <dirent.h>
    #include <errno.h>
    #include <error.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #include <linux/falloc.h>
    #include <linux/loop.h>

    static void activity(char *mpoint) {

      char *foo_bar_baz;
      char *foo_baz;
      char *xattr;
      int err;

      err = asprintf(&foo_bar_baz, "%s/foo/bar/baz", mpoint);
      err = asprintf(&foo_baz, "%s/foo/baz", mpoint);
      err = asprintf(&xattr, "%s/foo/bar/xattr", mpoint);

      rename(foo_bar_baz, foo_baz);

      char buf2[113];
      memset(buf2, 0, sizeof(buf2));
      listxattr(xattr, buf2, sizeof(buf2));
      removexattr(xattr, "user.mime_type");

    }

    int main(int argc, char *argv[]) {
      activity(argv[1]);
      return 0;
    }

- Kernel message
Umount the image will leave the following message
[ 2910.995489] F2FS-fs (loop0): Mounted with checkpoint version = 2
[ 2918.416465] ==================================================================
[ 2918.416807] BUG: KASAN: slab-out-of-bounds in f2fs_iget+0xcb9/0x1a80
[ 2918.417009] Read of size 4 at addr ffff88018efc2068 by task a.out/1229

[ 2918.417311] CPU: 1 PID: 1229 Comm: a.out Not tainted 4.17.0+ #1
[ 2918.417314] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 2918.417323] Call Trace:
[ 2918.417366]  dump_stack+0x71/0xab
[ 2918.417401]  print_address_description+0x6b/0x290
[ 2918.417407]  kasan_report+0x28e/0x390
[ 2918.417411]  ? f2fs_iget+0xcb9/0x1a80
[ 2918.417415]  f2fs_iget+0xcb9/0x1a80
[ 2918.417422]  ? f2fs_lookup+0x2e7/0x580
[ 2918.417425]  f2fs_lookup+0x2e7/0x580
[ 2918.417433]  ? __recover_dot_dentries+0x400/0x400
[ 2918.417447]  ? legitimize_path.isra.29+0x5a/0xa0
[ 2918.417453]  __lookup_slow+0x11c/0x220
[ 2918.417457]  ? may_delete+0x2a0/0x2a0
[ 2918.417475]  ? deref_stack_reg+0xe0/0xe0
[ 2918.417479]  ? __lookup_hash+0xb0/0xb0
[ 2918.417483]  lookup_slow+0x3e/0x60
[ 2918.417488]  walk_component+0x3ac/0x990
[ 2918.417492]  ? generic_permission+0x51/0x1e0
[ 2918.417495]  ? inode_permission+0x51/0x1d0
[ 2918.417499]  ? pick_link+0x3e0/0x3e0
[ 2918.417502]  ? link_path_walk+0x4b1/0x770
[ 2918.417513]  ? _raw_spin_lock_irqsave+0x25/0x50
[ 2918.417518]  ? walk_component+0x990/0x990
[ 2918.417522]  ? path_init+0x2e6/0x580
[ 2918.417526]  path_lookupat+0x13f/0x430
[ 2918.417531]  ? trailing_symlink+0x3a0/0x3a0
[ 2918.417534]  ? do_renameat2+0x270/0x7b0
[ 2918.417538]  ? __kasan_slab_free+0x14c/0x190
[ 2918.417541]  ? do_renameat2+0x270/0x7b0
[ 2918.417553]  ? kmem_cache_free+0x85/0x1e0
[ 2918.417558]  ? do_renameat2+0x270/0x7b0
[ 2918.417563]  filename_lookup+0x13c/0x280
[ 2918.417567]  ? filename_parentat+0x2b0/0x2b0
[ 2918.417572]  ? kasan_unpoison_shadow+0x31/0x40
[ 2918.417575]  ? kasan_kmalloc+0xa6/0xd0
[ 2918.417593]  ? strncpy_from_user+0xaa/0x1c0
[ 2918.417598]  ? getname_flags+0x101/0x2b0
[ 2918.417614]  ? path_listxattr+0x87/0x110
[ 2918.417619]  path_listxattr+0x87/0x110
[ 2918.417623]  ? listxattr+0xc0/0xc0
[ 2918.417637]  ? mm_fault_error+0x1b0/0x1b0
[ 2918.417654]  do_syscall_64+0x73/0x160
[ 2918.417660]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2918.417676] RIP: 0033:0x7f2f3a3480d7
[ 2918.417677] Code: f0 ff ff 73 01 c3 48 8b 0d be dd 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 c2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 dd 2b 00 f7 d8 64 89 01 48
[ 2918.417732] RSP: 002b:00007fff4095b7d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000c2
[ 2918.417744] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2f3a3480d7
[ 2918.417746] RDX: 0000000000000071 RSI: 00007fff4095b810 RDI: 000000000126a0c0
[ 2918.417749] RBP: 00007fff4095b890 R08: 000000000126a010 R09: 0000000000000000
[ 2918.417751] R10: 00000000000001ab R11: 0000000000000206 R12: 00000000004005e0
[ 2918.417753] R13: 00007fff4095b990 R14: 0000000000000000 R15: 0000000000000000

[ 2918.417853] Allocated by task 329:
[ 2918.418002]  kasan_kmalloc+0xa6/0xd0
[ 2918.418007]  kmem_cache_alloc+0xc8/0x1e0
[ 2918.418023]  mempool_init_node+0x194/0x230
[ 2918.418027]  mempool_init+0x12/0x20
[ 2918.418042]  bioset_init+0x2bd/0x380
[ 2918.418052]  blk_alloc_queue_node+0xe9/0x540
[ 2918.418075]  dm_create+0x2c0/0x800
[ 2918.418080]  dev_create+0xd2/0x530
[ 2918.418083]  ctl_ioctl+0x2a3/0x5b0
[ 2918.418087]  dm_ctl_ioctl+0xa/0x10
[ 2918.418092]  do_vfs_ioctl+0x13e/0x8c0
[ 2918.418095]  ksys_ioctl+0x66/0x70
[ 2918.418098]  __x64_sys_ioctl+0x3d/0x50
[ 2918.418102]  do_syscall_64+0x73/0x160
[ 2918.418106]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 2918.418204] Freed by task 0:
[ 2918.418301] (stack is not available)

[ 2918.418521] The buggy address belongs to the object at ffff88018efc0000
                which belongs to the cache biovec-max of size 8192
[ 2918.418894] The buggy address is located 104 bytes to the right of
                8192-byte region [ffff88018efc0000, ffff88018efc2000)
[ 2918.419257] The buggy address belongs to the page:
[ 2918.419431] page:ffffea00063bf000 count:1 mapcount:0 mapping:ffff8801f2242540 index:0x0 compound_mapcount: 0
[ 2918.419702] flags: 0x17fff8000008100(slab|head)
[ 2918.419879] raw: 017fff8000008100 dead000000000100 dead000000000200 ffff8801f2242540
[ 2918.420101] raw: 0000000000000000 0000000000030003 00000001ffffffff 0000000000000000
[ 2918.420322] page dumped because: kasan: bad access detected

[ 2918.420599] Memory state around the buggy address:
[ 2918.420764]  ffff88018efc1f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2918.420975]  ffff88018efc1f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2918.421194] >ffff88018efc2000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 2918.421406]                                                           ^
[ 2918.421627]  ffff88018efc2080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 2918.421838]  ffff88018efc2100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2918.422046] ==================================================================
[ 2918.422264] Disabling lock debugging due to kernel taint
[ 2923.901641] BUG: unable to handle kernel paging request at ffff88018f0db000
[ 2923.901884] PGD 22226a067 P4D 22226a067 PUD 222273067 PMD 18e642063 PTE 800000018f0db061
[ 2923.902120] Oops: 0003 [#1] SMP KASAN PTI
[ 2923.902274] CPU: 1 PID: 1231 Comm: umount Tainted: G    B             4.17.0+ #1
[ 2923.902490] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 2923.902761] RIP: 0010:__memset+0x24/0x30
[ 2923.902906] Code: 90 90 90 90 90 90 66 66 90 66 90 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
[ 2923.903446] RSP: 0018:ffff88018ddf7ae0 EFLAGS: 00010206
[ 2923.903622] RAX: 0000000000000000 RBX: ffff8801d549d888 RCX: 1ffffffffffdaffb
[ 2923.903833] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88018f0daffc
[ 2923.904062] RBP: ffff88018efc206c R08: 1ffff10031df840d R09: ffff88018efc206c
[ 2923.904273] R10: ffffffffffffe1ee R11: ffffed0031df65fa R12: 0000000000000000
[ 2923.904485] R13: ffff8801d549dc98 R14: 00000000ffffc3db R15: ffffea00063bec80
[ 2923.904693] FS:  00007fa8b2f8a840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 2923.904937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2923.910080] CR2: ffff88018f0db000 CR3: 000000018f892000 CR4: 00000000000006e0
[ 2923.914930] Call Trace:
[ 2923.919724]  f2fs_truncate_inline_inode+0x114/0x170
[ 2923.924487]  f2fs_truncate_blocks+0x11b/0x7c0
[ 2923.929178]  ? f2fs_truncate_data_blocks+0x10/0x10
[ 2923.933834]  ? dqget+0x670/0x670
[ 2923.938437]  ? f2fs_destroy_extent_tree+0xd6/0x270
[ 2923.943107]  ? __radix_tree_lookup+0x2f/0x150
[ 2923.947772]  f2fs_truncate+0xd4/0x1a0
[ 2923.952491]  f2fs_evict_inode+0x5ab/0x610
[ 2923.957204]  evict+0x15f/0x280
[ 2923.961898]  __dentry_kill+0x161/0x250
[ 2923.966634]  shrink_dentry_list+0xf3/0x250
[ 2923.971897]  shrink_dcache_parent+0xa9/0x100
[ 2923.976561]  ? shrink_dcache_sb+0x1f0/0x1f0
[ 2923.981177]  ? wait_for_completion+0x8a/0x210
[ 2923.985781]  ? migrate_swap_stop+0x2d0/0x2d0
[ 2923.990332]  do_one_tree+0xe/0x40
[ 2923.994735]  shrink_dcache_for_umount+0x3a/0xa0
[ 2923.999077]  generic_shutdown_super+0x3e/0x1c0
[ 2924.003350]  kill_block_super+0x4b/0x70
[ 2924.007619]  deactivate_locked_super+0x65/0x90
[ 2924.011812]  cleanup_mnt+0x5c/0xa0
[ 2924.015995]  task_work_run+0xce/0xf0
[ 2924.020174]  exit_to_usermode_loop+0x115/0x120
[ 2924.024293]  do_syscall_64+0x12f/0x160
[ 2924.028479]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2924.032709] RIP: 0033:0x7fa8b2868487
[ 2924.036888] Code: 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 c9 2b 00 f7 d8 64 89 01 48
[ 2924.045750] RSP: 002b:00007ffc39824d58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2924.050190] RAX: 0000000000000000 RBX: 00000000008ea030 RCX: 00007fa8b2868487
[ 2924.054604] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000008f4360
[ 2924.058940] RBP: 00000000008f4360 R08: 0000000000000000 R09: 0000000000000014
[ 2924.063186] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007fa8b2d7183c
[ 2924.067418] R13: 0000000000000000 R14: 00000000008ea210 R15: 00007ffc39824fe0
[ 2924.071534] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 2924.098044] CR2: ffff88018f0db000
[ 2924.102520] ---[ end trace a8e0d899985faf31 ]---
[ 2924.107012] RIP: 0010:__memset+0x24/0x30
[ 2924.111448] Code: 90 90 90 90 90 90 66 66 90 66 90 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
[ 2924.120724] RSP: 0018:ffff88018ddf7ae0 EFLAGS: 00010206
[ 2924.125312] RAX: 0000000000000000 RBX: ffff8801d549d888 RCX: 1ffffffffffdaffb
[ 2924.129931] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88018f0daffc
[ 2924.134537] RBP: ffff88018efc206c R08: 1ffff10031df840d R09: ffff88018efc206c
[ 2924.139175] R10: ffffffffffffe1ee R11: ffffed0031df65fa R12: 0000000000000000
[ 2924.143825] R13: ffff8801d549dc98 R14: 00000000ffffc3db R15: ffffea00063bec80
[ 2924.148500] FS:  00007fa8b2f8a840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 2924.153247] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2924.158003] CR2: ffff88018f0db000 CR3: 000000018f892000 CR4: 00000000000006e0
[ 2924.164641] BUG: Bad rss-counter state mm:00000000fa04621e idx:0 val:4
[ 2924.170007] BUG: Bad rss-counter
tate mm:00000000fa04621e idx:1 val:2

- Location
https://elixir.bootlin.com/linux/v4.18-rc3/source/fs/f2fs/inline.c#L78
	memset(addr + from, 0, MAX_INLINE_DATA(inode) - from);
Here the length can be negative.

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlong Song
66415cee3d f2fs: blk_finish_plug of submit_bio in lfs mode
Expand the blk_finish_plug action from blkzoned to normal lfs mode,
since plug will cause the out-of-order IO submission, which is not
friendly to flash in lfs mode.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlong Song
3611ce9911 f2fs: do not set free of current section
For the case when sbi->segs_per_sec > 1, take section:segment = 5 for
example, if segment 1 is just used and allocate new segment 2, and the
blocks of segment 1 is invalidated, at this time, the previous code will
use __set_test_and_free to free the free_secmap and free_sections++,
this is not correct since it is still a current section, so fix it.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Daniel Rosenberg
36b877af79 f2fs: Keep alloc_valid_block_count in sync
If we attempt to request more blocks than we have room for, we try to
instead request as much as we can, however, alloc_valid_block_count
is not decremented to match the new value, allowing it to drift higher
until the next checkpoint. This always decrements it when the requested
amount cannot be fulfilled.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
20ee438232 f2fs: issue small discard by LBA order
For small granularity discard which size is smaller than 64KB, if we
issue those kind of discards orderly by size, their IOs will be spread
into entire logical address, so that in FTL, L2P table will be updated
randomly, result bad wear rate in the table.

In this patch, we choose to issue small discard by LBA order, by this
way, we can expect that L2P table updates from adjacent discard IOs can
be merged in the cache, so it can reduce lifetime wearing of flash.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
522d1711d6 f2fs: stop issuing discard immediately if there is queued IO
For background discard policy, even if there is queued user IO, still
we will check max_requests times for next discard entry, it is unneeded,
let's just stop this round submission immediately.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
4c6b56c002 f2fs: clean up with IS_INODE()
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
2482c4325d f2fs: detect bug_on in f2fs_wait_discard_bios
Add bug_on to detect potential non-empty discard wait list.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Randy Dunlap
cb15d1e43d f2fs: fix defined but not used build warnings
Fix build warnings in f2fs when CONFIG_PROC_FS is not enabled
by marking the unused functions as __maybe_unused.

../fs/f2fs/sysfs.c:519:12: warning: 'segment_info_seq_show' defined but not used [-Wunused-function]
../fs/f2fs/sysfs.c:546:12: warning: 'segment_bits_seq_show' defined but not used [-Wunused-function]
../fs/f2fs/sysfs.c:570:12: warning: 'iostat_info_seq_show' defined but not used [-Wunused-function]

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
a39e536583 f2fs: enable real-time discard by default
f2fs is focused on flash based storage, so let's enable real-time
discard by default, if user don't want to enable it, 'nodiscard'
mount option should be used on mount.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
82902c06bd f2fs: fix to detect looped node chain correctly
Below dmesg was printed when testing generic/388 of fstest:

F2FS-fs (zram1): find_fsync_dnodes: detect looped node chain, blkaddr:526615, next:526616
F2FS-fs (zram1): Cannot recover all fsync data errno=-22
F2FS-fs (zram1): Mounted with checkpoint version = 22300d0e
F2FS-fs (zram1): find_fsync_dnodes: detect looped node chain, blkaddr:526615, next:526616
F2FS-fs (zram1): Cannot recover all fsync data errno=-22

The reason is that we initialize free_blocks with free blocks of
filesystem, so if filesystem is full, free_blocks can be zero,
below condition will be true, so that, it will fail recovery.

if (++loop_cnt >= free_blocks ||
	blkaddr == next_blkaddr_of_node(page))

To fix this issue, initialize free_blocks with correct value which
includes over-privision blocks.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
c9b60788fc f2fs: fix to do sanity check with block address in main area
This patch add to do sanity check with below field:
- cp_pack_total_block_count
- blkaddr of data/node
- extent info

- Overview
BUG() in verify_block_addr() when writing to a corrupted f2fs image

- Reproduce (4.18 upstream kernel)

- POC (poc.c)

static void activity(char *mpoint) {

  char *foo_bar_baz;
  int err;

  static int buf[8192];
  memset(buf, 0, sizeof(buf));

  err = asprintf(&foo_bar_baz, "%s/foo/bar/baz", mpoint);

  int fd = open(foo_bar_baz, O_RDWR | O_TRUNC, 0777);
  if (fd >= 0) {
    write(fd, (char *)buf, sizeof(buf));
    fdatasync(fd);
    close(fd);
  }
}

int main(int argc, char *argv[]) {
  activity(argv[1]);
  return 0;
}

- Kernel message
[  689.349473] F2FS-fs (loop0): Mounted with checkpoint version = 3
[  699.728662] WARNING: CPU: 0 PID: 1309 at fs/f2fs/segment.c:2860 f2fs_inplace_write_data+0x232/0x240
[  699.728670] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  699.729056] CPU: 0 PID: 1309 Comm: a.out Not tainted 4.18.0-rc1+ #4
[  699.729064] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  699.729074] RIP: 0010:f2fs_inplace_write_data+0x232/0x240
[  699.729076] Code: ff e9 cf fe ff ff 49 8d 7d 10 e8 39 45 ad ff 4d 8b 7d 10 be 04 00 00 00 49 8d 7f 48 e8 07 49 ad ff 45 8b 7f 48 e9 fb fe ff ff <0f> 0b f0 41 80 4d 48 04 e9 65 fe ff ff 90 66 66 66 66 90 55 48 8d
[  699.729130] RSP: 0018:ffff8801f43af568 EFLAGS: 00010202
[  699.729139] RAX: 000000000000003f RBX: ffff8801f43af7b8 RCX: ffffffffb88c9113
[  699.729142] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff8802024e5540
[  699.729144] RBP: ffff8801f43af590 R08: 0000000000000009 R09: ffffffffffffffe8
[  699.729147] R10: 0000000000000001 R11: ffffed0039b0596a R12: ffff8802024e5540
[  699.729149] R13: ffff8801f0335500 R14: ffff8801e3e7a700 R15: ffff8801e1ee4450
[  699.729154] FS:  00007f9bf97f5700(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  699.729156] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  699.729159] CR2: 00007f9bf925d170 CR3: 00000001f0c34000 CR4: 00000000000006f0
[  699.729171] Call Trace:
[  699.729192]  f2fs_do_write_data_page+0x2e2/0xe00
[  699.729203]  ? f2fs_should_update_outplace+0xd0/0xd0
[  699.729238]  ? memcg_drain_all_list_lrus+0x280/0x280
[  699.729269]  ? __radix_tree_replace+0xa3/0x120
[  699.729276]  __write_data_page+0x5c7/0xe30
[  699.729291]  ? kasan_check_read+0x11/0x20
[  699.729310]  ? page_mapped+0x8a/0x110
[  699.729321]  ? page_mkclean+0xe9/0x160
[  699.729327]  ? f2fs_do_write_data_page+0xe00/0xe00
[  699.729331]  ? invalid_page_referenced_vma+0x130/0x130
[  699.729345]  ? clear_page_dirty_for_io+0x332/0x450
[  699.729351]  f2fs_write_cache_pages+0x4ca/0x860
[  699.729358]  ? __write_data_page+0xe30/0xe30
[  699.729374]  ? percpu_counter_add_batch+0x22/0xa0
[  699.729380]  ? kasan_check_write+0x14/0x20
[  699.729391]  ? _raw_spin_lock+0x17/0x40
[  699.729403]  ? f2fs_mark_inode_dirty_sync.part.18+0x16/0x30
[  699.729413]  ? iov_iter_advance+0x113/0x640
[  699.729418]  ? f2fs_write_end+0x133/0x2e0
[  699.729423]  ? balance_dirty_pages_ratelimited+0x239/0x640
[  699.729428]  f2fs_write_data_pages+0x329/0x520
[  699.729433]  ? generic_perform_write+0x250/0x320
[  699.729438]  ? f2fs_write_cache_pages+0x860/0x860
[  699.729454]  ? current_time+0x110/0x110
[  699.729459]  ? f2fs_preallocate_blocks+0x1ef/0x370
[  699.729464]  do_writepages+0x37/0xb0
[  699.729468]  ? f2fs_write_cache_pages+0x860/0x860
[  699.729472]  ? do_writepages+0x37/0xb0
[  699.729478]  __filemap_fdatawrite_range+0x19a/0x1f0
[  699.729483]  ? delete_from_page_cache_batch+0x4e0/0x4e0
[  699.729496]  ? __vfs_write+0x2b2/0x410
[  699.729501]  file_write_and_wait_range+0x66/0xb0
[  699.729506]  f2fs_do_sync_file+0x1f9/0xd90
[  699.729511]  ? truncate_partial_data_page+0x290/0x290
[  699.729521]  ? __sb_end_write+0x30/0x50
[  699.729526]  ? vfs_write+0x20f/0x260
[  699.729530]  f2fs_sync_file+0x9a/0xb0
[  699.729534]  ? f2fs_do_sync_file+0xd90/0xd90
[  699.729548]  vfs_fsync_range+0x68/0x100
[  699.729554]  ? __fget_light+0xc9/0xe0
[  699.729558]  do_fsync+0x3d/0x70
[  699.729562]  __x64_sys_fdatasync+0x24/0x30
[  699.729585]  do_syscall_64+0x78/0x170
[  699.729595]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  699.729613] RIP: 0033:0x7f9bf930d800
[  699.729615] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 49 bf 2c 00 00 75 10 b8 4b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be 78 01 00 48 89 04 24
[  699.729668] RSP: 002b:00007ffee3606c68 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[  699.729673] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9bf930d800
[  699.729675] RDX: 0000000000008000 RSI: 00000000006010a0 RDI: 0000000000000003
[  699.729678] RBP: 00007ffee3606ca0 R08: 0000000001503010 R09: 0000000000000000
[  699.729680] R10: 00000000000002e8 R11: 0000000000000246 R12: 0000000000400610
[  699.729683] R13: 00007ffee3606da0 R14: 0000000000000000 R15: 0000000000000000
[  699.729687] ---[ end trace 4ce02f25ff7d3df5 ]---
[  699.729782] ------------[ cut here ]------------
[  699.729785] kernel BUG at fs/f2fs/segment.h:654!
[  699.731055] invalid opcode: 0000 [#1] SMP KASAN PTI
[  699.732104] CPU: 0 PID: 1309 Comm: a.out Tainted: G        W         4.18.0-rc1+ #4
[  699.733684] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  699.735611] RIP: 0010:f2fs_submit_page_bio+0x29b/0x730
[  699.736649] Code: 54 49 8d bd 18 04 00 00 e8 b2 59 af ff 41 8b 8d 18 04 00 00 8b 45 b8 41 d3 e6 44 01 f0 4c 8d 73 14 41 39 c7 0f 82 37 fe ff ff <0f> 0b 65 8b 05 2c 04 77 47 89 c0 48 0f a3 05 52 c1 d5 01 0f 92 c0
[  699.740524] RSP: 0018:ffff8801f43af508 EFLAGS: 00010283
[  699.741573] RAX: 0000000000000000 RBX: ffff8801f43af7b8 RCX: ffffffffb88a7cef
[  699.743006] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8801e3e7a64c
[  699.744426] RBP: ffff8801f43af558 R08: ffffed003e066b55 R09: ffffed003e066b55
[  699.745833] R10: 0000000000000001 R11: ffffed003e066b54 R12: ffffea0007876940
[  699.747256] R13: ffff8801f0335500 R14: ffff8801e3e7a600 R15: 0000000000000001
[  699.748683] FS:  00007f9bf97f5700(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  699.750293] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  699.751462] CR2: 00007f9bf925d170 CR3: 00000001f0c34000 CR4: 00000000000006f0
[  699.752874] Call Trace:
[  699.753386]  ? f2fs_inplace_write_data+0x93/0x240
[  699.754341]  f2fs_inplace_write_data+0xd2/0x240
[  699.755271]  f2fs_do_write_data_page+0x2e2/0xe00
[  699.756214]  ? f2fs_should_update_outplace+0xd0/0xd0
[  699.757215]  ? memcg_drain_all_list_lrus+0x280/0x280
[  699.758209]  ? __radix_tree_replace+0xa3/0x120
[  699.759164]  __write_data_page+0x5c7/0xe30
[  699.760002]  ? kasan_check_read+0x11/0x20
[  699.760823]  ? page_mapped+0x8a/0x110
[  699.761573]  ? page_mkclean+0xe9/0x160
[  699.762345]  ? f2fs_do_write_data_page+0xe00/0xe00
[  699.763332]  ? invalid_page_referenced_vma+0x130/0x130
[  699.764374]  ? clear_page_dirty_for_io+0x332/0x450
[  699.765347]  f2fs_write_cache_pages+0x4ca/0x860
[  699.766276]  ? __write_data_page+0xe30/0xe30
[  699.767161]  ? percpu_counter_add_batch+0x22/0xa0
[  699.768112]  ? kasan_check_write+0x14/0x20
[  699.768951]  ? _raw_spin_lock+0x17/0x40
[  699.769739]  ? f2fs_mark_inode_dirty_sync.part.18+0x16/0x30
[  699.770885]  ? iov_iter_advance+0x113/0x640
[  699.771743]  ? f2fs_write_end+0x133/0x2e0
[  699.772569]  ? balance_dirty_pages_ratelimited+0x239/0x640
[  699.773680]  f2fs_write_data_pages+0x329/0x520
[  699.774603]  ? generic_perform_write+0x250/0x320
[  699.775544]  ? f2fs_write_cache_pages+0x860/0x860
[  699.776510]  ? current_time+0x110/0x110
[  699.777299]  ? f2fs_preallocate_blocks+0x1ef/0x370
[  699.778279]  do_writepages+0x37/0xb0
[  699.779026]  ? f2fs_write_cache_pages+0x860/0x860
[  699.779978]  ? do_writepages+0x37/0xb0
[  699.780755]  __filemap_fdatawrite_range+0x19a/0x1f0
[  699.781746]  ? delete_from_page_cache_batch+0x4e0/0x4e0
[  699.782820]  ? __vfs_write+0x2b2/0x410
[  699.783597]  file_write_and_wait_range+0x66/0xb0
[  699.784540]  f2fs_do_sync_file+0x1f9/0xd90
[  699.785381]  ? truncate_partial_data_page+0x290/0x290
[  699.786415]  ? __sb_end_write+0x30/0x50
[  699.787204]  ? vfs_write+0x20f/0x260
[  699.787941]  f2fs_sync_file+0x9a/0xb0
[  699.788694]  ? f2fs_do_sync_file+0xd90/0xd90
[  699.789572]  vfs_fsync_range+0x68/0x100
[  699.790360]  ? __fget_light+0xc9/0xe0
[  699.791128]  do_fsync+0x3d/0x70
[  699.791779]  __x64_sys_fdatasync+0x24/0x30
[  699.792614]  do_syscall_64+0x78/0x170
[  699.793371]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  699.794406] RIP: 0033:0x7f9bf930d800
[  699.795134] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 49 bf 2c 00 00 75 10 b8 4b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be 78 01 00 48 89 04 24
[  699.798960] RSP: 002b:00007ffee3606c68 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[  699.800483] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9bf930d800
[  699.801923] RDX: 0000000000008000 RSI: 00000000006010a0 RDI: 0000000000000003
[  699.803373] RBP: 00007ffee3606ca0 R08: 0000000001503010 R09: 0000000000000000
[  699.804798] R10: 00000000000002e8 R11: 0000000000000246 R12: 0000000000400610
[  699.806233] R13: 00007ffee3606da0 R14: 0000000000000000 R15: 0000000000000000
[  699.807667] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  699.817079] ---[ end trace 4ce02f25ff7d3df6 ]---
[  699.818068] RIP: 0010:f2fs_submit_page_bio+0x29b/0x730
[  699.819114] Code: 54 49 8d bd 18 04 00 00 e8 b2 59 af ff 41 8b 8d 18 04 00 00 8b 45 b8 41 d3 e6 44 01 f0 4c 8d 73 14 41 39 c7 0f 82 37 fe ff ff <0f> 0b 65 8b 05 2c 04 77 47 89 c0 48 0f a3 05 52 c1 d5 01 0f 92 c0
[  699.822919] RSP: 0018:ffff8801f43af508 EFLAGS: 00010283
[  699.823977] RAX: 0000000000000000 RBX: ffff8801f43af7b8 RCX: ffffffffb88a7cef
[  699.825436] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8801e3e7a64c
[  699.826881] RBP: ffff8801f43af558 R08: ffffed003e066b55 R09: ffffed003e066b55
[  699.828292] R10: 0000000000000001 R11: ffffed003e066b54 R12: ffffea0007876940
[  699.829750] R13: ffff8801f0335500 R14: ffff8801e3e7a600 R15: 0000000000000001
[  699.831192] FS:  00007f9bf97f5700(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  699.832793] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  699.833981] CR2: 00007f9bf925d170 CR3: 00000001f0c34000 CR4: 00000000000006f0
[  699.835556] ==================================================================
[  699.837029] BUG: KASAN: stack-out-of-bounds in update_stack_state+0x38c/0x3e0
[  699.838462] Read of size 8 at addr ffff8801f43af970 by task a.out/1309

[  699.840086] CPU: 0 PID: 1309 Comm: a.out Tainted: G      D W         4.18.0-rc1+ #4
[  699.841603] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  699.843475] Call Trace:
[  699.843982]  dump_stack+0x7b/0xb5
[  699.844661]  print_address_description+0x70/0x290
[  699.845607]  kasan_report+0x291/0x390
[  699.846351]  ? update_stack_state+0x38c/0x3e0
[  699.853831]  __asan_load8+0x54/0x90
[  699.854569]  update_stack_state+0x38c/0x3e0
[  699.855428]  ? __read_once_size_nocheck.constprop.7+0x20/0x20
[  699.856601]  ? __save_stack_trace+0x5e/0x100
[  699.857476]  unwind_next_frame.part.5+0x18e/0x490
[  699.858448]  ? unwind_dump+0x290/0x290
[  699.859217]  ? clear_page_dirty_for_io+0x332/0x450
[  699.860185]  __unwind_start+0x106/0x190
[  699.860974]  __save_stack_trace+0x5e/0x100
[  699.861808]  ? __save_stack_trace+0x5e/0x100
[  699.862691]  ? unlink_anon_vmas+0xba/0x2c0
[  699.863525]  save_stack_trace+0x1f/0x30
[  699.864312]  save_stack+0x46/0xd0
[  699.864993]  ? __alloc_pages_slowpath+0x1420/0x1420
[  699.865990]  ? flush_tlb_mm_range+0x15e/0x220
[  699.866889]  ? kasan_check_write+0x14/0x20
[  699.867724]  ? __dec_node_state+0x92/0xb0
[  699.868543]  ? lock_page_memcg+0x85/0xf0
[  699.869350]  ? unlock_page_memcg+0x16/0x80
[  699.870185]  ? page_remove_rmap+0x198/0x520
[  699.871048]  ? mark_page_accessed+0x133/0x200
[  699.871930]  ? _cond_resched+0x1a/0x50
[  699.872700]  ? unmap_page_range+0xcd4/0xe50
[  699.873551]  ? rb_next+0x58/0x80
[  699.874217]  ? rb_next+0x58/0x80
[  699.874895]  __kasan_slab_free+0x13c/0x1a0
[  699.875734]  ? unlink_anon_vmas+0xba/0x2c0
[  699.876563]  kasan_slab_free+0xe/0x10
[  699.877315]  kmem_cache_free+0x89/0x1e0
[  699.878095]  unlink_anon_vmas+0xba/0x2c0
[  699.878913]  free_pgtables+0x101/0x1b0
[  699.879677]  exit_mmap+0x146/0x2a0
[  699.880378]  ? __ia32_sys_munmap+0x50/0x50
[  699.881214]  ? kasan_check_read+0x11/0x20
[  699.882052]  ? mm_update_next_owner+0x322/0x380
[  699.882985]  mmput+0x8b/0x1d0
[  699.883602]  do_exit+0x43a/0x1390
[  699.884288]  ? mm_update_next_owner+0x380/0x380
[  699.885212]  ? f2fs_sync_file+0x9a/0xb0
[  699.885995]  ? f2fs_do_sync_file+0xd90/0xd90
[  699.886877]  ? vfs_fsync_range+0x68/0x100
[  699.887694]  ? __fget_light+0xc9/0xe0
[  699.888442]  ? do_fsync+0x3d/0x70
[  699.889118]  ? __x64_sys_fdatasync+0x24/0x30
[  699.889996]  rewind_stack_do_exit+0x17/0x20
[  699.890860] RIP: 0033:0x7f9bf930d800
[  699.891585] Code: Bad RIP value.
[  699.892268] RSP: 002b:00007ffee3606c68 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[  699.893781] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9bf930d800
[  699.895220] RDX: 0000000000008000 RSI: 00000000006010a0 RDI: 0000000000000003
[  699.896643] RBP: 00007ffee3606ca0 R08: 0000000001503010 R09: 0000000000000000
[  699.898069] R10: 00000000000002e8 R11: 0000000000000246 R12: 0000000000400610
[  699.899505] R13: 00007ffee3606da0 R14: 0000000000000000 R15: 0000000000000000

[  699.901241] The buggy address belongs to the page:
[  699.902215] page:ffffea0007d0ebc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[  699.903811] flags: 0x2ffff0000000000()
[  699.904585] raw: 02ffff0000000000 0000000000000000 ffffffff07d00101 0000000000000000
[  699.906125] raw: 0000000000000000 0000000000240000 00000000ffffffff 0000000000000000
[  699.907673] page dumped because: kasan: bad access detected

[  699.909108] Memory state around the buggy address:
[  699.910077]  ffff8801f43af800: 00 f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00 00
[  699.911528]  ffff8801f43af880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  699.912953] >ffff8801f43af900: 00 00 00 00 00 00 00 00 f1 01 f4 f4 f4 f2 f2 f2
[  699.914392]                                                              ^
[  699.915758]  ffff8801f43af980: f2 00 f4 f4 00 00 00 00 f2 00 00 00 00 00 00 00
[  699.917193]  ffff8801f43afa00: 00 00 00 00 00 00 00 00 00 f3 f3 f3 00 00 00 00
[  699.918634] ==================================================================

- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/f2fs/segment.h#L644

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:32 -07:00
Chao Yu
10d255c354 f2fs: fix to skip GC if type in SSA and SIT is inconsistent
If segment type in SSA and SIT is inconsistent, we will encounter below
BUG_ON during GC, to avoid this panic, let's just skip doing GC on such
segment.

The bug is triggered with image reported in below link:

https://bugzilla.kernel.org/show_bug.cgi?id=200223

[  388.060262] ------------[ cut here ]------------
[  388.060268] kernel BUG at /home/y00370721/git/devf2fs/gc.c:989!
[  388.061172] invalid opcode: 0000 [#1] SMP
[  388.061773] Modules linked in: f2fs(O) bluetooth ecdh_generic xt_tcpudp iptable_filter ip_tables x_tables lp ttm drm_kms_helper drm intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel fb_sys_fops ppdev aes_x86_64 syscopyarea crypto_simd sysfillrect parport_pc joydev sysimgblt glue_helper parport cryptd i2c_piix4 serio_raw mac_hid btrfs hid_generic usbhid hid raid6_pq psmouse pata_acpi floppy
[  388.064247] CPU: 7 PID: 4151 Comm: f2fs_gc-7:0 Tainted: G           O    4.13.0-rc1+ #26
[  388.065306] Hardware name: Xen HVM domU, BIOS 4.1.2_115-900.260_ 11/06/2015
[  388.066058] task: ffff880201583b80 task.stack: ffffc90004d7c000
[  388.069948] RIP: 0010:do_garbage_collect+0xcc8/0xcd0 [f2fs]
[  388.070766] RSP: 0018:ffffc90004d7fc68 EFLAGS: 00010202
[  388.071783] RAX: ffff8801ed227000 RBX: 0000000000000001 RCX: ffffea0007b489c0
[  388.072700] RDX: ffff880000000000 RSI: 0000000000000001 RDI: ffffea0007b489c0
[  388.073607] RBP: ffffc90004d7fd58 R08: 0000000000000003 R09: ffffea0007b489dc
[  388.074619] R10: 0000000000000000 R11: 0052782ab317138d R12: 0000000000000018
[  388.075625] R13: 0000000000000018 R14: ffff880211ceb000 R15: ffff880211ceb000
[  388.076687] FS:  0000000000000000(0000) GS:ffff880214fc0000(0000) knlGS:0000000000000000
[  388.083277] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  388.084536] CR2: 0000000000e18c60 CR3: 00000001ecf2e000 CR4: 00000000001406e0
[  388.085748] Call Trace:
[  388.086690]  ? find_next_bit+0xb/0x10
[  388.088091]  f2fs_gc+0x1a8/0x9d0 [f2fs]
[  388.088888]  ? lock_timer_base+0x7d/0xa0
[  388.090213]  ? try_to_del_timer_sync+0x44/0x60
[  388.091698]  gc_thread_func+0x342/0x4b0 [f2fs]
[  388.092892]  ? wait_woken+0x80/0x80
[  388.094098]  kthread+0x109/0x140
[  388.095010]  ? f2fs_gc+0x9d0/0x9d0 [f2fs]
[  388.096043]  ? kthread_park+0x60/0x60
[  388.097281]  ret_from_fork+0x25/0x30
[  388.098401] Code: ff ff 48 83 e8 01 48 89 44 24 58 e9 27 f8 ff ff 48 83 e8 01 e9 78 fc ff ff 48 8d 78 ff e9 17 fb ff ff 48 83 ef 01 e9 4d f4 ff ff <0f> 0b 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 55
[  388.100864] RIP: do_garbage_collect+0xcc8/0xcd0 [f2fs] RSP: ffffc90004d7fc68
[  388.101810] ---[ end trace 81c73d6e6b7da61d ]---

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Chao Yu
4b270a8cc5 f2fs: try grabbing node page lock aggressively in sync scenario
In synchronous scenario, like in checkpoint(), we are going to flush
dirty node pages to device synchronously, we can easily failed
writebacking node page due to trylock_page() failure, especially in
condition of intensive lock competition, which can cause long latency
of checkpoint(). So let's use lock_page() in synchronous scenario to
avoid this issue.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Sahitya Tummala
dc1328027b f2fs: show the fsync_mode=nobarrier mount option
This patch shows the fsync_mode=nobarrier mount option in
f2fs_show_options().

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Yunlei He
68c43a235e f2fs: check the right return value of memory alloc function
This patch check the right return value of memory alloc function

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Guenter Roeck
b138547818 f2fs: Replace strncpy with memcpy
gcc 8.1.0 complains:

fs/f2fs/namei.c: In function 'f2fs_update_extension_list':
fs/f2fs/namei.c:257:3: warning:
	'strncpy' output truncated before terminating nul copying
	as many bytes from a string as its length
fs/f2fs/namei.c:249:3: warning:
	'strncpy' output truncated before terminating nul copying
	as many bytes from a string as its length

Using strncpy() is indeed less than perfect since the length of data to
be copied has already been determined with strlen(). Replace strncpy()
with memcpy() to address the warning and optimize the code a little.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Gao Xiang
2d3a58566f f2fs: avoid the global name 'fault_name'
Non-prefix global name 'fault_name' will pollute global
namespace, fix it.

Refer to:
https://lists.01.org/pipermail/kbuild-all/2018-June/049660.html

To: Jaegeuk Kim <jaegeuk@kernel.org>
To: Chao Yu <yuchao0@huawei.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: linux-kernel@vger.kernel.org
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Chao Yu
4dbe38dc38 f2fs: fix to do sanity check with reserved blkaddr of inline inode
As Wen Xu reported in bugzilla, after image was injected with random data
by fuzzing, inline inode would contain invalid reserved blkaddr, then
during inline conversion, we will encounter illegal memory accessing
reported by KASAN, the root cause of this is when writing out converted
inline page, we will use invalid reserved blkaddr to update sit bitmap,
result in accessing memory beyond sit bitmap boundary.

In order to fix this issue, let's do sanity check with reserved block
address of inline inode to avoid above condition.

https://bugzilla.kernel.org/show_bug.cgi?id=200179

[ 1428.846352] BUG: KASAN: use-after-free in update_sit_entry+0x80/0x7f0
[ 1428.846618] Read of size 4 at addr ffff880194483540 by task a.out/2741

[ 1428.846855] CPU: 0 PID: 2741 Comm: a.out Tainted: G        W         4.17.0+ #1
[ 1428.846858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 1428.846860] Call Trace:
[ 1428.846868]  dump_stack+0x71/0xab
[ 1428.846875]  print_address_description+0x6b/0x290
[ 1428.846881]  kasan_report+0x28e/0x390
[ 1428.846888]  ? update_sit_entry+0x80/0x7f0
[ 1428.846898]  update_sit_entry+0x80/0x7f0
[ 1428.846906]  f2fs_allocate_data_block+0x6db/0xc70
[ 1428.846914]  ? f2fs_get_node_info+0x14f/0x590
[ 1428.846920]  do_write_page+0xc8/0x150
[ 1428.846928]  f2fs_outplace_write_data+0xfe/0x210
[ 1428.846935]  ? f2fs_do_write_node_page+0x170/0x170
[ 1428.846941]  ? radix_tree_tag_clear+0xff/0x130
[ 1428.846946]  ? __mod_node_page_state+0x22/0xa0
[ 1428.846951]  ? inc_zone_page_state+0x54/0x100
[ 1428.846956]  ? __test_set_page_writeback+0x336/0x5d0
[ 1428.846964]  f2fs_convert_inline_page+0x407/0x6d0
[ 1428.846971]  ? f2fs_read_inline_data+0x3b0/0x3b0
[ 1428.846978]  ? __get_node_page+0x335/0x6b0
[ 1428.846987]  f2fs_convert_inline_inode+0x41b/0x500
[ 1428.846994]  ? f2fs_convert_inline_page+0x6d0/0x6d0
[ 1428.847000]  ? kasan_unpoison_shadow+0x31/0x40
[ 1428.847005]  ? kasan_kmalloc+0xa6/0xd0
[ 1428.847024]  f2fs_file_mmap+0x79/0xc0
[ 1428.847029]  mmap_region+0x58b/0x880
[ 1428.847037]  ? arch_get_unmapped_area+0x370/0x370
[ 1428.847042]  do_mmap+0x55b/0x7a0
[ 1428.847048]  vm_mmap_pgoff+0x16f/0x1c0
[ 1428.847055]  ? vma_is_stack_for_current+0x50/0x50
[ 1428.847062]  ? __fsnotify_update_child_dentry_flags.part.1+0x160/0x160
[ 1428.847068]  ? do_sys_open+0x206/0x2a0
[ 1428.847073]  ? __fget+0xb4/0x100
[ 1428.847079]  ksys_mmap_pgoff+0x278/0x360
[ 1428.847085]  ? find_mergeable_anon_vma+0x50/0x50
[ 1428.847091]  do_syscall_64+0x73/0x160
[ 1428.847098]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1428.847102] RIP: 0033:0x7fb1430766ba
[ 1428.847103] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da 4d 89 f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
[ 1428.847162] RSP: 002b:00007ffc651d9388 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
[ 1428.847167] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fb1430766ba
[ 1428.847170] RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
[ 1428.847173] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000000
[ 1428.847176] R10: 0000000000008002 R11: 0000000000000246 R12: 0000000000000000
[ 1428.847179] R13: 0000000000001000 R14: 0000000000008002 R15: 0000000000000000

[ 1428.847252] Allocated by task 2683:
[ 1428.847372]  kasan_kmalloc+0xa6/0xd0
[ 1428.847380]  kmem_cache_alloc+0xc8/0x1e0
[ 1428.847385]  getname_flags+0x73/0x2b0
[ 1428.847390]  user_path_at_empty+0x1d/0x40
[ 1428.847395]  vfs_statx+0xc1/0x150
[ 1428.847401]  __do_sys_newlstat+0x7e/0xd0
[ 1428.847405]  do_syscall_64+0x73/0x160
[ 1428.847411]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 1428.847466] Freed by task 2683:
[ 1428.847566]  __kasan_slab_free+0x137/0x190
[ 1428.847571]  kmem_cache_free+0x85/0x1e0
[ 1428.847575]  filename_lookup+0x191/0x280
[ 1428.847580]  vfs_statx+0xc1/0x150
[ 1428.847585]  __do_sys_newlstat+0x7e/0xd0
[ 1428.847590]  do_syscall_64+0x73/0x160
[ 1428.847596]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 1428.847648] The buggy address belongs to the object at ffff880194483300
                which belongs to the cache names_cache of size 4096
[ 1428.847946] The buggy address is located 576 bytes inside of
                4096-byte region [ffff880194483300, ffff880194484300)
[ 1428.848234] The buggy address belongs to the page:
[ 1428.848366] page:ffffea0006512000 count:1 mapcount:0 mapping:ffff8801f3586380 index:0x0 compound_mapcount: 0
[ 1428.848606] flags: 0x17fff8000008100(slab|head)
[ 1428.848737] raw: 017fff8000008100 dead000000000100 dead000000000200 ffff8801f3586380
[ 1428.848931] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
[ 1428.849122] page dumped because: kasan: bad access detected

[ 1428.849305] Memory state around the buggy address:
[ 1428.849436]  ffff880194483400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1428.849620]  ffff880194483480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1428.849804] >ffff880194483500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1428.849985]                                            ^
[ 1428.850120]  ffff880194483580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1428.850303]  ffff880194483600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1428.850498] ==================================================================

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Chao Yu
e34438c903 f2fs: fix to do sanity check with node footer and iblocks
This patch adds to do sanity check with below fields of inode to
avoid reported panic.
- node footer
- iblocks

https://bugzilla.kernel.org/show_bug.cgi?id=200223

- Overview
BUG() triggered in f2fs_truncate_inode_blocks() when un-mounting a mounted f2fs image after writing to it

- Reproduce

- POC (poc.c)

static void activity(char *mpoint) {

  char *foo_bar_baz;
  int err;

  static int buf[8192];
  memset(buf, 0, sizeof(buf));

  err = asprintf(&foo_bar_baz, "%s/foo/bar/baz", mpoint);

  // open / write / read
  int fd = open(foo_bar_baz, O_RDWR | O_TRUNC, 0777);
  if (fd >= 0) {
    write(fd, (char *)buf, 517);
    write(fd, (char *)buf, sizeof(buf));
    close(fd);
  }

}

int main(int argc, char *argv[]) {
  activity(argv[1]);
  return 0;
}

- Kernel meesage
[  552.479723] F2FS-fs (loop0): Mounted with checkpoint version = 2
[  556.451891] ------------[ cut here ]------------
[  556.451899] kernel BUG at fs/f2fs/node.c:987!
[  556.452920] invalid opcode: 0000 [#1] SMP KASAN PTI
[  556.453936] CPU: 1 PID: 1310 Comm: umount Not tainted 4.18.0-rc1+ #4
[  556.455213] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  556.457140] RIP: 0010:f2fs_truncate_inode_blocks+0x4a7/0x6f0
[  556.458280] Code: e8 ae ea ff ff 41 89 c7 c1 e8 1f 84 c0 74 0a 41 83 ff fe 0f 85 35 ff ff ff 81 85 b0 fe ff ff fb 03 00 00 e9 f7 fd ff ff 0f 0b <0f> 0b e8 62 b7 9a 00 48 8b bd a0 fe ff ff e8 56 54 ae ff 48 8b b5
[  556.462015] RSP: 0018:ffff8801f292f808 EFLAGS: 00010286
[  556.463068] RAX: ffffed003e73242d RBX: ffff8801f292f958 RCX: ffffffffb88b81bc
[  556.464479] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8801f3992164
[  556.465901] RBP: ffff8801f292f980 R08: ffffed003e73242d R09: ffffed003e73242d
[  556.467311] R10: 0000000000000001 R11: ffffed003e73242c R12: 00000000fffffc64
[  556.468706] R13: ffff8801f3992000 R14: 0000000000000058 R15: 00000000ffff8801
[  556.470117] FS:  00007f8029297840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  556.471702] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  556.472838] CR2: 000055f5f57305d8 CR3: 00000001f18b0000 CR4: 00000000000006e0
[  556.474265] Call Trace:
[  556.474782]  ? f2fs_alloc_nid_failed+0xf0/0xf0
[  556.475686]  ? truncate_nodes+0x980/0x980
[  556.476516]  ? pagecache_get_page+0x21f/0x2f0
[  556.477412]  ? __asan_loadN+0xf/0x20
[  556.478153]  ? __get_node_page+0x331/0x5b0
[  556.478992]  ? reweight_entity+0x1e6/0x3b0
[  556.479826]  f2fs_truncate_blocks+0x55e/0x740
[  556.480709]  ? f2fs_truncate_data_blocks+0x20/0x20
[  556.481689]  ? __radix_tree_lookup+0x34/0x160
[  556.482630]  ? radix_tree_lookup+0xd/0x10
[  556.483445]  f2fs_truncate+0xd4/0x1a0
[  556.484206]  f2fs_evict_inode+0x5ce/0x630
[  556.485032]  evict+0x16f/0x290
[  556.485664]  iput+0x280/0x300
[  556.486300]  dentry_unlink_inode+0x165/0x1e0
[  556.487169]  __dentry_kill+0x16a/0x260
[  556.487936]  dentry_kill+0x70/0x250
[  556.488651]  shrink_dentry_list+0x125/0x260
[  556.489504]  shrink_dcache_parent+0xc1/0x110
[  556.490379]  ? shrink_dcache_sb+0x200/0x200
[  556.491231]  ? bit_wait_timeout+0xc0/0xc0
[  556.492047]  do_one_tree+0x12/0x40
[  556.492743]  shrink_dcache_for_umount+0x3f/0xa0
[  556.493656]  generic_shutdown_super+0x43/0x1c0
[  556.494561]  kill_block_super+0x52/0x80
[  556.495341]  kill_f2fs_super+0x62/0x70
[  556.496105]  deactivate_locked_super+0x6f/0xa0
[  556.497004]  deactivate_super+0x5e/0x80
[  556.497785]  cleanup_mnt+0x61/0xa0
[  556.498492]  __cleanup_mnt+0x12/0x20
[  556.499218]  task_work_run+0xc8/0xf0
[  556.499949]  exit_to_usermode_loop+0x125/0x130
[  556.500846]  do_syscall_64+0x138/0x170
[  556.501609]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  556.502659] RIP: 0033:0x7f8028b77487
[  556.503384] Code: 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 c9 2b 00 f7 d8 64 89 01 48
[  556.507137] RSP: 002b:00007fff9f2e3598 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  556.508637] RAX: 0000000000000000 RBX: 0000000000ebd030 RCX: 00007f8028b77487
[  556.510069] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000ec41e0
[  556.511481] RBP: 0000000000ec41e0 R08: 0000000000000000 R09: 0000000000000014
[  556.512892] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f802908083c
[  556.514320] R13: 0000000000000000 R14: 0000000000ebd210 R15: 00007fff9f2e3820
[  556.515745] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  556.529276] ---[ end trace 4ce02f25ff7d3df5 ]---
[  556.530340] RIP: 0010:f2fs_truncate_inode_blocks+0x4a7/0x6f0
[  556.531513] Code: e8 ae ea ff ff 41 89 c7 c1 e8 1f 84 c0 74 0a 41 83 ff fe 0f 85 35 ff ff ff 81 85 b0 fe ff ff fb 03 00 00 e9 f7 fd ff ff 0f 0b <0f> 0b e8 62 b7 9a 00 48 8b bd a0 fe ff ff e8 56 54 ae ff 48 8b b5
[  556.535330] RSP: 0018:ffff8801f292f808 EFLAGS: 00010286
[  556.536395] RAX: ffffed003e73242d RBX: ffff8801f292f958 RCX: ffffffffb88b81bc
[  556.537824] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8801f3992164
[  556.539290] RBP: ffff8801f292f980 R08: ffffed003e73242d R09: ffffed003e73242d
[  556.540709] R10: 0000000000000001 R11: ffffed003e73242c R12: 00000000fffffc64
[  556.542131] R13: ffff8801f3992000 R14: 0000000000000058 R15: 00000000ffff8801
[  556.543579] FS:  00007f8029297840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  556.545180] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  556.546338] CR2: 000055f5f57305d8 CR3: 00000001f18b0000 CR4: 00000000000006e0
[  556.547809] ==================================================================
[  556.549248] BUG: KASAN: stack-out-of-bounds in arch_tlb_gather_mmu+0x52/0x170
[  556.550672] Write of size 8 at addr ffff8801f292fd10 by task umount/1310

[  556.552338] CPU: 1 PID: 1310 Comm: umount Tainted: G      D           4.18.0-rc1+ #4
[  556.553886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  556.555756] Call Trace:
[  556.556264]  dump_stack+0x7b/0xb5
[  556.556944]  print_address_description+0x70/0x290
[  556.557903]  kasan_report+0x291/0x390
[  556.558649]  ? arch_tlb_gather_mmu+0x52/0x170
[  556.559537]  __asan_store8+0x57/0x90
[  556.560268]  arch_tlb_gather_mmu+0x52/0x170
[  556.561110]  tlb_gather_mmu+0x12/0x40
[  556.561862]  exit_mmap+0x123/0x2a0
[  556.562555]  ? __ia32_sys_munmap+0x50/0x50
[  556.563384]  ? exit_aio+0x98/0x230
[  556.564079]  ? __x32_compat_sys_io_submit+0x260/0x260
[  556.565099]  ? taskstats_exit+0x1f4/0x640
[  556.565925]  ? kasan_check_read+0x11/0x20
[  556.566739]  ? mm_update_next_owner+0x322/0x380
[  556.567652]  mmput+0x8b/0x1d0
[  556.568260]  do_exit+0x43a/0x1390
[  556.568937]  ? mm_update_next_owner+0x380/0x380
[  556.569855]  ? deactivate_super+0x5e/0x80
[  556.570668]  ? cleanup_mnt+0x61/0xa0
[  556.571395]  ? __cleanup_mnt+0x12/0x20
[  556.572156]  ? task_work_run+0xc8/0xf0
[  556.572917]  ? exit_to_usermode_loop+0x125/0x130
[  556.573861]  rewind_stack_do_exit+0x17/0x20
[  556.574707] RIP: 0033:0x7f8028b77487
[  556.575428] Code: Bad RIP value.
[  556.576106] RSP: 002b:00007fff9f2e3598 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  556.577599] RAX: 0000000000000000 RBX: 0000000000ebd030 RCX: 00007f8028b77487
[  556.579020] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000ec41e0
[  556.580422] RBP: 0000000000ec41e0 R08: 0000000000000000 R09: 0000000000000014
[  556.581833] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f802908083c
[  556.583252] R13: 0000000000000000 R14: 0000000000ebd210 R15: 00007fff9f2e3820

[  556.584983] The buggy address belongs to the page:
[  556.585961] page:ffffea0007ca4bc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[  556.587540] flags: 0x2ffff0000000000()
[  556.588296] raw: 02ffff0000000000 0000000000000000 dead000000000200 0000000000000000
[  556.589822] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[  556.591359] page dumped because: kasan: bad access detected

[  556.592786] Memory state around the buggy address:
[  556.593753]  ffff8801f292fc00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  556.595191]  ffff8801f292fc80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 00 00
[  556.596613] >ffff8801f292fd00: 00 00 f3 00 00 00 00 f3 f3 00 00 00 00 f4 f4 f4
[  556.598044]                          ^
[  556.598797]  ffff8801f292fd80: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
[  556.600225]  ffff8801f292fe00: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 f4 f4
[  556.601647] ==================================================================

- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/f2fs/node.c#L987
		case NODE_DIND_BLOCK:
			err = truncate_nodes(&dn, nofs, offset[1], 3);
			cont = 0;
			break;

		default:
			BUG(); <---
		}

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:06 -07:00
Yunlei He
e15d54d500 f2fs: Allocate and stat mem used by free nid bitmap more accurately
This patch used f2fs_bitmap_size macro to calculate mem used by
free nid bitmap, and stat used mem including aligned part.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:23:26 -07:00
Chao Yu
9dc956b2c8 f2fs: fix to do sanity check with user_block_count
This patch fixs to do sanity check with user_block_count.

- Overview
Divide zero in utilization when mount() a corrupted f2fs image

- Reproduce (4.18 upstream kernel)

- Kernel message
[  564.099503] F2FS-fs (loop0): invalid crc value
[  564.101991] divide error: 0000 [#1] SMP KASAN PTI
[  564.103103] CPU: 1 PID: 1298 Comm: f2fs_discard-7: Not tainted 4.18.0-rc1+ #4
[  564.104584] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  564.106624] RIP: 0010:issue_discard_thread+0x248/0x5c0
[  564.107692] Code: ff ff 48 8b bd e8 fe ff ff 41 8b 9d 4c 04 00 00 e8 cd b8 ad ff 41 8b 85 50 04 00 00 31 d2 48 8d 04 80 48 8d 04 80 48 c1 e0 02 <48> f7 f3 83 f8 50 7e 16 41 c7 86 7c ff ff ff 01 00 00 00 41 c7 86
[  564.111686] RSP: 0018:ffff8801f3117dc0 EFLAGS: 00010206
[  564.112775] RAX: 0000000000000384 RBX: 0000000000000000 RCX: ffffffffb88c1e03
[  564.114250] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffff8801e3aa4850
[  564.115706] RBP: ffff8801f3117f00 R08: 1ffffffff751a1d0 R09: fffffbfff751a1d0
[  564.117177] R10: 0000000000000001 R11: fffffbfff751a1d0 R12: 00000000fffffffc
[  564.118634] R13: ffff8801e3aa4400 R14: ffff8801f3117ed8 R15: ffff8801e2050000
[  564.120094] FS:  0000000000000000(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  564.121748] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  564.122923] CR2: 000000000202b078 CR3: 00000001f11ac000 CR4: 00000000000006e0
[  564.124383] Call Trace:
[  564.124924]  ? __issue_discard_cmd+0x480/0x480
[  564.125882]  ? __sched_text_start+0x8/0x8
[  564.126756]  ? __kthread_parkme+0xcb/0x100
[  564.127620]  ? kthread_blkcg+0x70/0x70
[  564.128412]  kthread+0x180/0x1d0
[  564.129105]  ? __issue_discard_cmd+0x480/0x480
[  564.130029]  ? kthread_associate_blkcg+0x150/0x150
[  564.131033]  ret_from_fork+0x35/0x40
[  564.131794] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  564.141798] ---[ end trace 4ce02f25ff7d3df5 ]---
[  564.142773] RIP: 0010:issue_discard_thread+0x248/0x5c0
[  564.143885] Code: ff ff 48 8b bd e8 fe ff ff 41 8b 9d 4c 04 00 00 e8 cd b8 ad ff 41 8b 85 50 04 00 00 31 d2 48 8d 04 80 48 8d 04 80 48 c1 e0 02 <48> f7 f3 83 f8 50 7e 16 41 c7 86 7c ff ff ff 01 00 00 00 41 c7 86
[  564.147776] RSP: 0018:ffff8801f3117dc0 EFLAGS: 00010206
[  564.148856] RAX: 0000000000000384 RBX: 0000000000000000 RCX: ffffffffb88c1e03
[  564.150424] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffff8801e3aa4850
[  564.151906] RBP: ffff8801f3117f00 R08: 1ffffffff751a1d0 R09: fffffbfff751a1d0
[  564.153463] R10: 0000000000000001 R11: fffffbfff751a1d0 R12: 00000000fffffffc
[  564.154915] R13: ffff8801e3aa4400 R14: ffff8801f3117ed8 R15: ffff8801e2050000
[  564.156405] FS:  0000000000000000(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  564.158070] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  564.159279] CR2: 000000000202b078 CR3: 00000001f11ac000 CR4: 00000000000006e0
[  564.161043] ==================================================================
[  564.162587] BUG: KASAN: stack-out-of-bounds in from_kuid_munged+0x1d/0x50
[  564.163994] Read of size 4 at addr ffff8801f3117c84 by task f2fs_discard-7:/1298

[  564.165852] CPU: 1 PID: 1298 Comm: f2fs_discard-7: Tainted: G      D           4.18.0-rc1+ #4
[  564.167593] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  564.169522] Call Trace:
[  564.170057]  dump_stack+0x7b/0xb5
[  564.170778]  print_address_description+0x70/0x290
[  564.171765]  kasan_report+0x291/0x390
[  564.172540]  ? from_kuid_munged+0x1d/0x50
[  564.173408]  __asan_load4+0x78/0x80
[  564.174148]  from_kuid_munged+0x1d/0x50
[  564.174962]  do_notify_parent+0x1f5/0x4f0
[  564.175808]  ? send_sigqueue+0x390/0x390
[  564.176639]  ? css_set_move_task+0x152/0x340
[  564.184197]  do_exit+0x1290/0x1390
[  564.184950]  ? __issue_discard_cmd+0x480/0x480
[  564.185884]  ? mm_update_next_owner+0x380/0x380
[  564.186829]  ? __sched_text_start+0x8/0x8
[  564.187672]  ? __kthread_parkme+0xcb/0x100
[  564.188528]  ? kthread_blkcg+0x70/0x70
[  564.189333]  ? kthread+0x180/0x1d0
[  564.190052]  ? __issue_discard_cmd+0x480/0x480
[  564.190983]  rewind_stack_do_exit+0x17/0x20

[  564.192190] The buggy address belongs to the page:
[  564.193213] page:ffffea0007cc45c0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[  564.194856] flags: 0x2ffff0000000000()
[  564.195644] raw: 02ffff0000000000 0000000000000000 dead000000000200 0000000000000000
[  564.197247] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[  564.198826] page dumped because: kasan: bad access detected

[  564.200299] Memory state around the buggy address:
[  564.201306]  ffff8801f3117b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  564.202779]  ffff8801f3117c00: 00 00 00 00 00 00 00 00 00 00 00 f3 f3 f3 f3 f3
[  564.204252] >ffff8801f3117c80: f3 f3 f3 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
[  564.205742]                    ^
[  564.206424]  ffff8801f3117d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  564.207908]  ffff8801f3117d80: f3 f3 f3 f3 f3 f3 f3 f3 00 00 00 00 00 00 00 00
[  564.209389] ==================================================================
[  564.231795] F2FS-fs (loop0): Mounted with checkpoint version = 2

- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/f2fs/segment.h#L586
	return div_u64((u64)valid_user_blocks(sbi) * 100,
					sbi->user_block_count);
Missing checks on sbi->user_block_count.

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:23:23 -07:00
Chao Yu
76d56d4ab4 f2fs: fix to do sanity check with extra_attr feature
If FI_EXTRA_ATTR is set in inode by fuzzing, inode.i_addr[0] will be
parsed as inode.i_extra_isize, then in __recover_inline_status, inline
data address will beyond boundary of page, result in accessing invalid
memory.

So in this condition, during reading inode page, let's do sanity check
with EXTRA_ATTR feature of fs and extra_attr bit of inode, if they're
inconsistent, deny to load this inode.

- Overview
Out-of-bound access in f2fs_iget() when mounting a corrupted f2fs image

- Reproduce

The following message will be got in KASAN build of 4.18 upstream kernel.
[  819.392227] ==================================================================
[  819.393901] BUG: KASAN: slab-out-of-bounds in f2fs_iget+0x736/0x1530
[  819.395329] Read of size 4 at addr ffff8801f099c968 by task mount/1292

[  819.397079] CPU: 1 PID: 1292 Comm: mount Not tainted 4.18.0-rc1+ #4
[  819.397082] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  819.397088] Call Trace:
[  819.397124]  dump_stack+0x7b/0xb5
[  819.397154]  print_address_description+0x70/0x290
[  819.397159]  kasan_report+0x291/0x390
[  819.397163]  ? f2fs_iget+0x736/0x1530
[  819.397176]  check_memory_region+0x139/0x190
[  819.397182]  __asan_loadN+0xf/0x20
[  819.397185]  f2fs_iget+0x736/0x1530
[  819.397197]  f2fs_fill_super+0x1b4f/0x2b40
[  819.397202]  ? f2fs_fill_super+0x1b4f/0x2b40
[  819.397208]  ? f2fs_commit_super+0x1b0/0x1b0
[  819.397227]  ? set_blocksize+0x90/0x140
[  819.397241]  mount_bdev+0x1c5/0x210
[  819.397245]  ? f2fs_commit_super+0x1b0/0x1b0
[  819.397252]  f2fs_mount+0x15/0x20
[  819.397256]  mount_fs+0x60/0x1a0
[  819.397267]  ? alloc_vfsmnt+0x309/0x360
[  819.397272]  vfs_kern_mount+0x6b/0x1a0
[  819.397282]  do_mount+0x34a/0x18c0
[  819.397300]  ? lockref_put_or_lock+0xcf/0x160
[  819.397306]  ? copy_mount_string+0x20/0x20
[  819.397318]  ? memcg_kmem_put_cache+0x1b/0xa0
[  819.397324]  ? kasan_check_write+0x14/0x20
[  819.397334]  ? _copy_from_user+0x6a/0x90
[  819.397353]  ? memdup_user+0x42/0x60
[  819.397359]  ksys_mount+0x83/0xd0
[  819.397365]  __x64_sys_mount+0x67/0x80
[  819.397388]  do_syscall_64+0x78/0x170
[  819.397403]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  819.397422] RIP: 0033:0x7f54c667cb9a
[  819.397424] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[  819.397483] RSP: 002b:00007ffd8f46cd08 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
[  819.397496] RAX: ffffffffffffffda RBX: 0000000000dfa030 RCX: 00007f54c667cb9a
[  819.397498] RDX: 0000000000dfa210 RSI: 0000000000dfbf30 RDI: 0000000000e02ec0
[  819.397501] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[  819.397503] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 0000000000e02ec0
[  819.397505] R13: 0000000000dfa210 R14: 0000000000000000 R15: 0000000000000003

[  819.397866] Allocated by task 139:
[  819.398702]  save_stack+0x46/0xd0
[  819.398705]  kasan_kmalloc+0xad/0xe0
[  819.398709]  kasan_slab_alloc+0x11/0x20
[  819.398713]  kmem_cache_alloc+0xd1/0x1e0
[  819.398717]  dup_fd+0x50/0x4c0
[  819.398740]  copy_process.part.37+0xbed/0x32e0
[  819.398744]  _do_fork+0x16e/0x590
[  819.398748]  __x64_sys_clone+0x69/0x80
[  819.398752]  do_syscall_64+0x78/0x170
[  819.398756]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  819.399097] Freed by task 159:
[  819.399743]  save_stack+0x46/0xd0
[  819.399747]  __kasan_slab_free+0x13c/0x1a0
[  819.399750]  kasan_slab_free+0xe/0x10
[  819.399754]  kmem_cache_free+0x89/0x1e0
[  819.399757]  put_files_struct+0x132/0x150
[  819.399761]  exit_files+0x62/0x70
[  819.399766]  do_exit+0x47b/0x1390
[  819.399770]  do_group_exit+0x86/0x130
[  819.399774]  __x64_sys_exit_group+0x2c/0x30
[  819.399778]  do_syscall_64+0x78/0x170
[  819.399782]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  819.400115] The buggy address belongs to the object at ffff8801f099c680
                which belongs to the cache files_cache of size 704
[  819.403234] The buggy address is located 40 bytes to the right of
                704-byte region [ffff8801f099c680, ffff8801f099c940)
[  819.405689] The buggy address belongs to the page:
[  819.406709] page:ffffea0007c26700 count:1 mapcount:0 mapping:ffff8801f69a3340 index:0xffff8801f099d380 compound_mapcount: 0
[  819.408984] flags: 0x2ffff0000008100(slab|head)
[  819.409932] raw: 02ffff0000008100 ffffea00077fb600 0000000200000002 ffff8801f69a3340
[  819.411514] raw: ffff8801f099d380 0000000080130000 00000001ffffffff 0000000000000000
[  819.413073] page dumped because: kasan: bad access detected

[  819.414539] Memory state around the buggy address:
[  819.415521]  ffff8801f099c800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  819.416981]  ffff8801f099c880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  819.418454] >ffff8801f099c900: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[  819.419921]                                                           ^
[  819.421265]  ffff8801f099c980: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[  819.422745]  ffff8801f099ca00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  819.424206] ==================================================================
[  819.425668] Disabling lock debugging due to kernel taint
[  819.457463] F2FS-fs (loop0): Mounted with checkpoint version = 3

The kernel still mounts the image. If you run the following program on the mounted folder mnt,

(poc.c)

static void activity(char *mpoint) {

  char *foo_bar_baz;
  int err;

  static int buf[8192];
  memset(buf, 0, sizeof(buf));

  err = asprintf(&foo_bar_baz, "%s/foo/bar/baz", mpoint);
    int fd = open(foo_bar_baz, O_RDONLY, 0);
  if (fd >= 0) {
      read(fd, (char *)buf, 11);
      close(fd);
  }
}

int main(int argc, char *argv[]) {
  activity(argv[1]);
  return 0;
}

You can get kernel crash:
[  819.457463] F2FS-fs (loop0): Mounted with checkpoint version = 3
[  918.028501] BUG: unable to handle kernel paging request at ffffed0048000d82
[  918.044020] PGD 23ffee067 P4D 23ffee067 PUD 23fbef067 PMD 0
[  918.045207] Oops: 0000 [#1] SMP KASAN PTI
[  918.046048] CPU: 0 PID: 1309 Comm: poc Tainted: G    B             4.18.0-rc1+ #4
[  918.047573] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  918.049552] RIP: 0010:check_memory_region+0x5e/0x190
[  918.050565] Code: f8 49 c1 e8 03 49 89 db 49 c1 eb 03 4d 01 cb 4d 01 c1 4d 8d 63 01 4c 89 c8 4d 89 e2 4d 29 ca 49 83 fa 10 7f 3d 4d 85 d2 74 32 <41> 80 39 00 75 23 48 b8 01 00 00 00 00 fc ff df 4d 01 d1 49 01 c0
[  918.054322] RSP: 0018:ffff8801e3a1f258 EFLAGS: 00010202
[  918.055400] RAX: ffffed0048000d82 RBX: ffff880240006c11 RCX: ffffffffb8867d14
[  918.056832] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff880240006c10
[  918.058253] RBP: ffff8801e3a1f268 R08: 1ffff10048000d82 R09: ffffed0048000d82
[  918.059717] R10: 0000000000000001 R11: ffffed0048000d82 R12: ffffed0048000d83
[  918.061159] R13: ffff8801e3a1f390 R14: 0000000000000000 R15: ffff880240006c08
[  918.062614] FS:  00007fac9732c700(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  918.064246] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  918.065412] CR2: ffffed0048000d82 CR3: 00000001df77a000 CR4: 00000000000006f0
[  918.066882] Call Trace:
[  918.067410]  __asan_loadN+0xf/0x20
[  918.068149]  f2fs_find_target_dentry+0xf4/0x270
[  918.069083]  ? __get_node_page+0x331/0x5b0
[  918.069925]  f2fs_find_in_inline_dir+0x24b/0x310
[  918.070881]  ? f2fs_recover_inline_data+0x4c0/0x4c0
[  918.071905]  ? unwind_next_frame.part.5+0x34f/0x490
[  918.072901]  ? unwind_dump+0x290/0x290
[  918.073695]  ? is_bpf_text_address+0xe/0x20
[  918.074566]  __f2fs_find_entry+0x599/0x670
[  918.075408]  ? kasan_unpoison_shadow+0x36/0x50
[  918.076315]  ? kasan_kmalloc+0xad/0xe0
[  918.077100]  ? memcg_kmem_put_cache+0x55/0xa0
[  918.077998]  ? f2fs_find_target_dentry+0x270/0x270
[  918.079006]  ? d_set_d_op+0x30/0x100
[  918.079749]  ? __d_lookup_rcu+0x69/0x2e0
[  918.080556]  ? __d_alloc+0x275/0x450
[  918.081297]  ? kasan_check_write+0x14/0x20
[  918.082135]  ? memset+0x31/0x40
[  918.082820]  ? fscrypt_setup_filename+0x1ec/0x4c0
[  918.083782]  ? d_alloc_parallel+0x5bb/0x8c0
[  918.084640]  f2fs_find_entry+0xe9/0x110
[  918.085432]  ? __f2fs_find_entry+0x670/0x670
[  918.086308]  ? kasan_check_write+0x14/0x20
[  918.087163]  f2fs_lookup+0x297/0x590
[  918.087902]  ? f2fs_link+0x2b0/0x2b0
[  918.088646]  ? legitimize_path.isra.29+0x61/0xa0
[  918.089589]  __lookup_slow+0x12e/0x240
[  918.090371]  ? may_delete+0x2b0/0x2b0
[  918.091123]  ? __nd_alloc_stack+0xa0/0xa0
[  918.091944]  lookup_slow+0x44/0x60
[  918.092642]  walk_component+0x3ee/0xa40
[  918.093428]  ? is_bpf_text_address+0xe/0x20
[  918.094283]  ? pick_link+0x3e0/0x3e0
[  918.095047]  ? in_group_p+0xa5/0xe0
[  918.095771]  ? generic_permission+0x53/0x1e0
[  918.096666]  ? security_inode_permission+0x1d/0x70
[  918.097646]  ? inode_permission+0x7a/0x1f0
[  918.098497]  link_path_walk+0x2a2/0x7b0
[  918.099298]  ? apparmor_capget+0x3d0/0x3d0
[  918.100140]  ? walk_component+0xa40/0xa40
[  918.100958]  ? path_init+0x2e6/0x580
[  918.101695]  path_openat+0x1bb/0x2160
[  918.102471]  ? __save_stack_trace+0x92/0x100
[  918.103352]  ? save_stack+0xb5/0xd0
[  918.104070]  ? vfs_unlink+0x250/0x250
[  918.104822]  ? save_stack+0x46/0xd0
[  918.105538]  ? kasan_slab_alloc+0x11/0x20
[  918.106370]  ? kmem_cache_alloc+0xd1/0x1e0
[  918.107213]  ? getname_flags+0x76/0x2c0
[  918.107997]  ? getname+0x12/0x20
[  918.108677]  ? do_sys_open+0x14b/0x2c0
[  918.109450]  ? __x64_sys_open+0x4c/0x60
[  918.110255]  ? do_syscall_64+0x78/0x170
[  918.111083]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  918.112148]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  918.113204]  ? f2fs_empty_inline_dir+0x1e0/0x1e0
[  918.114150]  ? timespec64_trunc+0x5c/0x90
[  918.114993]  ? wb_io_lists_depopulated+0x1a/0xc0
[  918.115937]  ? inode_io_list_move_locked+0x102/0x110
[  918.116949]  do_filp_open+0x12b/0x1d0
[  918.117709]  ? may_open_dev+0x50/0x50
[  918.118475]  ? kasan_kmalloc+0xad/0xe0
[  918.119246]  do_sys_open+0x17c/0x2c0
[  918.119983]  ? do_sys_open+0x17c/0x2c0
[  918.120751]  ? filp_open+0x60/0x60
[  918.121463]  ? task_work_run+0x4d/0xf0
[  918.122237]  __x64_sys_open+0x4c/0x60
[  918.123001]  do_syscall_64+0x78/0x170
[  918.123759]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  918.124802] RIP: 0033:0x7fac96e3e040
[  918.125537] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 09 27 2d 00 00 75 10 b8 02 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 7e e0 01 00 48 89 04 24
[  918.129341] RSP: 002b:00007fff1b37f848 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[  918.130870] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fac96e3e040
[  918.132295] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000122d080
[  918.133748] RBP: 00007fff1b37f9b0 R08: 00007fac9710bbd8 R09: 0000000000000001
[  918.135209] R10: 000000000000069d R11: 0000000000000246 R12: 0000000000400c20
[  918.136650] R13: 00007fff1b37fab0 R14: 0000000000000000 R15: 0000000000000000
[  918.138093] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  918.147924] CR2: ffffed0048000d82
[  918.148619] ---[ end trace 4ce02f25ff7d3df5 ]---
[  918.149563] RIP: 0010:check_memory_region+0x5e/0x190
[  918.150576] Code: f8 49 c1 e8 03 49 89 db 49 c1 eb 03 4d 01 cb 4d 01 c1 4d 8d 63 01 4c 89 c8 4d 89 e2 4d 29 ca 49 83 fa 10 7f 3d 4d 85 d2 74 32 <41> 80 39 00 75 23 48 b8 01 00 00 00 00 fc ff df 4d 01 d1 49 01 c0
[  918.154360] RSP: 0018:ffff8801e3a1f258 EFLAGS: 00010202
[  918.155411] RAX: ffffed0048000d82 RBX: ffff880240006c11 RCX: ffffffffb8867d14
[  918.156833] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff880240006c10
[  918.158257] RBP: ffff8801e3a1f268 R08: 1ffff10048000d82 R09: ffffed0048000d82
[  918.159722] R10: 0000000000000001 R11: ffffed0048000d82 R12: ffffed0048000d83
[  918.161149] R13: ffff8801e3a1f390 R14: 0000000000000000 R15: ffff880240006c08
[  918.162587] FS:  00007fac9732c700(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  918.164203] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  918.165356] CR2: ffffed0048000d82 CR3: 00000001df77a000 CR4: 00000000000006f0

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
01f9cf6db7 f2fs: fix to correct return value of f2fs_trim_fs
We should account trimmed block number from __wait_all_discard_cmd
in __issue_discard_cmd_range, otherwise trimmed blocks returned
by f2fs_trim_fs will be wrong, this patch fixes it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
c77ec61ca0 f2fs: fix to do sanity check with {sit,nat}_ver_bitmap_bytesize
This patch adds to do sanity check with {sit,nat}_ver_bitmap_bytesize
during mount, in order to avoid accessing across cache boundary with
this abnormal bitmap size.

- Overview
buffer overrun in build_sit_info() when mounting a crafted f2fs image

- Reproduce

- Kernel message
[  548.580867] F2FS-fs (loop0): Invalid log blocks per segment (8201)

[  548.580877] F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
[  548.584979] ==================================================================
[  548.586568] BUG: KASAN: use-after-free in kmemdup+0x36/0x50
[  548.587715] Read of size 64 at addr ffff8801e9c265ff by task mount/1295

[  548.589428] CPU: 1 PID: 1295 Comm: mount Not tainted 4.18.0-rc1+ #4
[  548.589432] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  548.589438] Call Trace:
[  548.589474]  dump_stack+0x7b/0xb5
[  548.589487]  print_address_description+0x70/0x290
[  548.589492]  kasan_report+0x291/0x390
[  548.589496]  ? kmemdup+0x36/0x50
[  548.589509]  check_memory_region+0x139/0x190
[  548.589514]  memcpy+0x23/0x50
[  548.589518]  kmemdup+0x36/0x50
[  548.589545]  f2fs_build_segment_manager+0x8fa/0x3410
[  548.589551]  ? __asan_loadN+0xf/0x20
[  548.589560]  ? f2fs_sanity_check_ckpt+0x1be/0x240
[  548.589566]  ? f2fs_flush_sit_entries+0x10c0/0x10c0
[  548.589587]  ? __put_user_ns+0x40/0x40
[  548.589604]  ? find_next_bit+0x57/0x90
[  548.589610]  f2fs_fill_super+0x194b/0x2b40
[  548.589617]  ? f2fs_commit_super+0x1b0/0x1b0
[  548.589637]  ? set_blocksize+0x90/0x140
[  548.589651]  mount_bdev+0x1c5/0x210
[  548.589655]  ? f2fs_commit_super+0x1b0/0x1b0
[  548.589667]  f2fs_mount+0x15/0x20
[  548.589672]  mount_fs+0x60/0x1a0
[  548.589683]  ? alloc_vfsmnt+0x309/0x360
[  548.589688]  vfs_kern_mount+0x6b/0x1a0
[  548.589699]  do_mount+0x34a/0x18c0
[  548.589710]  ? lockref_put_or_lock+0xcf/0x160
[  548.589716]  ? copy_mount_string+0x20/0x20
[  548.589728]  ? memcg_kmem_put_cache+0x1b/0xa0
[  548.589734]  ? kasan_check_write+0x14/0x20
[  548.589740]  ? _copy_from_user+0x6a/0x90
[  548.589744]  ? memdup_user+0x42/0x60
[  548.589750]  ksys_mount+0x83/0xd0
[  548.589755]  __x64_sys_mount+0x67/0x80
[  548.589781]  do_syscall_64+0x78/0x170
[  548.589797]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  548.589820] RIP: 0033:0x7f76fc331b9a
[  548.589821] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[  548.589880] RSP: 002b:00007ffd4f0a0e48 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[  548.589890] RAX: ffffffffffffffda RBX: 000000000146c030 RCX: 00007f76fc331b9a
[  548.589892] RDX: 000000000146c210 RSI: 000000000146df30 RDI: 0000000001474ec0
[  548.589895] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[  548.589897] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001474ec0
[  548.589900] R13: 000000000146c210 R14: 0000000000000000 R15: 0000000000000003

[  548.590242] The buggy address belongs to the page:
[  548.591243] page:ffffea0007a70980 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[  548.592886] flags: 0x2ffff0000000000()
[  548.593665] raw: 02ffff0000000000 dead000000000100 dead000000000200 0000000000000000
[  548.595258] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[  548.603713] page dumped because: kasan: bad access detected

[  548.605203] Memory state around the buggy address:
[  548.606198]  ffff8801e9c26480: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  548.607676]  ffff8801e9c26500: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  548.609157] >ffff8801e9c26580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  548.610629]                                                                 ^
[  548.612088]  ffff8801e9c26600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  548.613674]  ffff8801e9c26680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  548.615141] ==================================================================
[  548.616613] Disabling lock debugging due to kernel taint
[  548.622871] WARNING: CPU: 1 PID: 1295 at mm/page_alloc.c:4065 __alloc_pages_slowpath+0xe4a/0x1420
[  548.622878] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  548.623217] CPU: 1 PID: 1295 Comm: mount Tainted: G    B             4.18.0-rc1+ #4
[  548.623219] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  548.623226] RIP: 0010:__alloc_pages_slowpath+0xe4a/0x1420
[  548.623227] Code: ff ff 01 89 85 c8 fe ff ff e9 91 fc ff ff 41 89 c5 e9 5c fc ff ff 0f 0b 89 f8 25 ff ff f7 ff 89 85 8c fe ff ff e9 d5 f2 ff ff <0f> 0b e9 65 f2 ff ff 65 8b 05 38 81 d2 47 f6 c4 01 74 1c 65 48 8b
[  548.623281] RSP: 0018:ffff8801f28c7678 EFLAGS: 00010246
[  548.623284] RAX: 0000000000000000 RBX: 00000000006040c0 RCX: ffffffffb82f73b7
[  548.623287] RDX: 1ffff1003e518eeb RSI: 000000000000000c RDI: 0000000000000000
[  548.623290] RBP: ffff8801f28c7880 R08: 0000000000000000 R09: ffffed0047fff2c5
[  548.623292] R10: 0000000000000001 R11: ffffed0047fff2c4 R12: ffff8801e88de040
[  548.623295] R13: 00000000006040c0 R14: 000000000000000c R15: ffff8801f28c7938
[  548.623299] FS:  00007f76fca51840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  548.623302] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  548.623304] CR2: 00007f19b9171760 CR3: 00000001ed952000 CR4: 00000000000006e0
[  548.623317] Call Trace:
[  548.623325]  ? kasan_check_read+0x11/0x20
[  548.623330]  ? __zone_watermark_ok+0x92/0x240
[  548.623336]  ? get_page_from_freelist+0x1c3/0x1d90
[  548.623347]  ? _raw_spin_lock_irqsave+0x2a/0x60
[  548.623353]  ? warn_alloc+0x250/0x250
[  548.623358]  ? save_stack+0x46/0xd0
[  548.623361]  ? kasan_kmalloc+0xad/0xe0
[  548.623366]  ? __isolate_free_page+0x2a0/0x2a0
[  548.623370]  ? mount_fs+0x60/0x1a0
[  548.623374]  ? vfs_kern_mount+0x6b/0x1a0
[  548.623378]  ? do_mount+0x34a/0x18c0
[  548.623383]  ? ksys_mount+0x83/0xd0
[  548.623387]  ? __x64_sys_mount+0x67/0x80
[  548.623391]  ? do_syscall_64+0x78/0x170
[  548.623396]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  548.623401]  __alloc_pages_nodemask+0x3c5/0x400
[  548.623407]  ? __alloc_pages_slowpath+0x1420/0x1420
[  548.623412]  ? __mutex_lock_slowpath+0x20/0x20
[  548.623417]  ? kvmalloc_node+0x31/0x80
[  548.623424]  alloc_pages_current+0x75/0x110
[  548.623436]  kmalloc_order+0x24/0x60
[  548.623442]  kmalloc_order_trace+0x24/0xb0
[  548.623448]  __kmalloc_track_caller+0x207/0x220
[  548.623455]  ? f2fs_build_node_manager+0x399/0xbb0
[  548.623460]  kmemdup+0x20/0x50
[  548.623465]  f2fs_build_node_manager+0x399/0xbb0
[  548.623470]  f2fs_fill_super+0x195e/0x2b40
[  548.623477]  ? f2fs_commit_super+0x1b0/0x1b0
[  548.623481]  ? set_blocksize+0x90/0x140
[  548.623486]  mount_bdev+0x1c5/0x210
[  548.623489]  ? f2fs_commit_super+0x1b0/0x1b0
[  548.623495]  f2fs_mount+0x15/0x20
[  548.623498]  mount_fs+0x60/0x1a0
[  548.623503]  ? alloc_vfsmnt+0x309/0x360
[  548.623508]  vfs_kern_mount+0x6b/0x1a0
[  548.623513]  do_mount+0x34a/0x18c0
[  548.623518]  ? lockref_put_or_lock+0xcf/0x160
[  548.623523]  ? copy_mount_string+0x20/0x20
[  548.623528]  ? memcg_kmem_put_cache+0x1b/0xa0
[  548.623533]  ? kasan_check_write+0x14/0x20
[  548.623537]  ? _copy_from_user+0x6a/0x90
[  548.623542]  ? memdup_user+0x42/0x60
[  548.623547]  ksys_mount+0x83/0xd0
[  548.623552]  __x64_sys_mount+0x67/0x80
[  548.623557]  do_syscall_64+0x78/0x170
[  548.623562]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  548.623566] RIP: 0033:0x7f76fc331b9a
[  548.623567] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[  548.623632] RSP: 002b:00007ffd4f0a0e48 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[  548.623636] RAX: ffffffffffffffda RBX: 000000000146c030 RCX: 00007f76fc331b9a
[  548.623639] RDX: 000000000146c210 RSI: 000000000146df30 RDI: 0000000001474ec0
[  548.623641] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[  548.623643] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001474ec0
[  548.623646] R13: 000000000146c210 R14: 0000000000000000 R15: 0000000000000003
[  548.623650] ---[ end trace 4ce02f25ff7d3df5 ]---
[  548.623656] F2FS-fs (loop0): Failed to initialize F2FS node manager
[  548.627936] F2FS-fs (loop0): Invalid log blocks per segment (8201)

[  548.627940] F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
[  548.635835] F2FS-fs (loop0): Failed to initialize F2FS node manager

- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/f2fs/segment.c#L3578

	sit_i->sit_bitmap = kmemdup(src_bitmap, bitmap_size, GFP_KERNEL);

Buffer overrun happens when doing memcpy. I suspect there is missing (inconsistent) checks on bitmap_size.

Reported by Wen Xu (wen.xu@gatech.edu) from SSLab, Gatech.

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
42bf546c1f f2fs: fix to do sanity check with secs_per_zone
As Wen Xu reported in below link:

https://bugzilla.kernel.org/show_bug.cgi?id=200183

- Overview
Divide zero in reset_curseg() when mounting a crafted f2fs image

- Reproduce

- Kernel message
[  588.281510] divide error: 0000 [#1] SMP KASAN PTI
[  588.282701] CPU: 0 PID: 1293 Comm: mount Not tainted 4.18.0-rc1+ #4
[  588.284000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  588.286178] RIP: 0010:reset_curseg+0x94/0x1a0
[  588.298166] RSP: 0018:ffff8801e88d7940 EFLAGS: 00010246
[  588.299360] RAX: 0000000000000014 RBX: ffff8801e1d46d00 RCX: ffffffffb88bf60b
[  588.300809] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffff8801e1d46d64
[  588.305272] R13: 0000000000000000 R14: 0000000000000014 R15: 0000000000000000
[  588.306822] FS:  00007fad85008840(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  588.308456] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  588.309623] CR2: 0000000001705078 CR3: 00000001f30f8000 CR4: 00000000000006f0
[  588.311085] Call Trace:
[  588.311637]  f2fs_build_segment_manager+0x103f/0x3410
[  588.316136]  ? f2fs_commit_super+0x1b0/0x1b0
[  588.317031]  ? set_blocksize+0x90/0x140
[  588.319473]  f2fs_mount+0x15/0x20
[  588.320166]  mount_fs+0x60/0x1a0
[  588.320847]  ? alloc_vfsmnt+0x309/0x360
[  588.321647]  vfs_kern_mount+0x6b/0x1a0
[  588.322432]  do_mount+0x34a/0x18c0
[  588.323175]  ? strndup_user+0x46/0x70
[  588.323937]  ? copy_mount_string+0x20/0x20
[  588.324793]  ? memcg_kmem_put_cache+0x1b/0xa0
[  588.325702]  ? kasan_check_write+0x14/0x20
[  588.326562]  ? _copy_from_user+0x6a/0x90
[  588.327375]  ? memdup_user+0x42/0x60
[  588.328118]  ksys_mount+0x83/0xd0
[  588.328808]  __x64_sys_mount+0x67/0x80
[  588.329607]  do_syscall_64+0x78/0x170
[  588.330400]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  588.331461] RIP: 0033:0x7fad848e8b9a
[  588.336022] RSP: 002b:00007ffd7c5b6be8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[  588.337547] RAX: ffffffffffffffda RBX: 00000000016f8030 RCX: 00007fad848e8b9a
[  588.338999] RDX: 00000000016f8210 RSI: 00000000016f9f30 RDI: 0000000001700ec0
[  588.340442] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[  588.341887] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001700ec0
[  588.343341] R13: 00000000016f8210 R14: 0000000000000000 R15: 0000000000000003
[  588.354891] ---[ end trace 4ce02f25ff7d3df5 ]---
[  588.355862] RIP: 0010:reset_curseg+0x94/0x1a0
[  588.360742] RSP: 0018:ffff8801e88d7940 EFLAGS: 00010246
[  588.361812] RAX: 0000000000000014 RBX: ffff8801e1d46d00 RCX: ffffffffb88bf60b
[  588.363485] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffff8801e1d46d64
[  588.365213] RBP: ffff8801e88d7968 R08: ffffed003c32266f R09: ffffed003c32266f
[  588.366661] R10: 0000000000000001 R11: ffffed003c32266e R12: ffff8801f0337700
[  588.368110] R13: 0000000000000000 R14: 0000000000000014 R15: 0000000000000000
[  588.370057] FS:  00007fad85008840(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  588.372099] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  588.373291] CR2: 0000000001705078 CR3: 00000001f30f8000 CR4: 00000000000006f0

- Location
https://elixir.bootlin.com/linux/latest/source/fs/f2fs/segment.c#L2147
        curseg->zone = GET_ZONE_FROM_SEG(sbi, curseg->segno);

If secs_per_zone is corrupted due to fuzzing test, it will cause divide
zero operation when using GET_ZONE_FROM_SEG macro, so we should do more
sanity check with secs_per_zone during mount to avoid this issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
67fce70ba3 f2fs: disable f2fs_check_rb_tree_consistence
If there is millions of discard entries cached in rb tree, each
sanity check of it can cause very long latency as held cmd_lock
blocking other lock grabbers.

In other aspect, we have enabled the check very long time, as
we see, there is no such inconsistent condition caused by bugs.

But still we do not choose to kill it directly, instead, adding
an flag to disable the check now, if there is related code change,
we can reuse it to detect bugs.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
e1da7872f6 f2fs: introduce and spread verify_blkaddr
This patch introduces verify_blkaddr to check meta/data block address
with valid range to detect bug earlier.

In addition, once we encounter an invalid blkaddr, notice user to run
fsck to fix, and let the kernel panic.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Arnd Bergmann
24b81dfcb7 f2fs: use timespec64 for inode timestamps
The on-disk representation and the vfs both use 64-bit tv_sec values,
so let's change the last missing piece in the middle.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
6aead1617b f2fs: fix to wait on page writeback before updating page
In error path of f2fs_move_rehashed_dirents, inode page could be writeback
state, so we should wait on inode page writeback before updating it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Jaegeuk Kim
e2e59414aa f2fs: assign REQ_RAHEAD to bio for ->readpages
As Jens reported, we'd better assign REQ_RAHEAD to bio by the fact that
->readpages is called only from read-ahead.

In Documentation/filesystems/vfs.txt,

readpages: called by the VM to read pages associated with the address_space
  	object. This is essentially just a vector version of
  	readpage.  Instead of just one page, several pages are
  	requested.
	readpages is only used for read-ahead, so read errors are
  	ignored.  If anything goes wrong, feel free to give up.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Yunlei He
2a63531a61 f2fs: fix a hungtask problem caused by congestion_wait
This patch fix hungtask problem which can be reproduced as follow:

Thread 0~3:
while true
do
        touch /xxx/test/file_xxx
done

Thread 4 write a new checkpoint every three seconds.

In the meantime, fio start 16 threads for randwrite.

With my debug info, cycles num will exceed 1000 in function
f2fs_sync_dirty_inodes, and most of cycle will be dropped
into congestion_wait() and sleep more than 20ms. Cycles num
reduced to 3 with this patch.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Dan Carpenter
2a96d8ad94 f2fs: Fix uninitialized return in f2fs_ioc_shutdown()
"ret" can be uninitialized on the success path when "in ==
F2FS_GOING_DOWN_FULLSYNC".

Fixes: 60b2b4ee2b ("f2fs: Fix deadlock in shutdown ioctl")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Jaegeuk Kim
5a6154920f f2fs: don't issue discard commands in online discard is on
Actually, we don't need to issue discard commands, if discard is on, as
mentioned in the comment.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
e2374015f2 f2fs: fix to propagate return value of scan_nat_page()
As Anatoly Trosinenko reported in bugzilla:

How to reproduce:
1. Compile the 73fcb1a370 version of the kernel using the config attached
2. Unpack and mount the attached filesystem image as F2FS
3. The kernel will BUG() on mount (BUGs are explicitly enabled in config)

[    2.233612] F2FS-fs (sda): Found nat_bits in checkpoint
[    2.248422] ------------[ cut here ]------------
[    2.248857] kernel BUG at fs/f2fs/node.c:1967!
[    2.249760] invalid opcode: 0000 [#1] SMP NOPTI
[    2.250219] Modules linked in:
[    2.251848] CPU: 0 PID: 944 Comm: mount Not tainted 4.17.0-rc5+ #1
[    2.252331] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[    2.253305] RIP: 0010:build_free_nids+0x337/0x3f0
[    2.253672] RSP: 0018:ffffae7fc0857c50 EFLAGS: 00000246
[    2.254080] RAX: 00000000ffffffff RBX: 0000000000000123 RCX: 0000000000000001
[    2.254638] RDX: ffff9aa7063d5c00 RSI: 0000000000000122 RDI: ffff9aa705852e00
[    2.255190] RBP: ffff9aa705852e00 R08: 0000000000000001 R09: ffff9aa7059090c0
[    2.255719] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9aa705852e00
[    2.256242] R13: ffff9aa7063ad000 R14: ffff9aa705919000 R15: 0000000000000123
[    2.256809] FS:  00000000023078c0(0000) GS:ffff9aa707800000(0000) knlGS:0000000000000000
[    2.258654] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.259153] CR2: 00000000005511ae CR3: 0000000005872000 CR4: 00000000000006f0
[    2.259801] Call Trace:
[    2.260583]  build_node_manager+0x5cd/0x600
[    2.260963]  f2fs_fill_super+0x66a/0x17c0
[    2.261300]  ? f2fs_commit_super+0xe0/0xe0
[    2.261622]  mount_bdev+0x16e/0x1a0
[    2.261899]  mount_fs+0x30/0x150
[    2.262398]  vfs_kern_mount.part.28+0x4f/0xf0
[    2.262743]  do_mount+0x5d0/0xc60
[    2.263010]  ? _copy_from_user+0x37/0x60
[    2.263313]  ? memdup_user+0x39/0x60
[    2.263692]  ksys_mount+0x7b/0xd0
[    2.263960]  __x64_sys_mount+0x1c/0x20
[    2.264268]  do_syscall_64+0x43/0xf0
[    2.264560]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    2.265095] RIP: 0033:0x48d31a
[    2.265502] RSP: 002b:00007ffc6fe60a08 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[    2.266089] RAX: ffffffffffffffda RBX: 0000000000008000 RCX: 000000000048d31a
[    2.266607] RDX: 00007ffc6fe62fa5 RSI: 00007ffc6fe62f9d RDI: 00007ffc6fe62f94
[    2.267130] RBP: 00000000023078a0 R08: 0000000000000000 R09: 0000000000000000
[    2.267670] R10: 0000000000008000 R11: 0000000000000246 R12: 0000000000000000
[    2.268192] R13: 0000000000000000 R14: 00007ffc6fe60c78 R15: 0000000000000000
[    2.268767] Code: e8 5f c3 ff ff 83 c3 01 41 83 c7 01 81 fb c7 01 00 00 74 48 44 39 7d 04 76 42 48 63 c3 48 8d 04 c0 41 8b 44 06 05 83 f8 ff 75 c1 <0f> 0b 49 8b 45 50 48 8d b8 b0 00 00 00 e8 37 59 69 00 b9 01 00
[    2.270434] RIP: build_free_nids+0x337/0x3f0 RSP: ffffae7fc0857c50
[    2.271426] ---[ end trace ab20c06cd3c8fde4 ]---

During loading NAT entries, we will do sanity check, once the entry info
is corrupted, it will cause BUG_ON directly to protect user data from
being overwrited.

In this case, it will be better to just return failure on mount() instead
of panic, so that user can get hint from kmsg and try fsck for recovery
immediately rather than after an abnormal reboot.

https://bugzilla.kernel.org/show_bug.cgi?id=199769

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Weichao Guo
54c55c4e4f f2fs: support in-memory inode checksum when checking consistency
Enable in-memory inode checksum to protect metadata blocks from
in-memory scribbles when checking consistency, which has no
performance requirements.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
4e423832a6 f2fs: fix error path of fill_super
In fill_super, if root inode's attribute is incorrect, we need to
call f2fs_destroy_stats to release stats memory.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
4cac90d549 f2fs: relocate readdir_ra configure initialization
readdir_ra is sysfs configuration instead of mount option, so it should
not be initialized in default_options(), otherwise after remount, it can
be reset to be enabled which may not as user wish, so let's move it to
f2fs_tuning_parameters().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
0aa7e0f8c0 f2fs: move s_res{u,g}id initialization to default_options()
Let default_options() initialize s_res{u,g}id with default value like
other options.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
76a45e3c45 f2fs: don't acquire orphan ino during recovery
During orphan inode recovery, checkpoint should never succeed due to
SBI_POR_DOING flag, so we don't need acquire orphan ino which only be
used by checkpoint.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Jaegeuk Kim
a1933c09ef f2fs: avoid potential deadlock in f2fs_sbi_store
[  155.018460] ======================================================
[  155.021431] WARNING: possible circular locking dependency detected
[  155.024339] 4.18.0-rc3+ #5 Tainted: G           OE
[  155.026879] ------------------------------------------------------
[  155.029783] umount/2901 is trying to acquire lock:
[  155.032187] 00000000c4282f1f (kn->count#130){++++}, at: kernfs_remove+0x1f/0x30
[  155.035439]
[  155.035439] but task is already holding lock:
[  155.038892] 0000000056e4307b (&type->s_umount_key#41){++++}, at: deactivate_super+0x33/0x50
[  155.042602]
[  155.042602] which lock already depends on the new lock.
[  155.042602]
[  155.047465]
[  155.047465] the existing dependency chain (in reverse order) is:
[  155.051354]
[  155.051354] -> #1 (&type->s_umount_key#41){++++}:
[  155.054768]        f2fs_sbi_store+0x61/0x460 [f2fs]
[  155.057083]        kernfs_fop_write+0x113/0x1a0
[  155.059277]        __vfs_write+0x36/0x180
[  155.061250]        vfs_write+0xbe/0x1b0
[  155.063179]        ksys_write+0x55/0xc0
[  155.065068]        do_syscall_64+0x60/0x1b0
[  155.067071]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  155.069529]
[  155.069529] -> #0 (kn->count#130){++++}:
[  155.072421]        __kernfs_remove+0x26f/0x2e0
[  155.074452]        kernfs_remove+0x1f/0x30
[  155.076342]        kobject_del.part.5+0xe/0x40
[  155.078354]        f2fs_put_super+0x12d/0x290 [f2fs]
[  155.080500]        generic_shutdown_super+0x6c/0x110
[  155.082655]        kill_block_super+0x21/0x50
[  155.084634]        kill_f2fs_super+0x9c/0xc0 [f2fs]
[  155.086726]        deactivate_locked_super+0x3f/0x70
[  155.088826]        cleanup_mnt+0x3b/0x70
[  155.090584]        task_work_run+0x93/0xc0
[  155.092367]        exit_to_usermode_loop+0xf0/0x100
[  155.094466]        do_syscall_64+0x162/0x1b0
[  155.096312]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  155.098603]
[  155.098603] other info that might help us debug this:
[  155.098603]
[  155.102418]  Possible unsafe locking scenario:
[  155.102418]
[  155.105134]        CPU0                    CPU1
[  155.107037]        ----                    ----
[  155.108910]   lock(&type->s_umount_key#41);
[  155.110674]                                lock(kn->count#130);
[  155.113010]                                lock(&type->s_umount_key#41);
[  155.115608]   lock(kn->count#130);

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Jaegeuk Kim
83a3bfdb5a f2fs: indicate shutdown f2fs to allow unmount successfully
Once we shutdown f2fs, we have to flush stale pages in order to unmount
the system. In order to make stable, we need to stop fault injection as well.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:56 +09:00
Jaegeuk Kim
af697c0f5c f2fs: keep meta pages in cp_error state
It turns out losing meta pages in shutdown period makes f2fs very unstable
so that I could see many unexpected error conditions.

Let's keep meta pages for fault injection and sudden power-off tests.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:39 +09:00
Michael Callahan
dbae2c5513 block: Define and use STAT_READ and STAT_WRITE
Add defines for STAT_READ and STAT_WRITE for indexing the partition
stat entries. This clarifies some fs/ code which has hardcoded 1 for
STAT_WRITE and will make it easier to extend the stats with additional
fields.

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:18 -06:00
Jaegeuk Kim
1cb50f87e1 f2fs: do checkpoint in kill_sb
When unmounting f2fs in force mode, we can get it stuck by io_schedule()
by some pending IOs in meta_inode.

io_schedule+0xd/0x30
wait_on_page_bit_common+0xc6/0x130
__filemap_fdatawait_range+0xbd/0x100
filemap_fdatawait_keep_errors+0x15/0x40
sync_inodes_sb+0x1cf/0x240
sync_filesystem+0x52/0x90
generic_shutdown_super+0x1d/0x110
kill_f2fs_super+0x28/0x80 [f2fs]
deactivate_locked_super+0x35/0x60
cleanup_mnt+0x36/0x70
task_work_run+0x79/0xa0
exit_to_usermode_loop+0x62/0x70
do_syscall_64+0xdb/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
0xffffffffffffffff

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-15 12:14:04 +09:00
Jaegeuk Kim
8a56dd9685 f2fs: allow wrong configured dio to buffered write
This fixes to support dio having unaligned buffers as buffered writes.

xfs_io -f -d -c "pwrite 0 512" $testfile
 -> okay

xfs_io -f -d -c "pwrite 1 512" $testfile
 -> EINVAL

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-15 12:14:04 +09:00
Jaegeuk Kim
7f2ecdd837 f2fs: flush journal nat entries for nat_bits during unmount
Let's flush journal nat entries for speed up in the next run.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-11 19:54:51 -07:00
Linus Torvalds
7a932516f5 vfs/y2038: inode timestamps conversion to timespec64
This is a late set of changes from Deepa Dinamani doing an automated
 treewide conversion of the inode and iattr structures from 'timespec'
 to 'timespec64', to push the conversion from the VFS layer into the
 individual file systems.
 
 There were no conflicts between this and the contents of linux-next
 until just before the merge window, when we saw multiple problems:
 
 - A minor conflict with my own y2038 fixes, which I could address
   by adding another patch on top here.
 - One semantic conflict with late changes to the NFS tree. I addressed
   this by merging Deepa's original branch on top of the changes that
   now got merged into mainline and making sure the merge commit includes
   the necessary changes as produced by coccinelle.
 - A trivial conflict against the removal of staging/lustre.
 - Multiple conflicts against the VFS changes in the overlayfs tree.
   These are still part of linux-next, but apparently this is no longer
   intended for 4.18 [1], so I am ignoring that part.
 
 As Deepa writes:
 
   The series aims to switch vfs timestamps to use struct timespec64.
   Currently vfs uses struct timespec, which is not y2038 safe.
 
   The series involves the following:
   1. Add vfs helper functions for supporting struct timepec64 timestamps.
   2. Cast prints of vfs timestamps to avoid warnings after the switch.
   3. Simplify code using vfs timestamps so that the actual
      replacement becomes easy.
   4. Convert vfs timestamps to use struct timespec64 using a script.
      This is a flag day patch.
 
   Next steps:
   1. Convert APIs that can handle timespec64, instead of converting
      timestamps at the boundaries.
   2. Update internal data structures to avoid timestamp conversions.
 
 Thomas Gleixner adds:
 
   I think there is no point to drag that out for the next merge window.
   The whole thing needs to be done in one go for the core changes which
   means that you're going to play that catchup game forever. Let's get
   over with it towards the end of the merge window.
 
 [1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
 MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
 g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
 L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
 Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
 LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
 nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
 wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
 c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
 tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
 WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
 A3HkjIBrKW5AgQDxfgvm
 =CZX2
 -----END PGP SIGNATURE-----

Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground

Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
 "This is a late set of changes from Deepa Dinamani doing an automated
  treewide conversion of the inode and iattr structures from 'timespec'
  to 'timespec64', to push the conversion from the VFS layer into the
  individual file systems.

  As Deepa writes:

   'The series aims to switch vfs timestamps to use struct timespec64.
    Currently vfs uses struct timespec, which is not y2038 safe.

    The series involves the following:
    1. Add vfs helper functions for supporting struct timepec64
       timestamps.
    2. Cast prints of vfs timestamps to avoid warnings after the switch.
    3. Simplify code using vfs timestamps so that the actual replacement
       becomes easy.
    4. Convert vfs timestamps to use struct timespec64 using a script.
       This is a flag day patch.

    Next steps:
    1. Convert APIs that can handle timespec64, instead of converting
       timestamps at the boundaries.
    2. Update internal data structures to avoid timestamp conversions'

  Thomas Gleixner adds:

   'I think there is no point to drag that out for the next merge
    window. The whole thing needs to be done in one go for the core
    changes which means that you're going to play that catchup game
    forever. Let's get over with it towards the end of the merge window'"

* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
  pstore: Remove bogus format string definition
  vfs: change inode times to use struct timespec64
  pstore: Convert internal records to timespec64
  udf: Simplify calls to udf_disk_stamp_to_time
  fs: nfs: get rid of memcpys for inode times
  ceph: make inode time prints to be long long
  lustre: Use long long type to print inode time
  fs: add timespec64_truncate()
2018-06-15 07:31:07 +09:00
Arnd Bergmann
15eefe2a99 Merge branch 'vfs_timespec64' of https://github.com/deepa-hub/vfs into vfs-timespec64
Pull the timespec64 conversion from Deepa Dinamani:
 "The series aims to switch vfs timestamps to use
  struct timespec64. Currently vfs uses struct timespec,
  which is not y2038 safe.

  The flag patch applies cleanly. I've not seen the timestamps
  update logic change often. The series applies cleanly on 4.17-rc6
  and linux-next tip (top commit: next-20180517).

  I'm not sure how to merge this kind of a series with a flag patch.
  We are targeting 4.18 for this.
  Let me know if you have other suggestions.

  The series involves the following:
  1. Add vfs helper functions for supporting struct timepec64 timestamps.
  2. Cast prints of vfs timestamps to avoid warnings after the switch.
  3. Simplify code using vfs timestamps so that the actual
     replacement becomes easy.
  4. Convert vfs timestamps to use struct timespec64 using a script.
     This is a flag day patch.

  I've tried to keep the conversions with the script simple, to
  aid in the reviews. I've kept all the internal filesystem data
  structures and function signatures the same.

  Next steps:
  1. Convert APIs that can handle timespec64, instead of converting
     timestamps at the boundaries.
  2. Update internal data structures to avoid timestamp conversions."

I've pulled it into a branch based on top of the NFS changes that
are now in mainline, so I could resolve the non-obvious conflict
between the two while merging.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-06-14 14:54:00 +02:00
Kees Cook
9d2a789c1d treewide: Use array_size in f2fs_kvzalloc()
The f2fs_kvzalloc() function has no 2-factor argument form, so
multiplication factors need to be wrapped in array_size(). This patch
replaces cases of:

        f2fs_kvzalloc(handle, a * b, gfp)

with:
        f2fs_kvzalloc(handle, array_size(a, b), gfp)

as well as handling cases of:

        f2fs_kvzalloc(handle, a * b * c, gfp)

with:

        f2fs_kvzalloc(handle, array3_size(a, b, c), gfp)

This does, however, attempt to ignore constant size factors like:

        f2fs_kvzalloc(handle, 4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
expression HANDLE;
type TYPE;
expression THING, E;
@@

(
  f2fs_kvzalloc(HANDLE,
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression HANDLE;
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  f2fs_kvzalloc(HANDLE,
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
expression HANDLE;
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
expression HANDLE;
identifier SIZE, COUNT;
@@

  f2fs_kvzalloc(HANDLE,
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression HANDLE;
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression HANDLE;
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
expression HANDLE;
identifier STRIDE, SIZE, COUNT;
@@

(
  f2fs_kvzalloc(HANDLE,
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression HANDLE;
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  f2fs_kvzalloc(HANDLE, C1 * C2 * C3, ...)
|
  f2fs_kvzalloc(HANDLE,
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression HANDLE;
expression E1, E2;
constant C1, C2;
@@

(
  f2fs_kvzalloc(HANDLE, C1 * C2, ...)
|
  f2fs_kvzalloc(HANDLE,
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
026f05079b treewide: Use array_size() in f2fs_kzalloc()
The f2fs_kzalloc() function has no 2-factor argument form, so
multiplication factors need to be wrapped in array_size(). This patch
replaces cases of:

        f2fs_kzalloc(handle, a * b, gfp)

with:
        f2fs_kzalloc(handle, array_size(a, b), gfp)

as well as handling cases of:

        f2fs_kzalloc(handle, a * b * c, gfp)

with:

        f2fs_kzalloc(handle, array3_size(a, b, c), gfp)

This does, however, attempt to ignore constant size factors like:

        f2fs_kzalloc(handle, 4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
expression HANDLE;
type TYPE;
expression THING, E;
@@

(
  f2fs_kzalloc(HANDLE,
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression HANDLE;
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  f2fs_kzalloc(HANDLE,
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
expression HANDLE;
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
expression HANDLE;
identifier SIZE, COUNT;
@@

  f2fs_kzalloc(HANDLE,
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression HANDLE;
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression HANDLE;
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
expression HANDLE;
identifier STRIDE, SIZE, COUNT;
@@

(
  f2fs_kzalloc(HANDLE,
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression HANDLE;
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  f2fs_kzalloc(HANDLE, C1 * C2 * C3, ...)
|
  f2fs_kzalloc(HANDLE,
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression HANDLE;
expression E1, E2;
constant C1, C2;
@@

(
  f2fs_kzalloc(HANDLE, C1 * C2, ...)
|
  f2fs_kzalloc(HANDLE,
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
c86065938a treewide: Use array_size() in f2fs_kmalloc()
The f2fs_kmalloc() function has no 2-factor argument form, so
multiplication factors need to be wrapped in array_size(). This patch
replaces cases of:

        f2fs_kmalloc(handle, a * b, gfp)

with:
        f2fs_kmalloc(handle, array_size(a, b), gfp)

as well as handling cases of:

        f2fs_kmalloc(handle, a * b * c, gfp)

with:

        f2fs_kmalloc(handle, array3_size(a, b, c), gfp)

This does, however, attempt to ignore constant size factors like:

        f2fs_kmalloc(handle, 4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
expression HANDLE;
type TYPE;
expression THING, E;
@@

(
  f2fs_kmalloc(HANDLE,
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression HANDLE;
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  f2fs_kmalloc(HANDLE,
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
expression HANDLE;
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
expression HANDLE;
identifier SIZE, COUNT;
@@

  f2fs_kmalloc(HANDLE,
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression HANDLE;
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression HANDLE;
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
expression HANDLE;
identifier STRIDE, SIZE, COUNT;
@@

(
  f2fs_kmalloc(HANDLE,
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression HANDLE;
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  f2fs_kmalloc(HANDLE, C1 * C2 * C3, ...)
|
  f2fs_kmalloc(HANDLE,
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression HANDLE;
expression E1, E2;
constant C1, C2;
@@

(
  f2fs_kmalloc(HANDLE, C1 * C2, ...)
|
  f2fs_kmalloc(HANDLE,
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Linus Torvalds
d54d35c501 f2fs-for-4.18-rc1
In this round, we've mainly focused on discard, aka unmap, control along with
 fstrim for Android-specific usage model. In addition, we've fixed writepage flow
 which returned EAGAIN previously resulting in EIO of fsync(2) due to mapping's
 error state. In order to avoid old MM bug [1], we decided not to use __GFP_ZERO
 for the mapping for node and meta page caches. As always, we've cleaned up many
 places for future fsverity and symbol conflicts.
 
 Enhancement:
  - do discard/fstrim in lower priority considering fs utilization
  - split large discard commands into smaller ones for better responsiveness
  - add more sanity checks to address syzbot reports
  - add a mount option, fsync_mode=nobarrier, which can reduce # of cache flushes
  - clean up symbol namespace with modified function names
  - be strict on block allocation and IO control in corner cases
 
 Bug fix:
  - don't use __GFP_ZERO for mappings
  - fix error reports in writepage to avoid fsync() failure
  - avoid selinux denial on CAP_RESOURCE on resgid/resuid
  - fix some subtle race conditions in GC/atomic writes/shutdown
  - fix overflow bugs in sanity_check_raw_super
  - fix missing bits on get_flags
 
 Clean-up:
  - prepare the generic flow for future fsverity integration
  - fix some broken coding standard
 
 [1] https://lkml.org/lkml/2018/4/8/661
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlsepb8ACgkQQBSofoJI
 UNJdSw/+IhrYJFkJEN/pV4M5xSjYirl/P2WJ4AGi6HcpjEGmaDiBi2whod1Jw2NE
 1auSMiby7K91VAmPvxMmmLhOdC8XgJ8jwY1nEaZMfmMXohlaD3FDY5bzYf5rJDF4
 J184P6xUZ2IKlFVA4prwNQgYi3awPthVu1lxbFPp8GUHDbmr5ZXEysxPDzz2O0Em
 oE7WmklmyCHJPhmg/EcVXfF/Ekf3zMOVR+EI2otcDjnWIQioVetIK8CKi0MM4bkG
 X8Z318ANjGTd42woupXIzsiTrMRONY7zzkUvE+S6tfUjKZoIdofDM5OIXMdOxpxL
 DZ53WrwfeB74igD8jDZgqD6OaonIfDfCuKrwUASFAC2Ou4h3apj3ckUzoHtAhEUL
 z5yTSKTrtfuoSufhBp+nKKs3ijDgms76arw8x/pPdN6D6xDwIJtBPxC2sObPaj35
 damv4GyM4+sbhGO/Gbie2q6za55IvYFZc7JNCC2D2K5tnBmUaa7/XdvxcyigniGk
 AZgkaddHePkAZpa5AYYirZR8bd7IFds0+m6VcybG0/pYb0qPEcI6U4mujBSCIwVy
 kXuD7su3jNjj6hWnCl5PSQo8yBWS5H8c6/o+5XHozzYA91dsLAmD8entuCreg6Hp
 NaIFio0qKULweLK86f66qQTsRPMpYRAtqPS0Ew0+3llKMcrlRp4=
 =JrW7
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've mainly focused on discard, aka unmap, control
  along with fstrim for Android-specific usage model. In addition, we've
  fixed writepage flow which returned EAGAIN previously resulting in EIO
  of fsync(2) due to mapping's error state. In order to avoid old MM bug
  [1], we decided not to use __GFP_ZERO for the mapping for node and
  meta page caches. As always, we've cleaned up many places for future
  fsverity and symbol conflicts.

  Enhancements:
   - do discard/fstrim in lower priority considering fs utilization
   - split large discard commands into smaller ones for better responsiveness
   - add more sanity checks to address syzbot reports
   - add a mount option, fsync_mode=nobarrier, which can reduce # of cache flushes
   - clean up symbol namespace with modified function names
   - be strict on block allocation and IO control in corner cases

  Bug fixes:
   - don't use __GFP_ZERO for mappings
   - fix error reports in writepage to avoid fsync() failure
   - avoid selinux denial on CAP_RESOURCE on resgid/resuid
   - fix some subtle race conditions in GC/atomic writes/shutdown
   - fix overflow bugs in sanity_check_raw_super
   - fix missing bits on get_flags

  Clean-ups:
   - prepare the generic flow for future fsverity integration
   - fix some broken coding standard"

[1] https://lkml.org/lkml/2018/4/8/661

* tag 'f2fs-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (79 commits)
  f2fs: fix to clear FI_VOLATILE_FILE correctly
  f2fs: let sync node IO interrupt async one
  f2fs: don't change wbc->sync_mode
  f2fs: fix to update mtime correctly
  fs: f2fs: insert space around that ':' and ', '
  fs: f2fs: add missing blank lines after declarations
  fs: f2fs: changed variable type of offset "unsigned" to "loff_t"
  f2fs: clean up symbol namespace
  f2fs: make set_de_type() static
  f2fs: make __f2fs_write_data_pages() static
  f2fs: fix to avoid accessing cross the boundary
  f2fs: fix to let caller retry allocating block address
  disable loading f2fs module on PAGE_SIZE > 4KB
  f2fs: fix error path of move_data_page
  f2fs: don't drop dentry pages after fs shutdown
  f2fs: fix to avoid race during access gc_thread pointer
  f2fs: clean up with clear_radix_tree_dirty_tag
  f2fs: fix to don't trigger writeback during recovery
  f2fs: clear discard_wake earlier
  f2fs: let discard thread wait a little longer if dev is busy
  ...
2018-06-11 10:16:13 -07:00
Deepa Dinamani
95582b0083 vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.

The change was made with the help of the following cocinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.

The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.

virtual patch

@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
  current_time ( ... )
  {
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
  ...
- return timespec_trunc(
+ return timespec64_trunc(
  ... );
  }

@ depends on patch @
identifier xtime;
@@
 struct \( iattr \| inode \| kstat \) {
 ...
-       struct timespec xtime;
+       struct timespec64 xtime;
 ...
 }

@ depends on patch @
identifier t;
@@
 struct inode_operations {
 ...
int (*update_time) (...,
-       struct timespec t,
+       struct timespec64 t,
...);
 ...
 }

@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
 fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
 ...) { ... }

@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
  ) { ... }

@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)

<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 =  timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)

@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
 fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}

@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1  ;
|
 node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 stat->xtime = node2->i_xtime1;
|
 stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1  ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
 node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-06-05 16:57:31 -07:00
Linus Torvalds
fd59ccc530 Add bunch of cleanups, and add support for the Speck128/256
algorithms.  Yes, Speck is contrversial, but the intention is to use
 them only for the lowest end Android devices, where the alternative
 *really* is no encryption at all for data stored at rest.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlsW/RAACgkQ8vlZVpUN
 gaO1pwf/WOusoXBK5sUuiC8d9I5s+OlPhTKhrh+BcL7/xhOkyh2xDv2FEwsjhwUf
 qo26AMf7DsWKWgJ6wDQ1z+PIuPSNeQy5dCKbz2hbfNjET3vdk2NuvPWnIbFrmIek
 LB6Ii9jKlPJRO4T3nMrE9JzJZLsoX5OKRSgYTT3EviuW/wSXaFyi7onFnyKXBnF/
 e689tE50P42PgTEDKs4RDw43PwEGbcl5Vtj+Lnoh6VGX/pYvL/9ZbEYlKrgqSOU4
 DmckR8D8UU/Gy6G5bvMsVuJpLEU7vBxupOOHI/nJFwR6tuYi0Q1j7C/zH8BvWp5e
 o8P5GpOWk7Gm346FaUlkAZ+25bCU+A==
 =EBeE
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Add bunch of cleanups, and add support for the Speck128/256
  algorithms.

  Yes, Speck is contrversial, but the intention is to use them only for
  the lowest end Android devices, where the alternative *really* is no
  encryption at all for data stored at rest"

* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
  fscrypt: log the crypto algorithm implementations
  fscrypt: add Speck128/256 support
  fscrypt: only derive the needed portion of the key
  fscrypt: separate key lookup from key derivation
  fscrypt: use a common logging function
  fscrypt: remove internal key size constants
  fscrypt: remove unnecessary check for non-logon key type
  fscrypt: make fscrypt_operations.max_namelen an integer
  fscrypt: drop empty name check from fname_decrypt()
  fscrypt: drop max_namelen check from fname_decrypt()
  fscrypt: don't special-case EOPNOTSUPP from fscrypt_get_encryption_info()
  fscrypt: don't clear flags on crypto transform
  fscrypt: remove stale comment from fscrypt_d_revalidate()
  fscrypt: remove error messages for skcipher_request_alloc() failure
  fscrypt: remove unnecessary NULL check when allocating skcipher
  fscrypt: clean up after fscrypt_prepare_lookup() conversions
  fs, fscrypt: only define ->s_cop when FS_ENCRYPTION is enabled
  fscrypt: use unbound workqueue for decryption
2018-06-05 15:15:32 -07:00
Chao Yu
dfa742803f f2fs: fix to clear FI_VOLATILE_FILE correctly
Thread A			Thread B
- f2fs_release_file
 - clear_inode_flag(FI_VOLATILE_FILE)
				- wb_writeback
				 - writeback_sb_inodes
				  - __writeback_single_inode
				   - do_writepages
				    - f2fs_write_data_pages
				     - __write_data_page
				     all volatile file's pages
				     are writebacked to storage
 - set_inode_flag(FI_DROP_CACHE)
 - filemap_fdatawrite

There is a hole that mm can flush all dirty pages of volatile file as
inode is not tagged with both FI_VOLATILE_FILE and FI_DROP_CACHE flags,
we should never writeback the page #0 and also it's unneeded to writeback
other pages.

This patch adjusts to relocate clear_inode_flag(FI_VOLATILE_FILE), so that
FI_VOLATILE_FILE flag can be remained before all dirty pages were dropped
to avoid issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-06-04 14:33:27 -07:00
Chao Yu
c29fd0c0e2 f2fs: let sync node IO interrupt async one
Although mixed sync/async IOs can have continuous LBA, as they have
different IO priority, block IO scheduler will add them into different
queues and commit them separately, result in splited IOs which causes
wrose performance.

This patch gives high priority to synchronous IO of nodes, means that
once synchronous flow starts, it can interrupt asynchronous writeback
flow of system flusher, so more big IOs can be expected.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-06-04 14:33:20 -07:00
Chao Yu
aae764ece6 f2fs: don't change wbc->sync_mode
We should never falsify wbc->sync_mode passed from mm, otherwise
mm can trigger writeback with wrong IO priority.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-06-04 14:32:44 -07:00
Chao Yu
a1f72ac2c0 f2fs: fix to update mtime correctly
If we change system time to the past, get_mtime() will return a
overflowed time, and SIT_I(sbi)->max_mtime will be udpated
incorrectly, this patch fixes the two issues.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-06-04 14:31:11 -07:00
Linus Torvalds
cf626b0da7 Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull procfs updates from Al Viro:
 "Christoph's proc_create_... cleanups series"

* 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits)
  xfs, proc: hide unused xfs procfs helpers
  isdn/gigaset: add back gigaset_procinfo assignment
  proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields
  tty: replace ->proc_fops with ->proc_show
  ide: replace ->proc_fops with ->proc_show
  ide: remove ide_driver_proc_write
  isdn: replace ->proc_fops with ->proc_show
  atm: switch to proc_create_seq_private
  atm: simplify procfs code
  bluetooth: switch to proc_create_seq_data
  netfilter/x_tables: switch to proc_create_seq_private
  netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data
  neigh: switch to proc_create_seq_data
  hostap: switch to proc_create_{seq,single}_data
  bonding: switch to proc_create_seq_data
  rtc/proc: switch to proc_create_single_data
  drbd: switch to proc_create_single
  resource: switch to proc_create_seq_data
  staging/rtl8192u: simplify procfs code
  jfs: simplify procfs code
  ...
2018-06-04 10:00:01 -07:00
youngjun yoo
1061fd484b fs: f2fs: insert space around that ':' and ', '
clean up checkpatch error:
ERROR: space required after that ':'
ERROR: space required after that ','

Signed-off-by: youngjun yoo <youngjun.willow@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:54 -07:00
youngjun yoo
f11e98bdbe fs: f2fs: add missing blank lines after declarations
clean up checkpatch warning:
WARNING: Missing a blank line after declarations

Signed-off-by: youngjun yoo <youngjun.willow@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:54 -07:00
youngjun yoo
193bea1dd8 fs: f2fs: changed variable type of offset "unsigned" to "loff_t"
clean up checkpatch warning:
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'

Signed-off-by: youngjun yoo <youngjun.willow@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:53 -07:00
Chao Yu
4d57b86dd8 f2fs: clean up symbol namespace
As Ted reported:

"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix.  There's well over a hundred (see attached below).

As one example, in fs/f2fs/dir.c there is:

unsigned char get_de_type(struct f2fs_dir_entry *de)

This function is clearly only useful for f2fs, but it has a generic
name.  This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build.  It also means that when someone is looking f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.

You might want to fix this at some point.  Hopefully Kent's bcachefs
isn't similarly using genericly named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.

acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."

This patch adds "f2fs_" prefix for all non-static symbols in order to:
a) avoid conflict with other kernel generic symbols;
b) to indicate the function is f2fs specific one instead of generic
one;

Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:53 -07:00
Chao Yu
2e79d951ff f2fs: make set_de_type() static
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:53 -07:00
Chao Yu
fc99fe27b9 f2fs: make __f2fs_write_data_pages() static
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:53 -07:00
Chao Yu
9fd62605bb f2fs: fix to avoid accessing cross the boundary
Configure io_bits with 2 and enable LFS mode, generic/017 reports below dmesg:

BUG: unable to handle kernel NULL pointer dereference at 00000039
*pdpt = 000000002fcb2001 *pde = 0000000000000000
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: crc32_generic zram f2fs(O) bnep rfcomm bluetooth ecdh_generic snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi pcbc snd_seq joydev aesni_intel aes_i586 snd_seq_device snd_timer crypto_simd cryptd snd soundcore i2c_piix4 serio_raw mac_hid video parport_pc ppdev lp parport hid_generic usbhid psmouse hid e1000
CPU: 2 PID: 20779 Comm: xfs_io Tainted: G           O      4.17.0-rc2 #38
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: is_checkpointed_data+0x84/0xd0 [f2fs]
EFLAGS: 00010207 CPU: 2
EAX: 00000000 EBX: f5cd7000 ECX: fffffe32 EDX: 00000039
ESI: 000001cd EDI: ec95fb6c EBP: e264bd80 ESP: e264bd6c
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: 00000039 CR3: 2fe55660 CR4: 000406f0
Call Trace:
 __exchange_data_block+0xb3f/0x1000 [f2fs]
 f2fs_fallocate+0xab9/0x16b0 [f2fs]
 vfs_fallocate+0x17c/0x2d0
 ksys_fallocate+0x42/0x70
 sys_fallocate+0x31/0x40
 do_fast_syscall_32+0xaa/0x22c
 entry_SYSENTER_32+0x4c/0x7b
EIP: 0xb7f98c51
EFLAGS: 00000293 CPU: 2
EAX: ffffffda EBX: 00000003 ECX: 00000008 EDX: 01001000
ESI: 00000000 EDI: 00001000 EBP: 00000000 ESP: bfc0357c
 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
Code: 00 00 d3 e8 8b 4d ec 2b 02 8b 55 f0 6b c0 1c 03 41 70 29 d6 8b 93 d0 06 00 00 8b 40 0c 83 ea 01 21 d6 89 f2 89 f1 c1 ea 03 f7 d1 <0f> be 14 10 83 e1 07 b8 01 00 00 00 d3 e0 85 c2 89 f8 0f 95 c3
EIP: is_checkpointed_data+0x84/0xd0 [f2fs] SS:ESP: 0068:e264bd6c
CR2: 0000000000000039
---[ end trace 9a4d4087cce6080a ]---

This is because in recovery flow of __exchange_data_block, we didn't pass olen to
__roll_back_blkaddrs, instead we passed len, which indicates wrong array size, result
in copying random block address into dnode page.

Later, once that random block address was accessed by is_checkpointed_data, it can
cause NULL pointer dereference.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:53 -07:00
Chao Yu
fe16efe6a7 f2fs: fix to let caller retry allocating block address
Configure io_bits with 2 and enable LFS mode, generic/013 reports below dmesg:

BUG: unable to handle kernel NULL pointer dereference at 00000104
*pdpt = 0000000029b7b001 *pde = 0000000000000000
Oops: 0002 [#1] PREEMPT SMP
Modules linked in: crc32_generic zram f2fs(O) rfcomm bnep bluetooth ecdh_generic snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq pcbc joydev snd_seq_device aesni_intel snd_timer aes_i586 snd crypto_simd cryptd soundcore i2c_piix4 serio_raw mac_hid video parport_pc ppdev lp parport hid_generic psmouse usbhid hid e1000
CPU: 0 PID: 11161 Comm: fsstress Tainted: G           O      4.17.0-rc2 #38
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: f2fs_submit_page_write+0x28d/0x550 [f2fs]
EFLAGS: 00010206 CPU: 0
EAX: e863dcd8 EBX: 00000000 ECX: 00000100 EDX: 00000200
ESI: e863dcf4 EDI: f6f82768 EBP: e863dbb0 ESP: e863db74
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: 00000104 CR3: 29a62020 CR4: 000406f0
Call Trace:
 do_write_page+0x6f/0xc0 [f2fs]
 write_data_page+0x4a/0xd0 [f2fs]
 do_write_data_page+0x327/0x630 [f2fs]
 __write_data_page+0x34b/0x820 [f2fs]
 __f2fs_write_data_pages+0x42d/0x8c0 [f2fs]
 f2fs_write_data_pages+0x27/0x30 [f2fs]
 do_writepages+0x1a/0x70
 __filemap_fdatawrite_range+0x94/0xd0
 filemap_write_and_wait_range+0x3d/0xa0
 __generic_file_write_iter+0x11a/0x1f0
 f2fs_file_write_iter+0xdd/0x3b0 [f2fs]
 __vfs_write+0xd2/0x150
 vfs_write+0x9b/0x190
 ksys_write+0x45/0x90
 sys_write+0x16/0x20
 do_fast_syscall_32+0xaa/0x22c
 entry_SYSENTER_32+0x4c/0x7b
EIP: 0xb7fc8c51
EFLAGS: 00000246 CPU: 0
EAX: ffffffda EBX: 00000003 ECX: 09cde000 EDX: 00001000
ESI: 00000003 EDI: 00001000 EBP: 00000000 ESP: bfbded38
 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
Code: e8 f9 77 34 c9 8b 45 e0 8b 80 b8 00 00 00 39 45 d8 0f 84 bb 02 00 00 8b 45 e0 8b 80 b8 00 00 00 8d 50 d8 8b 08 89 55 f0 8b 50 04 <89> 51 04 89 0a c7 00 00 01 00 00 c7 40 04 00 02 00 00 8b 45 dc
EIP: f2fs_submit_page_write+0x28d/0x550 [f2fs] SS:ESP: 0068:e863db74
CR2: 0000000000000104
---[ end trace 4cac79c0d1305ee6 ]---

allocate_data_block will submit all sequential pending IOs sorted by a
FIFO list, If we failed to submit other user's IO due to unaligned write,
we will retry to allocate new block address for current IO, then it will
initialize fio.list again, if fio was in the list before, it can break
FIFO list, result in above panic.

Thread A			Thread B
- do_write_page
 - allocate_data_block
  - list_add_tail
  : fioA cached in FIFO list.
				- do_write_page
				 - allocate_data_block
				  - list_add_tail
				  : fioB cached in FIFO list.
				 - f2fs_submit_page_write
				 : fail to submit IO
				 - allocate_data_block
				  - INIT_LIST_HEAD
 - f2fs_submit_page_write
  - list_del  <-- NULL pointer dereference

This patch adds fio.retry parameter to indicate failure status for each
IO, and avoid bailing out if there is still pending IO in FIFO list for
fixing.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:53 -07:00
Anatoly Pugachev
4071e67cff disable loading f2fs module on PAGE_SIZE > 4KB
The following patch disables loading of f2fs module on architectures
which have PAGE_SIZE > 4096 , since it is impossible to mount f2fs on
such architectures , log messages are:

mount: /mnt: wrong fs type, bad option, bad superblock on
/dev/vdiskb1, missing codepage or helper program, or other error.
/dev/vdiskb1: F2FS filesystem,
UUID=1d8b9ca4-2389-4910-af3b-10998969f09c, volume name ""

May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid
page_cache_size (8192), supports only 4KB
May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Can't find valid F2FS
filesystem in 1th superblock
May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid
page_cache_size (8192), supports only 4KB
May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Can't find valid F2FS
filesystem in 2th superblock
May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid
page_cache_size (8192), supports only 4KB

which was introduced by git commit 5c9b469295

tested on git kernel 4.17.0-rc6-00309-gec30dcf7f425

with patch applied:

modprobe: ERROR: could not insert 'f2fs': Invalid argument
May 28 01:40:28 v215 kernel: F2FS not supported on PAGE_SIZE(8192) != 4096

Signed-off-by: Anatoly Pugachev <matorola@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Chao Yu
14a28559f4 f2fs: fix error path of move_data_page
This patch fixes error path of move_data_page:
- clear cold data flag if it fails to write page.
- redirty page for non-ENOMEM case.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Chao Yu
1174abfd83 f2fs: don't drop dentry pages after fs shutdown
As description in commit "f2fs: don't drop any page on f2fs_cp_error()
case":

"We still provide readdir() after shtudown, so we should keep pages to
avoid additional IOs."

In order to provider lastest directory structure, let's keep dentry
pages in cache after fs shutdown.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Chao Yu
250dbf5151 f2fs: fix to avoid race during access gc_thread pointer
Thread A			Thread B
- f2fs_remount
 - stop_gc_thread
				- f2fs_sbi_store
   sbi->gc_thread = NULL;
				  access sbi->gc_thread->gc_*

Previously, we allocate memory for sbi->gc_thread based on background
gc thread mount option, the memory can be released if we turn off
that mount option, but still there are several places access gc_thread
pointer without considering race condition, result in NULL point
dereference.

In order to fix this issue, use sb->s_umount to exclude those operations.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Chao Yu
aec2f729fc f2fs: clean up with clear_radix_tree_dirty_tag
Introduce clear_radix_tree_dirty_tag to include common codes for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Chao Yu
64c74a7ab5 f2fs: fix to don't trigger writeback during recovery
- f2fs_fill_super
 - recover_fsync_data
  - recover_data
   - del_fsync_inode
    - iput
     - iput_final
      - write_inode_now
       - f2fs_write_inode
        - f2fs_balance_fs
         - f2fs_balance_fs_bg
          - sync_dirty_inodes

With data_flush mount option, during recovery, in order to avoid entering
above writeback flow, let's detect recovery status and do skip in
f2fs_balance_fs_bg.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Sheng Yong
35a9a766a1 f2fs: clear discard_wake earlier
If SBI_NEED_FSCK is set, discard_wake will never be cleared. As a
result, the condition of wait_event_interruptible_timeout() is always
true, which gets discard thread run too frequently.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Yunlei He
f9d1dced75 f2fs: let discard thread wait a little longer if dev is busy
This patch modify discard thread wait policy as below:
	issued       io_interrupted     wait time(ms)
1.        8                 0               50
2.      (0,8)               1               50
3.        0                 1              500 (dev is busy)
4.        0                 0            60000 (no candidates)

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:51 -07:00
Chao Yu
2ef79ecb5e f2fs: avoid stucking GC due to atomic write
f2fs doesn't allow abuse on atomic write class interface, so except
limiting in-mem pages' total memory usage capacity, we need to limit
atomic-write usage as well when filesystem is seriously fragmented,
otherwise we may run into infinite loop during foreground GC because
target blocks in victim segment are belong to atomic opened file for
long time.

Now, we will detect failure due to atomic write in foreground GC, if
the count exceeds threshold, we will drop all atomic written data in
cache, by this, I expect it can keep our system running safely to
prevent Dos attack.

In addition, his patch adds to show GC skip information in debugfs,
now it just shows count of skipped caused by atomic write.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:51 -07:00
Jaegeuk Kim
5b0e95398e f2fs: introduce sbi->gc_mode to determine the policy
This is to avoid sbi->gc_thread pointer access.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:51 -07:00
Chao Yu
107a805de8 f2fs: keep migration IO order in LFS mode
For non-migration IO, we will keep order of data/node blocks' submitting
as allocation sequence by sorting IOs in per log io_list list, but for
migration IO, it could be out-of-order.

In LFS mode, we should keep all IOs including migration IO be ordered,
so that this patch fixes to add an additional lock to keep submitting
order.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:51 -07:00
Chao Yu
e5e5732d81 f2fs: fix to wait page writeback during revoking atomic write
After revoking atomic write, related LBA can be reused by others, so we
need to wait page writeback before reusing the LBA, in order to avoid
interference between old atomic written in-flight IO and new IO.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:51 -07:00
Sahitya Tummala
60b2b4ee2b f2fs: Fix deadlock in shutdown ioctl
f2fs_ioc_shutdown() ioctl gets stuck in the below path
when issued with F2FS_GOING_DOWN_FULLSYNC option.

__switch_to+0x90/0xc4
percpu_down_write+0x8c/0xc0
freeze_super+0xec/0x1e4
freeze_bdev+0xc4/0xcc
f2fs_ioctl+0xc0c/0x1ce0
f2fs_compat_ioctl+0x98/0x1f0

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:51 -07:00
Chao Yu
f8de433122 f2fs: detect synchronous writeback more earlier
This patch changes to detect synchronous writeback more earlier before,
in order to avoid unnecessary page writeback before exiting asynchronous
writeback.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:50 -07:00
Chao Yu
7b525dd013 f2fs: clean up with is_valid_blkaddr()
- rename is_valid_blkaddr() to is_valid_meta_blkaddr() for readability.
- introduce is_valid_blkaddr() for cleanup.

No logic change in this patch.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:50 -07:00
Chao Yu
5ad25442b6 f2fs: fix to initialize min_mtime with ULLONG_MAX
Since sit_i.min_mtime's type is unsigned long long, so we should
initialize it with max value of the type ULLONG_MAX instead of
LLONG_MAX.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:50 -07:00
Chao Yu
e7a4feb0ab f2fs: fix to let checkpoint guarantee atomic page persistence
1. thread A: commit_inmem_pages submit data into block layer, but
haven't waited it writeback.
2. thread A: commit_inmem_pages update related node.
3. thread B: do checkpoint, flush all nodes to disk.
4. SPOR

Then, atomic file becomes corrupted since nodes is flushed before data.

This patch fixes to treat atomic page as checkpoint guaranteed one,
then in checkpoint, we can make sure all atomic page can be writebacked
with metadata of atomic file.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:50 -07:00
Chao Yu
1c41e6808e f2fs: fix to initialize i_current_depth according to inode type
i_current_depth is used only for directory inode, but its space is
shared with i_gc_failures field used for regular inode, in order to
avoid affecting i_gc_failures' value, this patch fixes to initialize
the union's fields according to inode type.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:50 -07:00
Chao Yu
299254d85d Revert "f2fs: add ovp valid_blocks check for bg gc victim to fg_gc"
For extreme case:
10 section, op = 10%, no_fggc_threshold = 90%
All section usage: 85% 85% 85% 85% 90% 90% 95% 95% 95% 95%

During foreground GC, if we skip select dirty section whose usage
is larger than no_fggc_threshold, we can only recycle 80% invalid
space from four 85% usage sections and two 90% usage sections,
result in encountering out-of-space issue.

This reverts commit e93b986525 to
fix this issue, besides, we keep the logic that we scan all dirty
section when searching a victim, so that GC can select victim with
least valid blocks.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:50 -07:00
Jaegeuk Kim
868de6135f f2fs: don't drop any page on f2fs_cp_error() case
We still provide readdir() after shtudown, so we should keep pages to avoid
additional IOs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Colin Ian King
4580038e2d f2fs: fix spelling mistake: "extenstion" -> "extension"
Trivial fix to spelling mistake in extension list text

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Jaegeuk Kim
0cfe75c5b0 f2fs: enhance sanity_check_raw_super() to avoid potential overflows
In order to avoid the below overflow issue, we should have checked the
boundaries in superblock before reaching out to allocation. As Linus suggested,
the right place should be sanity_check_raw_super().

Dr Silvio Cesare of InfoSect reported:

There are integer overflows with using the cp_payload superblock field in the
f2fs filesystem potentially leading to memory corruption.

include/linux/f2fs_fs.h

struct f2fs_super_block {
...
        __le32 cp_payload;

fs/f2fs/f2fs.h

typedef u32 block_t;    /*
                         * should not change u32, since it is the on-disk block
                         * address format, __le32.
                         */
...

static inline block_t __cp_payload(struct f2fs_sb_info *sbi)
{
        return le32_to_cpu(F2FS_RAW_SUPER(sbi)->cp_payload);
}

fs/f2fs/checkpoint.c

        block_t start_blk, orphan_blocks, i, j;
...
        start_blk = __start_cp_addr(sbi) + 1 + __cp_payload(sbi);
        orphan_blocks = __start_sum_addr(sbi) - 1 - __cp_payload(sbi);

+++ integer overflows

...
        unsigned int cp_blks = 1 + __cp_payload(sbi);
...
        sbi->ckpt = kzalloc(cp_blks * blk_size, GFP_KERNEL);

+++ integer overflow leading to incorrect heap allocation.

        int cp_payload_blks = __cp_payload(sbi);
...
        ckpt->cp_pack_start_sum = cpu_to_le32(1 + cp_payload_blks +
                        orphan_blocks);

+++ sign bug and integer overflow

...
        for (i = 1; i < 1 + cp_payload_blks; i++)

+++ integer overflow

...

      sbi->max_orphans = (sbi->blocks_per_seg - F2FS_CP_PACKS -
                        NR_CURSEG_TYPE - __cp_payload(sbi)) *
                                F2FS_ORPHANS_PER_BLOCK;

+++ integer overflow

Reported-by: Greg KH <greg@kroah.com>
Reported-by: Silvio Cesare <silvio.cesare@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Chao Yu
b4c3ca8ba9 f2fs: treat volatile file's data as hot one
Volatile file's data will be updated oftenly, so it'd better to place
its data into hot data segment.

In addition, for atomic file, we change to check FI_ATOMIC_FILE instead
of FI_HOT_DATA to make code readability better.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Chao Yu
af8ff65bb8 f2fs: introduce release_discard_addr() for cleanup
Introduce release_discard_addr() to include common codes for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Fengguang Wu: declare static function, reported by kbuild test robot]
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Chao Yu
a9af3fdcc4 f2fs: fix potential overflow
In build_sit_entries(), if valid_blocks in SIT block is smaller than
valid_blocks in journal, for below calculation:

sbi->discard_blks += old_valid_blocks - se->valid_blocks;

There will be two times potential overflow:
- old_valid_blocks - se->valid_blocks will overflow, and be a very
large number.
- sbi->discard_blks += result will overflow again, comes out a correct
result accidently.

Anyway, it should be fixed.

Fixes: d600af236d ("f2fs: avoid unneeded loop in build_sit_entries")
Fixes: 1f43e2ad7b ("f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Chao Yu
b2532c6940 f2fs: rename dio_rwsem to i_gc_rwsem
RW semphore dio_rwsem in struct f2fs_inode_info is introduced to avoid
race between dio and data gc, but now, it is more wildly used to avoid
foreground operation vs data gc. So rename it to i_gc_rwsem to improve
its readability.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Yunlei He
b82f6e347b f2fs: move mnt_want_write_file after range check
This patch move mnt_want_write_file after range check,
it's needless to check arguments with it.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Yunlei He
cba41be08c f2fs: fix missing clear FI_NO_PREALLOC in some error case
This patch fix missing clear FI_NO_PREALLOC in some error case

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Jaegeuk Kim
ade990f95e f2fs: enforce fsync_mode=strict for renamed directory
This is to give a option for user to be able to recover B/foo in the below
case.

mkdir A
sync()
rename(A, B)
creat (B/foo)
fsync (B/foo)
---crash---

Sugessted-by: Velayudhan Pillai <vijay@cs.utexas.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Jaegeuk Kim
8a29c1260e f2fs: sanity check for total valid node blocks
This patch enhances sanity check for SIT entries.

syzbot hit the following crash on upstream commit
83beed7b2b (Fri Apr 20 17:56:32 2018 +0000)
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=bf9253040425feb155ad

syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5692130282438656
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5095924598571008
Kernel config: https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+bf9253040425feb155ad@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.

F2FS-fs (loop0): invalid crc value
F2FS-fs (loop0): Try to recover 1th superblock, ret: 0
F2FS-fs (loop0): Mounted with checkpoint version = d
F2FS-fs (loop0): Bitmap was wrongly cleared, blk:9740
------------[ cut here ]------------
kernel BUG at fs/f2fs/segment.c:1884!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4508 Comm: syz-executor0 Not tainted 4.17.0-rc1+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:update_sit_entry+0x1215/0x1590 fs/f2fs/segment.c:1882
RSP: 0018:ffff8801af526708 EFLAGS: 00010282
RAX: ffffed0035ea4cc0 RBX: ffff8801ad454f90 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff82eeb87e RDI: ffffed0035ea4cb6
RBP: ffff8801af526760 R08: ffff8801ad4a2480 R09: ffffed003b5e4f90
R10: ffffed003b5e4f90 R11: ffff8801daf27c87 R12: ffff8801adb8d380
R13: 0000000000000001 R14: 0000000000000008 R15: 00000000ffffffff
FS:  00000000014af940(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f06bc223000 CR3: 00000001adb02000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 allocate_data_block+0x66f/0x2050 fs/f2fs/segment.c:2663
 do_write_page+0x105/0x1b0 fs/f2fs/segment.c:2727
 write_node_page+0x129/0x350 fs/f2fs/segment.c:2770
 __write_node_page+0x7da/0x1370 fs/f2fs/node.c:1398
 sync_node_pages+0x18cf/0x1eb0 fs/f2fs/node.c:1652
 block_operations+0x429/0xa60 fs/f2fs/checkpoint.c:1088
 write_checkpoint+0x3ba/0x5380 fs/f2fs/checkpoint.c:1405
 f2fs_sync_fs+0x2fb/0x6a0 fs/f2fs/super.c:1077
 __sync_filesystem fs/sync.c:39 [inline]
 sync_filesystem+0x265/0x310 fs/sync.c:67
 generic_shutdown_super+0xd7/0x520 fs/super.c:429
 kill_block_super+0xa4/0x100 fs/super.c:1191
 kill_f2fs_super+0x9f/0xd0 fs/f2fs/super.c:3030
 deactivate_locked_super+0x97/0x100 fs/super.c:316
 deactivate_super+0x188/0x1b0 fs/super.c:347
 cleanup_mnt+0xbf/0x160 fs/namespace.c:1174
 __cleanup_mnt+0x16/0x20 fs/namespace.c:1181
 task_work_run+0x1e4/0x290 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:191 [inline]
 exit_to_usermode_loop+0x2bd/0x310 arch/x86/entry/common.c:166
 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
 syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
 do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457d97
RSP: 002b:00007ffd46f9c8e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000457d97
RDX: 00000000014b09a3 RSI: 0000000000000002 RDI: 00007ffd46f9da50
RBP: 00007ffd46f9da50 R08: 0000000000000000 R09: 0000000000000009
R10: 0000000000000005 R11: 0000000000000246 R12: 00000000014b0940
R13: 0000000000000000 R14: 0000000000000002 R15: 000000000000658e
RIP: update_sit_entry+0x1215/0x1590 fs/f2fs/segment.c:1882 RSP: ffff8801af526708
---[ end trace f498328bb02610a2 ]---

Reported-and-tested-by: syzbot+bf9253040425feb155ad@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+7d6d31d3bc702f566ce3@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+0a725420475916460f12@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Jaegeuk Kim
b2ca374f33 f2fs: sanity check on sit entry
syzbot hit the following crash on upstream commit
87ef12027b (Wed Apr 18 19:48:17 2018 +0000)
Merge tag 'ceph-for-4.17-rc2' of git://github.com/ceph/ceph-client
syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=83699adeb2d13579c31e

C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5805208181407744
syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=6005073343676416
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=6555047731134464
Kernel config: https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+83699adeb2d13579c31e@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.

F2FS-fs (loop0): Magic Mismatch, valid(0xf2f52010) - read(0x0)
F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
F2FS-fs (loop0): invalid crc value
BUG: unable to handle kernel paging request at ffffed006b2a50c0
PGD 21ffee067 P4D 21ffee067 PUD 21fbeb067 PMD 0
Oops: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4514 Comm: syzkaller989480 Not tainted 4.17.0-rc1+ #8
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:build_sit_entries fs/f2fs/segment.c:3653 [inline]
RIP: 0010:build_segment_manager+0x7ef7/0xbf70 fs/f2fs/segment.c:3852
RSP: 0018:ffff8801b102e5b0 EFLAGS: 00010a06
RAX: 1ffff1006b2a50c0 RBX: 0000000000000004 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8801ac74243e
RBP: ffff8801b102f410 R08: ffff8801acbd46c0 R09: fffffbfff14d9af8
R10: fffffbfff14d9af8 R11: ffff8801acbd46c0 R12: ffff8801ac742a80
R13: ffff8801d9519100 R14: dffffc0000000000 R15: ffff880359528600
FS:  0000000001e04880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffed006b2a50c0 CR3: 00000001ac6ac000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 f2fs_fill_super+0x4095/0x7bf0 fs/f2fs/super.c:2803
 mount_bdev+0x30c/0x3e0 fs/super.c:1165
 f2fs_mount+0x34/0x40 fs/f2fs/super.c:3020
 mount_fs+0xae/0x328 fs/super.c:1268
 vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037
 vfs_kern_mount fs/namespace.c:1027 [inline]
 do_new_mount fs/namespace.c:2517 [inline]
 do_mount+0x564/0x3070 fs/namespace.c:2847
 ksys_mount+0x12d/0x140 fs/namespace.c:3063
 __do_sys_mount fs/namespace.c:3077 [inline]
 __se_sys_mount fs/namespace.c:3074 [inline]
 __x64_sys_mount+0xbe/0x150 fs/namespace.c:3074
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x443d6a
RSP: 002b:00007ffd312813c8 EFLAGS: 00000297 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000020000c00 RCX: 0000000000443d6a
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffd312813d0
RBP: 0000000000000003 R08: 0000000020016a00 R09: 000000000000000a
R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000004
R13: 0000000000402c60 R14: 0000000000000000 R15: 0000000000000000
RIP: build_sit_entries fs/f2fs/segment.c:3653 [inline] RSP: ffff8801b102e5b0
RIP: build_segment_manager+0x7ef7/0xbf70 fs/f2fs/segment.c:3852 RSP: ffff8801b102e5b0
CR2: ffffed006b2a50c0
---[ end trace a2034989e196ff17 ]---

Reported-and-tested-by: syzbot+83699adeb2d13579c31e@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Jaegeuk Kim
5d64600d4f f2fs: avoid bug_on on corrupted inode
syzbot has tested the proposed patch but the reproducer still triggered crash:
kernel BUG at fs/f2fs/inode.c:LINE!

F2FS-fs (loop1): invalid crc value
F2FS-fs (loop5): Magic Mismatch, valid(0xf2f52010) - read(0x0)
F2FS-fs (loop5): Can't find valid F2FS filesystem in 1th superblock
F2FS-fs (loop5): invalid crc value
------------[ cut here ]------------
kernel BUG at fs/f2fs/inode.c:238!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4886 Comm: syz-executor1 Not tainted 4.17.0-rc1+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:do_read_inode fs/f2fs/inode.c:238 [inline]
RIP: 0010:f2fs_iget+0x3307/0x3ca0 fs/f2fs/inode.c:313
RSP: 0018:ffff8801c44a70e8 EFLAGS: 00010293
RAX: ffff8801ce208040 RBX: ffff8801b3621080 RCX: ffffffff82eace18
F2FS-fs (loop2): Magic Mismatch, valid(0xf2f52010) - read(0x0)
RDX: 0000000000000000 RSI: ffffffff82eaf047 RDI: 0000000000000007
RBP: ffff8801c44a7410 R08: ffff8801ce208040 R09: ffffed0039ee4176
R10: ffffed0039ee4176 R11: ffff8801cf720bb7 R12: ffff8801c0efa000
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f753aa9d700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
------------[ cut here ]------------
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel BUG at fs/f2fs/inode.c:238!
CR2: 0000000001b03018 CR3: 00000001c8b74000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 f2fs_fill_super+0x4377/0x7bf0 fs/f2fs/super.c:2842
 mount_bdev+0x30c/0x3e0 fs/super.c:1165
 f2fs_mount+0x34/0x40 fs/f2fs/super.c:3020
 mount_fs+0xae/0x328 fs/super.c:1268
 vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037
 vfs_kern_mount fs/namespace.c:1027 [inline]
 do_new_mount fs/namespace.c:2517 [inline]
 do_mount+0x564/0x3070 fs/namespace.c:2847
 ksys_mount+0x12d/0x140 fs/namespace.c:3063
 __do_sys_mount fs/namespace.c:3077 [inline]
 __se_sys_mount fs/namespace.c:3074 [inline]
 __x64_sys_mount+0xbe/0x150 fs/namespace.c:3074
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457daa
RSP: 002b:00007f753aa9cba8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000020000000 RCX: 0000000000457daa
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007f753aa9cbf0
RBP: 0000000000000064 R08: 0000000020016a00 R09: 0000000020000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
R13: 0000000000000064 R14: 00000000006fcb80 R15: 0000000000000000
RIP: do_read_inode fs/f2fs/inode.c:238 [inline] RSP: ffff8801c44a70e8
RIP: f2fs_iget+0x3307/0x3ca0 fs/f2fs/inode.c:313 RSP: ffff8801c44a70e8
invalid opcode: 0000 [#2] SMP KASAN
---[ end trace 1cbcbec2156680bc ]---

Reported-and-tested-by: syzbot+41a1b341571f0952badb@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Jaegeuk Kim
a4f843bd00 f2fs: give message and set need_fsck given broken node id
syzbot hit the following crash on upstream commit
83beed7b2b (Fri Apr 20 17:56:32 2018 +0000)
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=d154ec99402c6f628887

C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5414336294027264
syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5471683234234368
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5436660795834368
Kernel config: https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+d154ec99402c6f628887@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.

F2FS-fs (loop0): Magic Mismatch, valid(0xf2f52010) - read(0x0)
F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
F2FS-fs (loop0): invalid crc value
------------[ cut here ]------------
kernel BUG at fs/f2fs/node.c:1185!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4549 Comm: syzkaller704305 Not tainted 4.17.0-rc1+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__get_node_page+0xb68/0x16e0 fs/f2fs/node.c:1185
RSP: 0018:ffff8801d960e820 EFLAGS: 00010293
RAX: ffff8801d88205c0 RBX: 0000000000000003 RCX: ffffffff82f6cc06
RDX: 0000000000000000 RSI: ffffffff82f6d5e8 RDI: 0000000000000004
RBP: ffff8801d960ec30 R08: ffff8801d88205c0 R09: ffffed003b5e46c2
R10: 0000000000000003 R11: 0000000000000003 R12: ffff8801a86e00c0
R13: 0000000000000001 R14: ffff8801a86e0530 R15: ffff8801d9745240
FS:  000000000072c880(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3d403209b8 CR3: 00000001d8f3f000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 get_node_page fs/f2fs/node.c:1237 [inline]
 truncate_xattr_node+0x152/0x2e0 fs/f2fs/node.c:1014
 remove_inode_page+0x200/0xaf0 fs/f2fs/node.c:1039
 f2fs_evict_inode+0xe86/0x1710 fs/f2fs/inode.c:547
 evict+0x4a6/0x960 fs/inode.c:557
 iput_final fs/inode.c:1519 [inline]
 iput+0x62d/0xa80 fs/inode.c:1545
 f2fs_fill_super+0x5f4e/0x7bf0 fs/f2fs/super.c:2849
 mount_bdev+0x30c/0x3e0 fs/super.c:1164
 f2fs_mount+0x34/0x40 fs/f2fs/super.c:3020
 mount_fs+0xae/0x328 fs/super.c:1267
 vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037
 vfs_kern_mount fs/namespace.c:1027 [inline]
 do_new_mount fs/namespace.c:2518 [inline]
 do_mount+0x564/0x3070 fs/namespace.c:2848
 ksys_mount+0x12d/0x140 fs/namespace.c:3064
 __do_sys_mount fs/namespace.c:3078 [inline]
 __se_sys_mount fs/namespace.c:3075 [inline]
 __x64_sys_mount+0xbe/0x150 fs/namespace.c:3075
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x443dea
RSP: 002b:00007ffcc7882368 EFLAGS: 00000297 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000020000c00 RCX: 0000000000443dea
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffcc7882370
RBP: 0000000000000003 R08: 0000000020016a00 R09: 000000000000000a
R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000004
R13: 0000000000402ce0 R14: 0000000000000000 R15: 0000000000000000
RIP: __get_node_page+0xb68/0x16e0 fs/f2fs/node.c:1185 RSP: ffff8801d960e820
---[ end trace 4edbeb71f002bb76 ]---

Reported-and-tested-by: syzbot+d154ec99402c6f628887@syzkaller.appspotmail.com
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Chao Yu
cf52b27a39 f2fs: clean up commit_inmem_pages()
This patch moves error handling from commit_inmem_pages() into
__commit_inmem_page() for cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:47 -07:00
Sheng Yong
eff15c2aec f2fs: do not check F2FS_INLINE_DOTS in recover
Only dir may have F2FS_INLINE_DOTS flag, so there is no need to check
the flag in recover flow.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:47 -07:00
Sheng Yong
a515d12f1b f2fs: remove duplicated dquot_initialize and fix error handling
This patch removes duplicated dquot_initialize in recover_orphan_inode(),
and fix the error handling if dquot_initialize fails.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:47 -07:00
Chao Yu
c22aecd759 f2fs: fix to detect failure of dquot_initialize
dquot_initialize() can fail due to any exception inside quota subsystem,
f2fs needs to be aware of it, and return correct return value to caller.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:47 -07:00
Yunlei He
d618477473 f2fs: stop issue discard if something wrong with f2fs
v4->v5: move data corruption check to __submit_discard_cmd, in order to
control discard io submitted more accurately, besides, increase async
thread wait time if data corruption detected.

This patch stop async thread and umount process to issue discard
if something wrong with f2fs, which is similar to fstrim.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:47 -07:00
Chao Yu
b169c3c560 f2fs: fix return value in f2fs_ioc_commit_atomic_write
In f2fs_ioc_commit_atomic_write, if file is volatile, return -EINVAL to
indicate that commit failure.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Yunlei He
054afda999 f2fs: allocate hot_data for atomic write more strictly
If a file not set type as hot, has dirty pages more than
threshold 64 before starting atomic write, may be lose hot
flag.

v1->v2: move set FI_ATOMIC_FILE flag behind flush dirty pages too,
in case of dirty pages before starting atomic use atomic mode to
write back.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Sheng Yong
d0891e84e1 f2fs: check if inmem_pages list is empty correctly
`cur' will never be NULL, we should check inmem_pages list instead.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Chao Yu
27319ba404 f2fs: fix race in between GC and atomic open
Thread					GC thread
- f2fs_ioc_start_atomic_write
 - get_dirty_pages
 - filemap_write_and_wait_range
					- f2fs_gc
					 - do_garbage_collect
					  - gc_data_segment
					   - move_data_page
					    - f2fs_is_atomic_file
					    - set_page_dirty
 - set_inode_flag(, FI_ATOMIC_FILE)

Dirty data page can still be generated by GC in race condition as
above call stack.

This patch adds fi->dio_rwsem[WRITE] in f2fs_ioc_start_atomic_write
to avoid such race.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Souptick Joarder
ea4d479bb3 fs: f2fs: Adding new return type vm_fault_t
Use new return type vm_fault_t for page_mkwrite
and fault handler.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Zhikang Zhang
d6964949e4 f2fs: change le32 to le16 of f2fs_inode->i_extra_size
In the structure of f2fs_inode, i_extra_size's type is __le16,
so we should keep type consistent when using it.

Fixes: 704956ecf5 ("f2fs: support inode checksum")
Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com>
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Zhikang Zhang
56b07e7e65 f2fs: check cur_valid_map_mir & raw_sit block count when flush sit entries
We should check valid_map_mir and block count to ensure
the flushed raw_sit is correct.

Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com>
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Chao Yu
3d165dc3ae f2fs: correct return value of f2fs_trim_fs
Correct return value in two cases:
- return EINVAL if end boundary is out-of-range.
- return EIO if fs needs off-line check.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:45 -07:00
Chao Yu
764811581d f2fs: fix to show missing bits in FS_IOC_GETFLAGS
This patch fixes to show missing encrypt/inline_data flag in
FS_IOC_GETFLAGS like ext4 does.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:45 -07:00
Chao Yu
c807a7cb54 f2fs: remove unneeded F2FS_PROJINHERIT_FL
Now F2FS_FL_USER_VISIBLE and F2FS_FL_USER_MODIFIABLE has included
F2FS_PROJINHERIT_FL, so remove unneeded F2FS_PROJINHERIT_FL when
using visible/modifiable flag macro.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:45 -07:00
Chao Yu
81114baa83 f2fs: don't use GFP_ZERO for page caches
Related to https://lkml.org/lkml/2018/4/8/661

Sometimes, we need to write meta data to new allocated block address,
then we will allocate a zeroed page in inner inode's address space, and
fill partial data in it, and leave other place with zero value which means
some fields are initial status.

There are two inner inodes (meta inode and node inode) setting __GFP_ZERO,
I have just checked them, for both of them, we can avoid using __GFP_ZERO,
and do initialization by ourselves to avoid unneeded/redundant zeroing
from mm.

Cc: <stable@vger.kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:45 -07:00
Yunlei He
241b493d8f f2fs: issue all big range discards in umount process
This patch modify max_requests to UINT_MAX, to issue
all big range discards in umount.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:45 -07:00
Chao Yu
47cca3d62a f2fs: remove redundant block plug
For buffered IO, we don't need to use block plug to cache bio,
for direct IO, generic f2fs_direct_IO has already added block
plug, so let's remove redundant one in .write_iter.

As Yunlei described in his patch:

-f2fs_file_write_iter
  -blk_start_plug
    -__generic_file_write_iter
	...
	  -do_blockdev_direct_IO
	    -blk_start_plug
		...
	    -blk_finish_plug
	...
  -blk_finish_plug

which may conduct performance decrease in our platform

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:45 -07:00
Yunlong Song
8045249829 f2fs: remove unmatched zero_user_segment when convert inline dentry
Since the layout of regular dentry block is different from inline dentry
block, zero_user_segment starting from MAX_INLINE_DATA(dir) is not
correct for regular dentry block, besides, bitmap is already copied and
used, so there is no necessary to zero page at all, so just remove the
zero_user_segment is OK.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:44 -07:00
Chao Yu
59c844088b f2fs: introduce private inode status mapping
Previously, we use generic FS_*_FL defined by vfs to indicate inode status
for each bit of i_flags, so f2fs's flag status definition is tied to vfs'
one, it will be hard for f2fs to reuse bits f2fs never used to indicate
new status..

In order to solve this issue, we introduce private inode status mapping,
Note, for these bits have already been persisted into disk, we should
never change their definition, for other ones, we can remap them for
later new coming status.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:44 -07:00
Jaegeuk Kim
e555da9f31 f2fs: run fstrim asynchronously if runtime discard is on
We don't need to wait for whole bunch of discard candidates in fstrim, since
runtime discard will issue them in idle time.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 10:20:48 -07:00
Chao Yu
cba608493d f2fs: turn down IO priority of discard from background
In order to avoid interfering normal r/w IO, let's turn down IO
priority of discard issued from background.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 08:58:59 -07:00
Chao Yu
377224c471 f2fs: don't split checkpoint in fstrim
Now, we issue discard asynchronously in separated thread instead of in
checkpoint, after that, we won't encounter long latency in checkpoint
due to huge number of synchronous discard command handling, so, we don't
need to split checkpoint to do trim in batch, merge it and obsolete
related sysfs entry.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 08:58:59 -07:00
Jaegeuk Kim
8bb4f2535c f2fs: issue discard commands proactively in high fs utilization
In the high utilization like over 80%, we don't expect huge # of large discard
commands, but do many small pending discards which affects FTL GCs a lot.
Let's issue them in that case.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 08:58:59 -07:00
Jaegeuk Kim
d6290814b0 f2fs: add fsync_mode=nobarrier for non-atomic files
For non-atomic files, this patch adds an option to give nobarrier which
doesn't issue flush commands to the device.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-29 12:04:08 -07:00
Jaegeuk Kim
9a997188ff f2fs: let fstrim issue discard commands in lower priority
The fstrim gathers huge number of large discard commands, and tries to issue
without IO awareness, which results in long user-perceive IO latencies on
READ, WRITE, and FLUSH in UFS. We've observed some of commands take several
seconds due to long discard latency.

This patch limits the maximum size to 2MB per candidate, and check IO congestion
when issuing them to disk.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-29 12:04:08 -07:00
Eric Biggers
e12ee6836a fscrypt: make fscrypt_operations.max_namelen an integer
Now ->max_namelen() is only called to limit the filename length when
adding NUL padding, and only for real filenames -- not symlink targets.
It also didn't give the correct length for symlink targets anyway since
it forgot to subtract 'sizeof(struct fscrypt_symlink_data)'.

Thus, change ->max_namelen from a function to a simple 'unsigned int'
that gives the filesystem's maximum filename length.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-05-20 16:21:03 -04:00
Christoph Hellwig
3f3942aca6 proc: introduce proc_create_single{,_data}
Variants of proc_create{,_data} that directly take a seq_file show
callback and drastically reduces the boilerplate code in the callers.

All trivial callers converted over.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
Al Viro
1e2e547a93 do d_instantiate/unlock_new_inode combinations safely
For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
	lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex.  Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.

	Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode().  All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.

Cc: stable@kernel.org	# 2.6.29 and later
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-11 15:36:37 -04:00
Jaegeuk Kim
5b19d284f5 f2fs: avoid fsync() failure caused by EAGAIN in writepage()
pageout() in MM traslates EAGAIN, so calls handle_write_error()
 -> mapping_set_error() -> set_bit(AS_EIO, ...).
 file_write_and_wait_range() will see EIO error, which is critical
 to return value of fsync() followed by atomic_write failure to user.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-04 10:51:22 -07:00
Jaegeuk Kim
17c500350b f2fs: clear PageError on writepage
This patch clears PageError in some pages tagged by read path, but when we
write the pages with valid contents, writepage should clear the bit likewise
ext4.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-02 14:30:58 -07:00
Jaegeuk Kim
a90a0884ac f2fs: check cap_resource only for data blocks
This patch changes the rule to check cap_resource for data blocks, not inode
or node blocks in order to avoid selinux denial.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-02 14:30:58 -07:00
Jaegeuk Kim
b87078ad3a Revert "f2fs: introduce f2fs_set_page_dirty_nobuffer"
This patch reverts copied f2fs_set_page_dirty_nobuffer to use generic function
for stability.

This reverts commit fe76b796fc.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-02 14:30:57 -07:00
Eric Biggers
ab3835aae6 f2fs: call unlock_new_inode() before d_instantiate()
xfstest generic/429 sometimes hangs on f2fs, caused by a thread being
unable to take a directory's i_rwsem for write in vfs_rmdir().  In the
test, one thread repeatedly creates and removes a directory, and other
threads repeatedly look up a file in the directory.  The bug is that
f2fs_mkdir() calls d_instantiate() before unlock_new_inode(), resulting
in the directory inode being exposed to lookups before it has been fully
initialized.  And with CONFIG_DEBUG_LOCK_ALLOC, unlock_new_inode()
reinitializes ->i_rwsem, corrupting its state when it is already held.

Fix it by calling unlock_new_inode() before d_instantiate().  This
matches what other filesystems do.

Fixes: 57397d86c6 ("f2fs: add inode operations for special inodes")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-02 14:30:57 -07:00
Eric Biggers
6dbb17961f f2fs: refactor read path to allow multiple postprocessing steps
Currently f2fs's ->readpage() and ->readpages() assume that either the
data undergoes no postprocessing, or decryption only.  But with
fs-verity, there will be an additional authenticity verification step,
and it may be needed either by itself, or combined with decryption.

To support this, store a 'struct bio_post_read_ctx' in ->bi_private
which contains a work struct, a bitmask of postprocessing steps that are
enabled, and an indicator of the current step.  The bio completion
routine, if there was no I/O error, enqueues the first postprocessing
step.  When that completes, it continues to the next step.  Pages that
fail any postprocessing step have PageError set.  Once all steps have
completed, pages without PageError set are set Uptodate, and all pages
are unlocked.

Also replace f2fs_encrypted_file() with a new function
f2fs_post_read_required() in places like direct I/O and garbage
collection that really should be testing whether the file needs special
I/O processing, not whether it is encrypted specifically.

This may also be useful for other future f2fs features such as
compression.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-02 14:30:57 -07:00
Eric Biggers
0cb8dae4a0 fscrypt: allow synchronous bio decryption
Currently, fscrypt provides fscrypt_decrypt_bio_pages() which decrypts a
bio's pages asynchronously, then unlocks them afterwards.  But, this
assumes that decryption is the last "postprocessing step" for the bio,
so it's incompatible with additional postprocessing steps such as
authenticity verification after decryption.

Therefore, rename the existing fscrypt_decrypt_bio_pages() to
fscrypt_enqueue_decrypt_bio().  Then, add fscrypt_decrypt_bio() which
decrypts the pages in the bio synchronously without unlocking the pages,
nor setting them Uptodate; and add fscrypt_enqueue_decrypt_work(), which
enqueues work on the fscrypt_read_workqueue.  The new functions will be
used by filesystems that support both fscrypt and fs-verity.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-02 14:30:57 -07:00
Matthew Wilcox
b93b016313 page cache: use xa_lock
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
  Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:39 -07:00
Matthew Wilcox
f6bb2a2c0b xarray: add the xa_lock to the radix_tree_root
This results in no change in structure size on 64-bit machines as it
fits in the padding between the gfp_t and the void *.  32-bit machines
will grow the structure from 8 to 12 bytes.  Almost all radix trees are
protected with (at least) a spinlock, so as they are converted from
radix trees to xarrays, the data structures will shrink again.

Initialising the spinlock requires a name for the benefit of lockdep, so
RADIX_TREE_INIT() now needs to know the name of the radix tree it's
initialising, and so do IDR_INIT() and IDA_INIT().

Also add the xa_lock() and xa_unlock() family of wrappers to make it
easier to use the lock.  If we could rely on -fplan9-extensions in the
compiler, we could avoid all of this syntactic sugar, but that wasn't
added until gcc 4.6.

Link: http://lkml.kernel.org/r/20180313132639.17387-8-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:39 -07:00
Jaegeuk Kim
214c2461a8 f2fs: remain written times to update inode during fsync
This fixes xfstests/generic/392.

The failure was caused by different times between 1) one marked in the last
fsync(2) call and 2) the other given by roll-forward recovery after power-cut.
The reason was that we skipped updating inode block at 1), since its i_size
was recoverable along with 4KB-aligned data writes, which was fixed by:
  "f2fs: fix a wrong condition in f2fs_skip_inode_update"

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-03 18:52:47 -07:00
Yunlong Song
c79d152094 f2fs: make assignment of t->dentry_bitmap more readable
In make_dentry_ptr_block, it is confused with "&" for t->dentry_bitmap
but without "&" for t->dentry, so delete "&" to make code more readable.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-02 20:09:35 -07:00
Jaegeuk Kim
dc7a10ddee f2fs: truncate preallocated blocks in error case
If write is failed, we must deallocate the blocks that we couldn't write.

Cc: stable@vger.kernel.org
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-02 20:09:34 -07:00
Junling Zheng
235831d7dd f2fs: fix a wrong condition in f2fs_skip_inode_update
Fix commit 97dd26ad83 (f2fs: fix wrong AUTO_RECOVER condition).
We should use ~PAGE_MASK to determine whether i_size is aligned to
the f2fs's block size or not.

Signed-off-by: Junling Zheng <zhengjunling@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-02 13:21:51 -07:00
Eric Biggers
53fedcc09c f2fs: reserve bits for fs-verity
Reserve an F2FS feature flag and inode flag for fs-verity.  This is an
in-development feature that is planned be discussed at LSF/MM 2018 [1].
It will provide file-based integrity and authenticity for read-only
files.  Most code will be in a filesystem-independent module, with
smaller changes needed to individual filesystems that opt-in to
supporting the feature.  An early prototype supporting F2FS is available
[2].  Reserving the F2FS on-disk bits for fs-verity will prevent users
of the prototype from conflicting with other new F2FS features.

Note that we're reserving the inode flag in f2fs_inode.i_advise, which
isn't really appropriate since it's not a hint or advice.  But
->i_advise is already being used to hold the 'encrypt' flag; and F2FS's
->i_flags uses the generic FS_* values, so it seems ->i_flags can't be
used for an F2FS-specific flag without additional work to remove the
assumption that ->i_flags uses the generic flags namespace.

[1] https://marc.info/?l=linux-fsdevel&m=151690752225644
[2] https://git.kernel.org/pub/scm/linux/kernel/git/mhalcrow/linux.git/log/?h=fs-verity-dev

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-28 12:56:49 -07:00
Yunlei He
d21b0f238a f2fs: Add a segment type check in inplace write
This patch add a segment type check in IPU, in
case of something wrong with blkadd in dnode.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-27 20:13:54 -07:00
Yunlong Song
2f52052929 f2fs: no need to initialize zero value for GFP_F2FS_ZERO
Since f2fs_inode_info is allocated with flag GFP_F2FS_ZERO, so we do not
need to initialize zero value for its member any more.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-27 20:12:53 -07:00
Chao Yu
780de47cf6 f2fs: don't track new nat entry in nat set
Nat entry set is used only in checkpoint(), and during checkpoint() we
won't flush new nat entry with unallocated address, so we don't need to
add new nat entry into nat set, then nat_entry_set::entry_cnt can
indicate actual entry count we need to flush in checkpoint().

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-27 20:10:29 -07:00
Chao Yu
df033caf51 f2fs: clean up with F2FS_BLK_ALIGN
Clean up F2FS_BYTES_TO_BLK(x + F2FS_BLKSIZE - 1) with F2FS_BLK_ALIGN(x).

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-27 20:09:32 -07:00
Yunlei He
0833721ec3 f2fs: check blkaddr more accuratly before issue a bio
This patch check blkaddr more accuratly before issue a
write or read bio.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-18 23:33:50 -07:00
Ritesh Harjani
02117b8ae9 f2fs: Set GF_NOFS in read_cache_page_gfp while doing f2fs_quota_read
Quota code itself is serializing the operations by taking mutex_lock.
It seems a below deadlock can happen if GF_NOFS is not used in
f2fs_quota_read

__switch_to+0x88
__schedule+0x5b0
schedule+0x78
schedule_preempt_disabled+0x20
__mutex_lock_slowpath+0xdc   		//mutex owner is itself
mutex_lock+0x2c
dquot_commit+0x30			//mutex_lock(&dqopt->dqio_mutex);
dqput+0xe0
__dquot_drop+0x80
dquot_drop+0x48
f2fs_evict_inode+0x218
evict+0xa8
dispose_list+0x3c
prune_icache_sb+0x58
super_cache_scan+0xf4
do_shrink_slab+0x208
shrink_slab.part.40+0xac
shrink_zone+0x1b0
do_try_to_free_pages+0x25c
try_to_free_pages+0x164
__alloc_pages_nodemask+0x534
do_read_cache_page+0x6c
read_cache_page+0x14
f2fs_quota_read+0xa4
read_blk+0x54
find_tree_dqentry+0xe4
find_tree_dqentry+0xb8
find_tree_dqentry+0xb8
find_tree_dqentry+0xb8
qtree_read_dquot+0x68
v2_read_dquot+0x24
dquot_acquire+0x5c			// mutex_lock(&dqopt->dqio_mutex);
dqget+0x238
__dquot_initialize+0xd4
dquot_initialize+0x10
dquot_file_open+0x34
f2fs_file_open+0x6c
do_dentry_open+0x1e4
vfs_open+0x6c
path_openat+0xa20
do_filp_open+0x4c
do_sys_open+0x178

Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-18 23:31:28 -07:00
Sheng Yong
ff62af200b f2fs: introduce a new mount option test_dummy_encryption
This patch introduces a new mount option `test_dummy_encryption'
to allow fscrypt to create a fake fscrypt context. This is used
by xfstests.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 14:02:57 +09:00
Sheng Yong
b7c409deda f2fs: introduce F2FS_FEATURE_LOST_FOUND feature
This patch introduces a new feature, F2FS_FEATURE_LOST_FOUND, which
is set by mkfs. mkfs creates a directory named lost+found, which saves
unreachable files. If fsck finds a file which has no parent, or its
parent is removed by fsck, the file will be placed under lost+found
directory by fsck.

lost+found directory could not be encrypted. As a result, the root
directory cannot be encrypted too. So if LOST_FOUND feature is enabled,
let's avoid to encrypt root directory.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 14:02:47 +09:00
Qiuyang Sun
da5ce874f8 f2fs: release locks before return in f2fs_ioc_gc_range()
Currently, we will leave the kernel with locks still held when the gc_range
is invalid. This patch fixes the bug.

Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:58:16 +09:00
Jaegeuk Kim
bb1105e479 f2fs: align memory boundary for bitops
For example, in arm64, free_nid_bitmap should be aligned to word size in order
to use bit operations.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:39 +09:00
Chao Yu
c56675750d f2fs: remove unneeded set_cold_node()
When setting COLD_BIT_SHIFT flag in node block, we only need to call
set_cold_node() in new_node_page() and recover_inode_page() during
node page initialization. So remove unneeded set_cold_node() in other
places.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:33 +09:00
Hyunchul Lee
b91050a80c f2fs: add nowait aio support
This patch adds nowait aio support[1].

Return EAGAIN if any of the following checks fail for direct I/O:
  - i_rwsem is not lockable
  - Blocks are not allocated at the write location

And xfstests generic/471 is passed.

 [1]: 6be96d "Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT"

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:32 +09:00
Chao Yu
63189b7859 f2fs: wrap all options with f2fs_sb_info.mount_opt
This patch merges miscellaneous mount options into struct f2fs_mount_info,
After this patch, once we add new mount option, we don't need to worry
about recovery of it in remount_fs(), since we will recover the
f2fs_sb_info.mount_opt including all options.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:30 +09:00
Yunlei He
5d7881cadf f2fs: Don't overwrite all types of node to keep node chain
Currently, we enable node SSR by default, and mixed
different types of node segment to do SSR more intensively.
Although reuse warm node is not allowed, warm node chain
will be destroyed by errors introduced by other types
node chain. So we'd better forbid reusing all types
of node to keep warm node chain.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:29 +09:00
Junling Zheng
93cf93f17c f2fs: introduce mount option for fsync mode
Commit "0a007b97aad6"(f2fs: recover directory operations by fsync)
fixed xfstest generic/342 case, but it also increased the written
data and caused the performance degradation. In most cases, there's
no need to do so heavy fsync actually.

So we introduce new mount option "fsync_mode={posix,strict}" to
control the policy of fsync. "fsync_mode=posix" is set by default,
and means that f2fs uses a light fsync, which follows POSIX semantics.
And "fsync_mode=strict" means that it's a heavy fsync, which behaves
in line with xfs, ext4 and btrfs, where generic/342 will pass, but
the performance will regress.

Signed-off-by: Junling Zheng <zhengjunling@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:28 +09:00
Chao Yu
eea52882d8 f2fs: fix to restore old mount option in ->remount_fs
This patch fixes to restore old mount option once we encounter failure
in ->remount_fs.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:27 +09:00
Chao Yu
deeedd71e3 f2fs: wrap sb_rdonly with f2fs_readonly
Use f2fs_readonly to wrap sb_rdonly for cleanup, and spread it in
all places.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:26 +09:00
Jaegeuk Kim
162b27aec9 f2fs: avoid selinux denial on CAP_SYS_RESOURCE
This fixes CAP_SYS_RESOURCE denial of selinux when using resgid, since it
seems selinux reports it at the first place, but mostly we don't need to
check this condition first.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:55:43 +09:00
Chao Yu
b6a06cbbb5 f2fs: support hot file extension
This patch supports to recognize hot file extension in f2fs, so that we
can allocate proper hot segment location for its data, which can lead to
better hot/cold seperation in filesystem.

In addition, we changes a bit on query/add/del operation method for
extension_list sysfs entry as below:

- Query: cat /sys/fs/f2fs/<disk>/extension_list
- Add: echo 'extension' > /sys/fs/f2fs/<disk>/extension_list
- Del: echo '!extension' > /sys/fs/f2fs/<disk>/extension_list
- Add: echo '[h/c]extension' > /sys/fs/f2fs/<disk>/extension_list
- Del: echo '[h/c]!extension' > /sys/fs/f2fs/<disk>/extension_list
- [h] means add/del hot file extension
- [c] means add/del cold file extension

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:57 +09:00
Chao Yu
1dc0f8991d f2fs: fix to avoid race in between atomic write and background GC
Sqlite user			Background GC
				- move_data_block
				  : move page #1
				 - f2fs_is_atomic_file
- f2fs_ioc_start_atomic_write
- f2fs_ioc_commit_atomic_write
 - commit_inmem_pages
   : commit page #1 & set node #2 dirty
				 - f2fs_submit_page_write
				  - f2fs_update_data_blkaddr
				   - set_page_dirty
				     : set node #2 dirty
 - f2fs_do_sync_file
  - fsync_node_pages
   : commit node #1 & node #2, then sudden power-cut

In a race case, we may check FI_ATOMIC_FILE flag before starting atomic
write flow, then we will commit meta data before data with reversed
order, after a sudden pow-cut, database transaction will be inconsistent.

So we'd better to exclude gc/atomic_write to each other by using lock
instead of flag checking.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:55 +09:00
Jaegeuk Kim
b27bc8091c f2fs: do gc in greedy mode for whole range if gc_urgent mode is set
Otherwise, f2fs conducts GC on 8GB range only based on slow cost-benefit.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:54 +09:00
Jaegeuk Kim
dee02f0d62 f2fs: issue discard aggressively in the gc_urgent mode
This patch avoids to skip discard commands when user sets gc_urgent mode.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:53 +09:00
Jaegeuk Kim
782911f491 f2fs: set readdir_ra by default
It gives general readdir improvement.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:52 +09:00
Jaegeuk Kim
84b89e5d94 f2fs: add auto tuning for small devices
If f2fs is running on top of very small devices, it's worth to avoid abusing
free LBAs. In order to achieve that, this patch introduces some parameter
tuning.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:51 +09:00
Jaegeuk Kim
079396270b f2fs: add mount option for segment allocation policy
This patch adds an mount option, "alloc_mode=%s" having two options, "default"
and "reuse".

In "alloc_mode=reuse" case, f2fs starts to allocate segments from 0'th segment
all the time to reassign segments. It'd be useful for small-sized eMMC parts.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:50 +09:00
Jaegeuk Kim
69babac019 f2fs: don't stop GC if GC is contended
Let's do GC as much as possible, while gc_urgent is set.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:49 +09:00
Chao Yu
846ae671ad f2fs: expose extension_list sysfs entry
This patch adds a sysfs entry 'extension_list' to support
query/add/del item in extension list.

Query:
cat /sys/fs/f2fs/<device>/extension_list

Add:
echo 'extension' > /sys/fs/f2fs/<device>/extension_list

Del:
echo '!extension' > /sys/fs/f2fs/<device>/extension_list

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:48 +09:00
Chao Yu
17cd07ae95 f2fs: fix to set KEEP_SIZE bit in f2fs_zero_range
As Jayashree Mohan reported:

A simple workload to reproduce this would be :
1. create foo
2. Write (8K - 16K)  // foo size = 16K now
3. fsync()
4. falloc zero_range , keep_size (4202496 - 4210688) // foo size must be 16K
5. fdatasync()
Crash now

On recovery, we see that the file size is 4210688 and not 16K, which
violates the semantics of keep_size flag. We have a test case to
reproduce this using CrashMonkey on 4.15 kernel. Try this out by
simply running :
 ./c_harness -f /dev/sda -d /dev/cow_ram0 -t f2fs -e 102400  -P -v
 tests/generic_468_zero.so

The root cause is that we miss to set KEEP_SIZE bit correctly in zero_range
when zeroing block cross EOF with FALLOC_FL_KEEP_SIZE, let's fix this
missing case.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:47 +09:00
Chao Yu
d0d3f1b329 f2fs: introduce sb_lock to make encrypt pwsalt update exclusive
f2fs_super_block.encrypt_pw_salt can be udpated and persisted
concurrently, result in getting different pwsalt in separated
threads, so let's introduce sb_lock to exclude concurrent
accessers.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:46 +09:00
Colin Ian King
8fe326cb99 f2fs: remove redundant initialization of pointer 'p'
Pointer p is initialized with a value that is never read and is later
re-assigned a new value, hence the initialization is redundant and can
be removed.

Cleans up clang warning:
fs/f2fs/extent_cache.c:463:19: warning: Value stored to 'p' during
its initialization is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:45 +09:00
Gao Xiang
46706d5917 f2fs: flush cp pack except cp pack 2 page at first
Previously, we attempt to flush the whole cp pack in a single bio,
however, when suddenly powering off at this time, we could get into
an extreme scenario that cp pack 1 page and cp pack 2 page are updated
and latest, but payload or current summaries are still partially
outdated. (see reliable write in the UFS specification)

This patch submits the whole cp pack except cp pack 2 page at first,
and then writes the cp pack 2 page with an extra independent
bio with pre-io barrier.

Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:43 +09:00
Sheng Yong
ccd31cb28f f2fs: clean up f2fs_sb_has_xxx functions
This patch introduces F2FS_FEATURE_FUNCS to clean up the definitions of
different f2fs_sb_has_xxx functions.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:42 +09:00
Tiezhu Yang
3bb09a0e7e f2fs: remove redundant check of page type when submit bio
This patch removes redundant check of page type when submit bio to
make the logic more clear.

Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:41 +09:00
Chao Yu
fb0e72c8b9 f2fs: fix to handle looped node chain during recovery
There is no checksum in node block now, so bit-transition from hardware
can make node_footer.next_blkaddr being corrupted w/o any detection,
result in node chain becoming looped one.

For this condition, during recovery, in order to avoid running into dead
loop, let's detect it and just skip out.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:40 +09:00
Jaegeuk Kim
0f9ec2a8f6 f2fs: handle quota for orphan inodes
This is to detect dquot_initialize errors early from evict_inode
for orphan inodes.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:38 +09:00
Hyunchul Lee
f2e703f9a3 f2fs: support passing down write hints to block layer with F2FS policy
Add 'whint_mode=fs-based' mount option. In this mode, F2FS passes
down write hints with its policy.

* whint_mode=fs-based. F2FS passes down hints with its policy.

User                  F2FS                     Block
----                  ----                     -----
                      META                     WRITE_LIFE_MEDIUM;
                      HOT_NODE                 WRITE_LIFE_NOT_SET
                      WARM_NODE                "
                      COLD_NODE                WRITE_LIFE_NONE
ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
extension list        "                        "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_LONG
WRITE_LIFE_NONE       "                        "
WRITE_LIFE_MEDIUM     "                        "
WRITE_LIFE_LONG       "                        "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG

Many thanks to Chao Yu and Jaegeuk Kim for comments to
implement this patch.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:36 +09:00
Hyunchul Lee
0cdd319539 f2fs: support passing down write hints given by users to block layer
Add the 'whint_mode' mount option that controls which write
hints are passed down to block layer. There are "off" and
"user-based" mode. The default mode is "off".

1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.

2) whint_mode=user-based. F2FS tries to pass down hints given
by users.

User                  F2FS                     Block
----                  ----                     -----
                      META                     WRITE_LIFE_NOT_SET
                      HOT_NODE                 "
                      WARM_NODE                "
                      COLD_NODE                "
ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
extension list        "                        "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        "
WRITE_LIFE_MEDIUM     "                        "
WRITE_LIFE_LONG       "                        "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG

Many thanks to Chao Yu and Jaegeuk Kim for comments to
implement this patch.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: avoid build warning]
[Chao Yu: fix to restore whint_mode in ->remount_fs]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:35 +09:00
Chao Yu
cd36d7a17f f2fs: fix to clear CP_TRIMMED_FLAG
Once CP_TRIMMED_FLAG is set, after a reboot, we will never issue discard
before LBA becomes invalid again, fix it by clearing the flag in
checkpoint without CP_TRIMMED reason.

Fixes: 1f43e2ad7b ("f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:34 +09:00
Chao Yu
199bc3fef2 f2fs: support large nat bitmap
Previously, we will store all nat version bitmap in checkpoint pack block,
so our total node entry number has a limitation which caused total node
number can not exceed (3900 * 8) block * 455 node/block = 14196000. So
that once user wants to create more nodes in large size image, it becomes
a bottleneck, that's unreasonable.

This patch detects the new layout of nat/sit version bitmap in image in
order to enable supporting large nat bitmap.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:33 +09:00
Chao Yu
bf617f7a92 f2fs: fix to check extent cache in f2fs_drop_extent_tree
If noextent_cache mount option is on, we will never initialize extent tree
in inode, but still we're going to access it in f2fs_drop_extent_tree,
result in kernel panic as below:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
 IP: _raw_write_lock+0xc/0x30
 Call Trace:
  ? f2fs_drop_extent_tree+0x41/0x70 [f2fs]
  f2fs_fallocate+0x5a0/0xdd0 [f2fs]
  ? common_file_perm+0x47/0xc0
  ? apparmor_file_permission+0x1a/0x20
  vfs_fallocate+0x15b/0x290
  SyS_fallocate+0x44/0x70
  do_syscall_64+0x6e/0x160
  entry_SYSCALL64_slow_path+0x25/0x25

This patch fixes to check extent cache status before using in
f2fs_drop_extent_tree.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:32 +09:00
Chao Yu
4d817ae099 f2fs: restrict inline_xattr_size configuration
This patch limits to enable inline_xattr_size mount option only if
both extra_attr and flexible_inline_xattr feature is on in current
image.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:31 +09:00
Yunlong Song
b94929d975 f2fs: fix heap mode to reset it back
Commit 7a20b8a61e ("f2fs: allocate node
and hot data in the beginning of partition") introduces another mount
option, heap, to reset it back. But it does not do anything for heap
mode, so fix it.

Cc: stable@vger.kernel.org
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:30 +09:00
Sheng Yong
0964fc1a82 f2fs: fix potential corruption in area before F2FS_SUPER_OFFSET
sb_getblk does not guarantee the buffer head is uptodate. If bh is not
uptodate, the data (may be used as boot code) in area before
F2FS_SUPER_OFFSET may get corrupted when super block is committed.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:29 +09:00
Yunlong Song
bdbc90fa55 f2fs: don't put dentry page in pagecache into highmem
Previous dentry page uses highmem, which will cause panic in platforms
using highmem (such as arm), since the address space of dentry pages
from highmem directly goes into the decryption path via the function
fscrypt_fname_disk_to_usr. But sg_init_one assumes the address is not
from highmem, and then cause panic since it doesn't call kmap_high but
kunmap_high is triggered at the end. To fix this problem in a simple
way, this patch avoids to put dentry page in pagecache into highmem.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:03 +09:00
Linus Torvalds
3462ac5703 Refactor support for encrypted symlinks to move common code to fscrypt.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlp2R3AACgkQ8vlZVpUN
 gaOIdAgApEdlFR2Gf93z2hMj5HxVL5rjkuPJVtVkKu0eH2HMQJyxNmjymrRfuFmM
 8W1CrEvVKi5Aj6r8q4KHIdVV247Ya0SVEhLwKM0LX4CvlZUXmwgCmZ/MPDTXA1eq
 C4vPVuJAuSNGNVYDlDs3+NiMHINGNVnBVQQFSPBP9P+iNWPD7o486712qaF8maVn
 RbfbQ2rWtOIRdlAOD1U5WqgQku59lOsmHk2pc0+X4LHCZFpMoaO80JVjENPAw+BF
 daRt6TX+WljMyx6DRIaszqau876CJhe/tqlZcCLOkpXZP0jJS13yodp26dVQmjCh
 w8YdiY7uHK2D+S/8eyj7h7DIwzu3vg==
 =ZjQP
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Refactor support for encrypted symlinks to move common code to fscrypt"

Ted also points out about the merge:
 "This makes the f2fs symlink code use the fscrypt_encrypt_symlink()
  from the fscrypt tree. This will end up dropping the kzalloc() ->
  f2fs_kzalloc() change, which means the fscrypt-specific allocation
  won't get tested by f2fs's kmalloc error injection system; which is
  fine"

* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: (26 commits)
  fscrypt: fix build with pre-4.6 gcc versions
  fscrypt: remove 'ci' parameter from fscrypt_put_encryption_info()
  fscrypt: document symlink length restriction
  fscrypt: fix up fscrypt_fname_encrypted_size() for internal use
  fscrypt: define fscrypt_fname_alloc_buffer() to be for presented names
  fscrypt: calculate NUL-padding length in one place only
  fscrypt: move fscrypt_symlink_data to fscrypt_private.h
  fscrypt: remove fscrypt_fname_usr_to_disk()
  ubifs: switch to fscrypt_get_symlink()
  ubifs: switch to fscrypt ->symlink() helper functions
  ubifs: free the encrypted symlink target
  f2fs: switch to fscrypt_get_symlink()
  f2fs: switch to fscrypt ->symlink() helper functions
  ext4: switch to fscrypt_get_symlink()
  ext4: switch to fscrypt ->symlink() helper functions
  fscrypt: new helper function - fscrypt_get_symlink()
  fscrypt: new helper functions for ->symlink()
  fscrypt: trim down fscrypt.h includes
  fscrypt: move fscrypt_is_dot_dotdot() to fs/crypto/fname.c
  fscrypt: move fscrypt_valid_enc_modes() to fscrypt_private.h
  ...
2018-02-04 10:43:12 -08:00
Linus Torvalds
255442c938 Documentation updates for 4.16. New stuff includes refcount_t
documentation, errseq documentation, kernel-doc support for nested
 structure definitions, the removal of lots of crufty kernel-doc support for
 unused formats, SPDX tag documentation, the beginnings of a manual for
 subsystem maintainers, and lots of fixes and updates.
 
 As usual, some of the changesets reach outside of Documentation/ to effect
 kerneldoc comment fixes.  It also adds the new LICENSES directory, of which
 Thomas promises I do not need to be the maintainer.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJab11TAAoJEI3ONVYwIuV6i1UP/1LgGPHW9Ygq5qaLFbReZd/u
 Mx/orrhHX0PdkbCCE+CbL8Vm1m4UKFDTBdlpk3s542zxeeG0ZBXuTnvq4Kyk+cTN
 p4/vsIEzk/Ih13/glGE5MlV+EjiEK+8hK69TIUj7bAyuHmpzofjRz9/1M6RLDGDC
 HY6UI58AXG0yOQWMWCGRMYpQAFUGij2equ7Doe1ugXRq14dx7V4RsOhI140iRk7t
 bquAq1rS2fXniiuPFmLBUe4dWW28isVa/Vl/aXcaWQDKMyT0OLhjOMW36wWKqtPi
 WdVCpHv1NLZNyZZr9S3kvfOwW+BUqpEzfVwssyBLW4h0tsnIx0U0HVhSTY8/TvFZ
 QD9yCSana4LB/e5CHXIX5lBHbjHxf+rETXqVV4MgwDaMvM3mCo4X6WUTJDmZADo6
 vQISEKeb4su5uWAbc9T9xwRSLhZnFVdJ/QuYdNQ5+EpFJYLhzQ9eBvEz6JstSIXL
 p9ASBiPNY3ulpVZ8q0JOHJRBhq5mHJH6Dy8achzbILy2l/ZI4b8lJ53mw9II04cp
 puF96E6HpvuZ8Tgjjrg9U3ZdxXNrUgc/tjk2ZDkyTglk1XF2jKSq2tiNSZ3oLrJm
 XqJPnpCeyJM5UDvwkIBzgC41WEHwe8uvoNbUnc4X7UJSZegFzcSLQXf5qaprHS5k
 XeQ7sbd+S+jzVVjFi0W5
 =Z15Z
 -----END PGP SIGNATURE-----

Merge tag 'docs-4.16' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "Documentation updates for 4.16.

  New stuff includes refcount_t documentation, errseq documentation,
  kernel-doc support for nested structure definitions, the removal of
  lots of crufty kernel-doc support for unused formats, SPDX tag
  documentation, the beginnings of a manual for subsystem maintainers,
  and lots of fixes and updates.

  As usual, some of the changesets reach outside of Documentation/ to
  effect kerneldoc comment fixes. It also adds the new LICENSES
  directory, of which Thomas promises I do not need to be the
  maintainer"

* tag 'docs-4.16' of git://git.lwn.net/linux: (65 commits)
  linux-next: docs-rst: Fix typos in kfigure.py
  linux-next: DOC: HWPOISON: Fix path to debugfs in hwpoison.txt
  Documentation: Fix misconversion of #if
  docs: add index entry for networking/msg_zerocopy
  Documentation: security/credentials.rst: explain need to sort group_list
  LICENSES: Add MPL-1.1 license
  LICENSES: Add the GPL 1.0 license
  LICENSES: Add Linux syscall note exception
  LICENSES: Add the MIT license
  LICENSES: Add the BSD-3-clause "Clear" license
  LICENSES: Add the BSD 3-clause "New" or "Revised" License
  LICENSES: Add the BSD 2-clause "Simplified" license
  LICENSES: Add the LGPL-2.1 license
  LICENSES: Add the LGPL 2.0 license
  LICENSES: Add the GPL 2.0 license
  Documentation: Add license-rules.rst to describe how to properly identify file licenses
  scripts: kernel_doc: better handle show warnings logic
  fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at
  doc: md: Fix a file name to md-fault.c in fault-injection.txt
  errseq: Add to documentation tree
  ...
2018-01-31 19:25:25 -08:00
Linus Torvalds
3da90b159b f2fs-for-4.16-rc1
In this round, we've followed up to support some generic features such as
 cgroup, block reservation, linking fscrypt_ops, delivering write_hints,
 and some ioctls. And, we could fix some corner cases in terms of power-cut
 recovery and subtle deadlocks.
 
 Enhancement:
  - bitmap operations to handle NAT blocks
  - readahead to improve readdir speed
  - switch to use fscrypt_*
  - apply write hints for direct IO
  - add reserve_root=%u,resuid=%u,resgid=%u to reserve blocks for root/uid/gid
  - modify b_avail and b_free to consider root reserved blocks
  - support cgroup writeback
  - support FIEMAP_FLAG_XATTR for fibmap
  - add F2FS_IOC_PRECACHE_EXTENTS to pre-cache extents
  - add F2FS_IOC_{GET/SET}_PIN_FILE to pin LBAs for data blocks
  - support inode creation time
 
 Bug fix:
  - sysfile-based quota operations
  - memory footprint accounting
  - allow to write data on partial preallocation case
  - fix deadlock case on fallocate
  - fix to handle fill_super errors
  - fix missing inode updates of fsync'ed file
  - recover renamed file which was fsycn'ed before
  - drop inmemory pages in corner error case
  - keep last_disk_size correctly
  - recover missing i_inline flags during roll-forward
 
 Various clean-up patches were added as well.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlpw7uUACgkQQBSofoJI
 UNIDEA//d0ScVxEWN6s2Qba5wa2XnTcIi3upzB61gYAkeKdy/8zan8Dp7yaZWh+h
 hAMG7iuaNZS6fwkNvF4SRf3zChIDDXofJneQVE5x3eyHpYCJTIBRijV/dtZ1Fzzd
 q+hrl3JxMKx1jMezh2tDxWnznL8Fgu34cz3UmcF/YBniM99nNP7ri4HsDB+r/691
 Yo/Z1nN3VEJRrHiIfTK2eR8LmlvUFWBq21R9mszvPTYpUz3GGJ5bInY1r92nzMC9
 1EDk4e0RFvV2p/CJPmFiOGMDVUb9LJ/J8icXF5FlQ5eE6DNIP6Q4609MJD29sVtE
 mDC11hV15QhKt+huazn/nivcPtwWgjUdyzw6EYJLtUdEaugQarA1iORR2ZNNBxOX
 ZmocX++rby4oHMd+Tl618jcRYOS3hUhHncgw8IxDJH9Kh1vI/4z2wdCfkucH5L/u
 eG5+1qMehE4vnSB2nMvEJdCERR3yHc5qZDUfMZs/e7jHjIUdT399kkAprljdEDSc
 QVlXeM5rdmiILVs9fY6gVgr6BgjzYB+DhrOJ3jQ8xrIOdcoqeN14RduOvZpAdiNr
 IwQgPxNQm8WBJQxomso7ySWotYmGIxWOPjqWtSyfL7TS4Jdiwf7eoo3pUDc8sg7A
 xi6zvDjdT4hmkqLKx71As3V82g6RmY4Ydcyk2XqnBjF26g2Kb68=
 =jZqE
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've followed up to support some generic features such
  as cgroup, block reservation, linking fscrypt_ops, delivering
  write_hints, and some ioctls. And, we could fix some corner cases in
  terms of power-cut recovery and subtle deadlocks.

  Enhancements:
   - bitmap operations to handle NAT blocks
   - readahead to improve readdir speed
   - switch to use fscrypt_*
   - apply write hints for direct IO
   - add reserve_root=%u,resuid=%u,resgid=%u to reserve blocks for root/uid/gid
   - modify b_avail and b_free to consider root reserved blocks
   - support cgroup writeback
   - support FIEMAP_FLAG_XATTR for fibmap
   - add F2FS_IOC_PRECACHE_EXTENTS to pre-cache extents
   - add F2FS_IOC_{GET/SET}_PIN_FILE to pin LBAs for data blocks
   - support inode creation time

  Bug fixs:
   - sysfile-based quota operations
   - memory footprint accounting
   - allow to write data on partial preallocation case
   - fix deadlock case on fallocate
   - fix to handle fill_super errors
   - fix missing inode updates of fsync'ed file
   - recover renamed file which was fsycn'ed before
   - drop inmemory pages in corner error case
   - keep last_disk_size correctly
   - recover missing i_inline flags during roll-forward

  Various clean-up patches were added as well"

* tag 'f2fs-for-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (72 commits)
  f2fs: support inode creation time
  f2fs: rebuild sit page from sit info in mem
  f2fs: stop issuing discard if fs is readonly
  f2fs: clean up duplicated assignment in init_discard_policy
  f2fs: use GFP_F2FS_ZERO for cleanup
  f2fs: allow to recover node blocks given updated checkpoint
  f2fs: recover some i_inline flags
  f2fs: correct removexattr behavior for null valued extended attribute
  f2fs: drop page cache after fs shutdown
  f2fs: stop gc/discard thread after fs shutdown
  f2fs: hanlde error case in f2fs_ioc_shutdown
  f2fs: split need_inplace_update
  f2fs: fix to update last_disk_size correctly
  f2fs: kill F2FS_INLINE_XATTR_ADDRS for cleanup
  f2fs: clean up error path of fill_super
  f2fs: avoid hungtask when GC encrypted block if io_bits is set
  f2fs: allow quota to use reserved blocks
  f2fs: fix to drop all inmem pages correctly
  f2fs: speed up defragment on sparse file
  f2fs: support F2FS_IOC_PRECACHE_EXTENTS
  ...
2018-01-30 19:07:32 -08:00
Chao Yu
1c1d35df71 f2fs: support inode creation time
This patch adds creation time field in inode layout to support showing
kstat.btime in ->statx.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-25 14:10:39 -08:00
Yunlei He
068c3cd858 f2fs: rebuild sit page from sit info in mem
This patch rebuild sit page from sit info in mem instead
of issue a read io.

I test this method and the result is as below:

Pre:
 mmc_perf_test-12061 [001] ...1   976.819992: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [001] ...1   976.856446: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [003] ...1   998.976946: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [003] ...1   999.023269: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [003] ...1  1022.060772: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [003] ...1  1022.111034: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [002] ...1  1070.127643: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [003] ...1  1070.187352: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [003] ...1  1095.942124: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [003] ...1  1095.995975: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [003] ...1  1122.535091: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [003] ...1  1122.586521: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [001] ...1  1147.897487: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [001] ...1  1147.959438: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [003] ...1  1177.926951: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [002] ...1  1177.976823: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [002] ...1  1204.176087: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [002] ...1  1204.239046: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit

Some sit flush consume more than 50ms.

Now:
mmc_perf_test-2187  [007] ...1   196.840684: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [007] ...1   196.841258: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [007] ...1   219.430582: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [007] ...1   219.431144: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [002] ...1   243.638678: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [000] ...1   243.638980: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [002] ...1   265.392180: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [002] ...1   265.392245: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [000] ...1   290.309051: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [000] ...1   290.309116: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [003] ...1   317.144209: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [003] ...1   317.145913: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [005] ...1   343.224954: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [005] ...1   343.225574: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [000] ...1   370.239846: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [000] ...1   370.241138: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [001] ...1   397.029043: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [001] ...1   397.030750: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [003] ...1   425.386377: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [003] ...1   425.387735: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit

Most sit flush consume no more than 1ms.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-25 10:44:34 -08:00
Chao Yu
3b60d802d9 f2fs: stop issuing discard if fs is readonly
If filesystem is readonly, stop to issue discard in daemon.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-25 10:40:10 -08:00
Chao Yu
6819b884e0 f2fs: clean up duplicated assignment in init_discard_policy
Remove duplicated codes of assignment for .max_requests and .io_aware_gran.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-25 10:40:01 -08:00
Chao Yu
2882d34310 f2fs: use GFP_F2FS_ZERO for cleanup
Clean up codes with GFP_F2FS_ZERO, no logic changes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-25 10:39:49 -08:00
Jaegeuk Kim
f236792311 f2fs: allow to recover node blocks given updated checkpoint
If fsck.f2fs changes crc, we have no way to recover some inode blocks by roll-
forward recovery. Let's relax the condition to recover them.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:59 -08:00
Jaegeuk Kim
37a086f015 f2fs: recover some i_inline flags
This fixes lost i_inline flags during roll-forward.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:58 -08:00
Daeho Jeong
b2c4692bc2 f2fs: correct removexattr behavior for null valued extended attribute
__vfs_removexattr() transfers "NULL" value to the setxattr handler of
the f2fs filesystem in order to remove the extended attribute. But,
__f2fs_setxattr() just ignores the removal request when the value of
the extended attribute is already NULL. We have to remove the extended
attribute itself even if the value of that is already NULL.

We can reporduce this bug with the below:

1. touch file
2. setfattr -n "user.foo" file
3. setfattr -x "user.foo" file
4. getfattr -d file
> user.foo

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Tested-by: Hobin Woo <hobin.woo@samsung.com>
Tested-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:57 -08:00
Chao Yu
db198ae0f8 f2fs: drop page cache after fs shutdown
Don't remain dirtied page cache in f2fs after shutdown, it can mitigate
memory pressure of whole system, in order to keep other modules working
properly.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:56 -08:00
Chao Yu
7950e9ac63 f2fs: stop gc/discard thread after fs shutdown
Once filesystem shuts down, daemons like gc/discard thread should be
aware of it, and do exit, in addtion, drop all cached pending discard
commands and turn off real-time discard mode.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:55 -08:00
Chao Yu
d027c48447 f2fs: hanlde error case in f2fs_ioc_shutdown
This patch makes f2fs_ioc_shutdown handling error case correctly.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:54 -08:00
Chao Yu
bb9e3bb8db f2fs: split need_inplace_update
This patch splits need_inplace_update to two functions:
a. should_update_inplace() includes all conditions that we must use IPU.
b. should_update_outplace() includes all conditions that we must use OPU.

So that, in f2fs_ioc_set_pin_file() and f2fs_defragment_range(), we can
use corresponding function to check whether we can trigger OPU/IPU or not.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:53 -08:00
Chao Yu
eb4497975e f2fs: fix to update last_disk_size correctly
This patch fixes to update last_disk_size only when writing out page
successfully.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:52 -08:00
Chao Yu
b323fd28bb f2fs: kill F2FS_INLINE_XATTR_ADDRS for cleanup
Use get_inline_xattr_addrs directly instead of F2FS_INLINE_XATTR_ADDRS.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:51 -08:00
Chao Yu
d7997e6368 f2fs: clean up error path of fill_super
This patch cleans up error path of fille_super to avoid unneeded
release step.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:50 -08:00
Sheng Yong
a9d572c755 f2fs: avoid hungtask when GC encrypted block if io_bits is set
When io_bits is set, GCing encrypted block may hit the following hungtask.
Since io_bits requires aligned block address, f2fs_submit_page_write may
return -EAGAIN if new_blkaddr does not satisify io_bits alignment. As a
result, the encrypted page will never be writtenback.

This patch makes move_data_block aware the EAGAIN error and cancel the
writeback.

[  246.751371] INFO: task kworker/u4:4:797 blocked for more than 90 seconds.
[  246.752423]       Not tainted 4.15.0-rc4+ #11
[  246.754176] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.755336] kworker/u4:4    D25448   797      2 0x80000000
[  246.755597] Workqueue: writeback wb_workfn (flush-7:0)
[  246.755616] Call Trace:
[  246.755695]  ? __schedule+0x322/0xa90
[  246.755761]  ? blk_init_request_from_bio+0x120/0x120
[  246.755773]  ? pci_mmcfg_check_reserved+0xb0/0xb0
[  246.755801]  ? __radix_tree_create+0x19e/0x200
[  246.755813]  ? delete_node+0x136/0x370
[  246.755838]  schedule+0x43/0xc0
[  246.755904]  io_schedule+0x17/0x40
[  246.755939]  wait_on_page_bit_common+0x17b/0x240
[  246.755950]  ? wake_page_function+0xa0/0xa0
[  246.755961]  ? add_to_page_cache_lru+0x160/0x160
[  246.755972]  ? page_cache_tree_insert+0x170/0x170
[  246.755983]  ? __lru_cache_add+0x96/0xb0
[  246.756086]  __filemap_fdatawait_range+0x14f/0x1c0
[  246.756097]  ? wait_on_page_bit_common+0x240/0x240
[  246.756120]  ? __wake_up_locked_key_bookmark+0x20/0x20
[  246.756167]  ? wait_on_all_pages_writeback+0xc9/0x100
[  246.756179]  ? __remove_ino_entry+0x120/0x120
[  246.756192]  ? wait_woken+0x100/0x100
[  246.756204]  filemap_fdatawait_range+0x9/0x20
[  246.756216]  write_checkpoint+0x18a1/0x1f00
[  246.756254]  ? blk_get_request+0x10/0x10
[  246.756265]  ? cpumask_next_and+0x43/0x60
[  246.756279]  ? f2fs_sync_inode_meta+0x160/0x160
[  246.756289]  ? remove_element.isra.4+0xa0/0xa0
[  246.756300]  ? __put_compound_page+0x40/0x40
[  246.756310]  ? f2fs_sync_fs+0xec/0x1c0
[  246.756320]  ? f2fs_sync_fs+0x120/0x1c0
[  246.756329]  f2fs_sync_fs+0x120/0x1c0
[  246.756357]  ? trace_event_raw_event_f2fs__page+0x260/0x260
[  246.756393]  ? ata_build_rw_tf+0x173/0x410
[  246.756397]  f2fs_balance_fs_bg+0x198/0x390
[  246.756405]  ? drop_inmem_page+0x230/0x230
[  246.756415]  ? ahci_qc_prep+0x1bb/0x2e0
[  246.756418]  ? ahci_qc_issue+0x1df/0x290
[  246.756422]  ? __accumulate_pelt_segments+0x42/0xd0
[  246.756426]  ? f2fs_write_node_pages+0xd1/0x380
[  246.756429]  f2fs_write_node_pages+0xd1/0x380
[  246.756437]  ? sync_node_pages+0x8f0/0x8f0
[  246.756440]  ? update_curr+0x53/0x220
[  246.756444]  ? __accumulate_pelt_segments+0xa2/0xd0
[  246.756448]  ? __update_load_avg_se.isra.39+0x349/0x360
[  246.756452]  ? do_writepages+0x2a/0xa0
[  246.756456]  do_writepages+0x2a/0xa0
[  246.756460]  __writeback_single_inode+0x70/0x490
[  246.756463]  ? check_preempt_wakeup+0x199/0x310
[  246.756467]  writeback_sb_inodes+0x2a2/0x660
[  246.756471]  ? is_empty_dir_inode+0x40/0x40
[  246.756474]  ? __writeback_single_inode+0x490/0x490
[  246.756477]  ? string+0xbf/0xf0
[  246.756480]  ? down_read_trylock+0x35/0x60
[  246.756484]  __writeback_inodes_wb+0x9f/0xf0
[  246.756488]  wb_writeback+0x41d/0x4b0
[  246.756492]  ? writeback_inodes_wb.constprop.55+0x150/0x150
[  246.756498]  ? set_worker_desc+0xf7/0x130
[  246.756502]  ? current_is_workqueue_rescuer+0x60/0x60
[  246.756511]  ? _find_next_bit+0x2c/0xa0
[  246.756514]  ? wb_workfn+0x400/0x5d0
[  246.756518]  wb_workfn+0x400/0x5d0
[  246.756521]  ? finish_task_switch+0xdf/0x2a0
[  246.756525]  ? inode_wait_for_writeback+0x30/0x30
[  246.756529]  process_one_work+0x3a7/0x6f0
[  246.756533]  worker_thread+0x82/0x750
[  246.756537]  kthread+0x16f/0x1c0
[  246.756541]  ? trace_event_raw_event_workqueue_work+0x110/0x110
[  246.756544]  ? kthread_create_worker_on_cpu+0xb0/0xb0
[  246.756548]  ret_from_fork+0x1f/0x30

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:49 -08:00
Jaegeuk Kim
d8a9a22992 f2fs: allow quota to use reserved blocks
This patch allows quota to use reserved blocks all the time.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:48 -08:00
Chao Yu
a2e2e76b23 f2fs: fix to drop all inmem pages correctly
In commit 57864ae5ce ("f2fs: limit # of inmemory pages"), we have
limited memory footprint of all inmem pages with 20% of total memory,
otherwise, if we exceed the threshold, we will try to drop all inmem
pages to avoid excessive memory pressure resulting in performance
regression.

But in some unrelated error paths, we will also drop all inmem pages,
which should be wrong, fix it in this patch.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:47 -08:00
Chao Yu
f3d98e74fc f2fs: speed up defragment on sparse file
We have supported to get next page offset with valid mapping crossing
hole in f2fs_map_blocks, utilizing it to speed up defragment on sparse
file.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:46 -08:00
Chao Yu
c4020b2da4 f2fs: support F2FS_IOC_PRECACHE_EXTENTS
This patch introduces a new ioctl F2FS_IOC_PRECACHE_EXTENTS to precache
extent info like ext4, in order to gain better performance during
triggering AIO by eliminating synchronous waiting of mapping info.

Referred commit: 7869a4a6c5 ("ext4: add support for extent pre-caching")

In addition, with newly added extent precache abilitiy, this patch add
to support FIEMAP_FLAG_CACHE in ->fiemap.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:45 -08:00
Jaegeuk Kim
1ad71a2712 f2fs: add an ioctl to disable GC for specific file
This patch gives a flag to disable GC on given file, which would be useful, when
user wants to keep its block map. It also conducts in-place-update for dontmove
file.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:35 -08:00
Daeho Jeong
9ac1e2d88d f2fs: prevent newly created inode from being dirtied incorrectly
Now, we invoke f2fs_mark_inode_dirty_sync() to make an inode dirty in
advance of creating a new node page for the inode. By this, some inodes
whose node page is not created yet can be linked into the global dirty
list.

If the checkpoint is executed at this moment, the inode will be written
back by writeback_single_inode() and finally update_inode_page() will
fail to detach the inode from the global dirty list because the inode
doesn't have a node page.

The problem is that the inode's state in VFS layer will become clean
after execution of writeback_single_inode() and it's still linked in
the global dirty list of f2fs and this will cause a kernel panic.

So, we will prevent the newly created inode from being dirtied during
the FI_NEW_INODE flag of the inode is set. We will make it dirty
right after the flag is cleared.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Tested-by: Hobin Woo <hobin.woo@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-18 22:09:12 -08:00
Chao Yu
442a9dbd57 f2fs: support FIEMAP_FLAG_XATTR
This patch enables ->fiemap to handle FIEMAP_FLAG_XATTR flag for xattr
mapping info lookup purpose.

It makes f2fs passing generic/425 test in fstest.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-18 22:09:11 -08:00
Chao Yu
f1b43d4cd5 f2fs: fix to cover f2fs_inline_data_fiemap with inode_lock
This patch fix to cover f2fs_inline_data_fiemap with inode_lock in order
to make that interface avoiding race with mapping change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-18 22:09:10 -08:00
Yunlei He
7dff55d27e f2fs: check node page again in write end io
Check node page again in write end io in case of
data corruption during inflght IO.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-18 22:09:09 -08:00
Chao Yu
25a912e51a f2fs: fix to caclulate required free section correctly
When calculating required free section during file defragmenting, we
should skip holes in file, otherwise we will probably fail to defrag
sparse file with large size.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-18 22:09:08 -08:00
Daeho Jeong
f1d2564a7c f2fs: handle newly created page when revoking inmem pages
When committing inmem pages is successful, we revoke already committed
blocks in __revoke_inmem_pages() and finally replace the committed
ones with the old blocks using f2fs_replace_block(). However, if
the committed block was newly created one, the address of the old
block is NEW_ADDR and __f2fs_replace_block() cannot handle NEW_ADDR
as new_blkaddr properly and a kernel panic occurrs.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Tested-by: Shu Tan <shu.tan@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-18 22:09:07 -08:00
Jaegeuk Kim
7c2e59632b f2fs: add resgid and resuid to reserve root blocks
This patch adds mount options to reserve some blocks via resgid=%u,resuid=%u.
It only activates with reserve_root=%u.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 15:40:02 -08:00
Yufen Yu
578c647879 f2fs: implement cgroup writeback support
Cgroup writeback requires explicit support from the filesystem.
f2fs's data and node writeback IOs go through __write_data_page,
which sets fio for submiting IOs. So, we add io_wbc for fio,
associate bios with blkcg by invoking wbc_init_bio() and
account IOs issuing by wbc_account_io().
In addtion, f2fs_fill_super() is updated to set SB_I_CGROUPWB.

Meta writeback IOs is left alone by this patch and will always be
attributed to the root cgroup.

The results show that f2fs can throttle writeback nicely for
data writing and file creating.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 15:40:01 -08:00
Chao Yu
bffa8d3b00 f2fs: remove unused pend_list_tag
In commit 78997b569f ("f2fs: split discard policy"), we have get rid
of using pend_list_tag field in struct discard_cmd_control, but forgot
to remove it, now do it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 15:40:00 -08:00
Chao Yu
49c60c67d2 f2fs: avoid high cpu usage in discard thread
We take very long time to finish generic/476, this is because we will
check consistence of all discard entries in global rb tree while
traversing all different granularity pending lists, even when the list
is empty, in order to avoid that unneeded overhead, we have to skip
the check when coming up an empty list.

generic/476 time consumption:
					cost
Before patch & w/o consistence check	57s
Before patch & w/ consistence check	1426s
After patch				78s

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 15:40:00 -08:00
Wei Yongjun
94b1e10e74 f2fs: make local functions static
Fixes the following sparse warnings:

fs/f2fs/segment.c:887:6: warning:
 symbol '__check_sit_bitmap' was not declared. Should it be static?
fs/f2fs/segment.c:1327:6: warning:
 symbol 'f2fs_wait_discard_bio' was not declared. Should it be static?
fs/f2fs/super.c:1661:5: warning:
 symbol 'f2fs_get_projid' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 15:39:59 -08:00
Jaegeuk Kim
7e65be49ed f2fs: add reserved blocks for root user
This patch allows root to reserve some blocks via mount option.

"-o reserve_root=N" means N x 4KB-sized blocks for root only.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 15:39:58 -08:00
Yunlong Song
2c1905042c f2fs: check segment type in __f2fs_replace_block
In some case, the node blocks has wrong blkaddr whose segment type is
NODE, e.g., recover inode has missing xattr flag and the blkaddr is in
the xattr range. Since fsck.f2fs does not check the recovery nodes, this
will cause __f2fs_replace_block change the curseg of node and do the
update_sit_entry(sbi, new_blkaddr, 1) with no next_blkoff refresh, as a
result, when recovery process write checkpoint and sync nodes, the
next_blkoff of curseg is used in the segment bit map, then it will
cause f2fs_bug_on. So let's check segment type in __f2fs_replace_block.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 15:39:57 -08:00
Yunlei He
1eca05aa9d f2fs: update inode info to inode page for new file
After checkpoint,
 1. creat a new file A ,(with dirty inode && dirty inode page && xattr info)
 2. backgroud wb write back file A inode page (without update from inode cache)
 3. fsync file A, write back inode page of file A with inode cache info
 4. sudden power off before new checkpoint

In this case, recovery process will try to recover a zero inode
page. Inline xattr flag of file A will be miss and xattr info
will be taken as blkaddr index.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 15:39:56 -08:00
Jaegeuk Kim
f66c027ead f2fs: show precise # of blocks that user/root can use
Let's show precise # of blocks that user/root can use through bavail and bfree
respectively.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 15:39:55 -08:00
Eric Biggers
3d204e24d4 fscrypt: remove 'ci' parameter from fscrypt_put_encryption_info()
fscrypt_put_encryption_info() is only called when evicting an inode, so
the 'struct fscrypt_info *ci' parameter is always NULL, and there cannot
be races with other threads.  This was cruft left over from the broken
key revocation code.  Remove the unused parameter and the cmpxchg().

Also remove the #ifdefs around the fscrypt_put_encryption_info() calls,
since fscrypt_notsupp.h defines a no-op stub for it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-11 23:30:13 -05:00
Eric Biggers
f2329cb687 f2fs: switch to fscrypt_get_symlink()
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-11 23:26:49 -05:00
Eric Biggers
393c038f5c f2fs: switch to fscrypt ->symlink() helper functions
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-11 23:26:49 -05:00
Ming Lei
263663cd3c block: convert to bio_first_bvec_all & bio_first_page_all
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Chao Yu
7f1a45a5b6 f2fs: clean up unneeded declaration
Commit 6afc662e68 ("f2fs: support flexible inline xattr size")
declared f2fs_sb_has_flexible_inline_xattr in f2fs.h for latter being
used in get_inline_xattr_addrs, but in latter version, related code
has been changed, leave f2fs_sb_has_flexible_inline_xattr w/o any
users. Let's remove it for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-03 22:48:34 -08:00
Chao Yu
d6d478a14b f2fs: continue to do direct IO if we only preallocate partial blocks
While doing direct IO, if we run out-of-space when we preallocate blocks,
we should not return ENOSPC error directly, instead, we should continue
to do following direct IO, which will keep directIO of f2fs acting like
other filesystems.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-03 22:48:33 -08:00
Jaegeuk Kim
6279398db7 f2fs: enable quota at remount from r to w
We have to enable quota only when remounting from read to write. Otherwise,
we'll get remount failure. (e.g., write to write case)

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-03 22:48:25 -08:00
Jaegeuk Kim
b1ca321d1c f2fs: skip stop_checkpoint for user data writes
We can give another chance to write user data, which can resolve
generic/441.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:31 -08:00
Jaegeuk Kim
d620439f25 f2fs: fix missing error number for xattr operation
This fixes generic/449 hang problem caused by no ENOSPC forever which should be
returned by setxattr under disk full scenario.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:31 -08:00
Jaegeuk Kim
0a007b97aa f2fs: recover directory operations by fsync
This fixes generic/342 which doesn't recover renamed file which was fsynced
before. It will be done via another fsync on newly created file.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:31 -08:00
Jaegeuk Kim
c39a1b348c f2fs: return error during fill_super
Let's avoid BUG_ON during fill_super, when on-disk was totall corrupted.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:31 -08:00
Yunlei He
211a6fa04c f2fs: fix an error case of missing update inode page
-Thread A                             Thread B

-write_checkpoint
 -block_operations
  -f2fs_unlock_all                    -f2fs_sync_file
                                       -f2fs_write_inode
                                        -f2fs_inode_synced
    -f2fs_sync_inode_meta
     -sync_node_pages
                                        -set_page_drity

In this case, if sudden power off without next new checkpoint,
the last inode page update will lost. wb_writeback is same with
fsync.

Yunlei also reproduced the bug by:

@@ -366,7 +366,7 @@ int update_inode(struct inode *inode, struct page *node_page)
        struct extent_tree *et = F2FS_I(inode)->extent_tree;

        f2fs_inode_synced(inode);
-
+       msleep(10000);
        f2fs_wait_on_page_writeback(node_page, NODE, true);

shell 1:                                       shell2:

dd if=/dev/zero of=./test bs=1M count=10
sync
echo "hello" >> ./test
fsync test  // sleep 10s
                                               sync //return quickly
echo c > /proc/sysrq-trigger

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:31 -08:00
Chao Yu
4635b46af2 f2fs: fix potential hangtask in f2fs_trace_pid
As Jia-Ju Bai reported:

"According to fs/f2fs/trace.c, the kernel module may sleep under a spinlock.
The function call path is:
f2fs_trace_pid (acquire the spinlock)
   f2fs_radix_tree_insert
     cond_resched --> may sleep

I do not find a good way to fix it, so I only report.
This possible bug is found by my static analysis tool (DSAC) and my code
review."

Obviously, it's problemetic to schedule in critical region of spinlock,
which will cause uninterruptable sleep if there is no waker.

This patch changes to use mutex lock intead of spinlock to avoid this
condition.

Reported-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:30 -08:00
Yunlei He
c376fc0f35 f2fs: no need return value in restore summary process
No need return value in restore summary process

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:30 -08:00
LiFan
fab2adee36 f2fs: use unlikely for release case
Since the variable release is only nonzero when another unlikely
case occurs, use unlikely() on it seems logical.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:30 -08:00
Chao Yu
f652e9d988 f2fs: don't return value in truncate_data_blocks_range
There is no caller cares about return value of truncate_data_blocks_range,
remove it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:30 -08:00
Chao Yu
4c2ac6a860 f2fs: clean up f2fs_map_blocks
f2fs_map_blocks():

if (blkaddr == NEW_ADDR || blkaddr == NULL_ADDR) {
	if (create) {
		...
	} else {
		...
		if (flag == F2FS_GET_BLOCK_FIEMAP &&
					blkaddr == NULL_ADDR) {
			...
		}
		if (flag != F2FS_GET_BLOCK_FIEMAP ||
					blkaddr != NEW_ADDR)
			goto sync_out;
	}

It means we can break the loop in cases of:
a) flag != F2FS_GET_BLOCK_FIEMAP or
b) flag == F2FS_GET_BLOCK_FIEMAP && blkaddr == NULL_ADDR

Condition b) is the same as previous one, so merge operations of them
for readability.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:30 -08:00
Chao Yu
416d2dbb4e f2fs: clean up hash codes
f2fs_chksum and f2fs_crc32 use the same 'crc32' crypto engine, also
their implementation are almost the same, except with different
shash description context.

Introduce __f2fs_crc32 to wrap the common codes, and reuse it in
f2fs_chksum and f2fs_crc32.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:30 -08:00
Chao Yu
bae01eda8e f2fs: fix error handling in fill_super
In fill_super, if we fail to call f2fs_build_stats(), it needs to detach
from global f2fs shrink list, otherwise once system starts to shrink slab
cache, we will encounter below panic:

BUG: unable to handle kernel paging request at 00007d35
Oops: 0002 [#1] PREEMPT SMP
EIP: __lock_acquire+0x70/0x12c0
Call Trace:
 lock_acquire+0xae/0x220
 mutex_trylock+0xc5/0xf0
 f2fs_shrink_count+0x32/0xb0 [f2fs]
 shrink_slab+0xf1/0x5b0
 drop_slab_node+0x35/0x60
 drop_slab+0xf/0x20
 drop_caches_sysctl_handler+0x79/0xc0
 proc_sys_call_handler+0xa4/0xc0
 proc_sys_write+0x1f/0x30
 __vfs_write+0x24/0x150
 SyS_write+0x44/0x90
 do_fast_syscall_32+0xa1/0x1ca
 entry_SYSENTER_32+0x4c/0x7b

In addition, this patch relocates f2fs_join_shrinker in fill_super to
avoid unneeded error handling of it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:30 -08:00
Chao Yu
4e6aad29bc f2fs: spread f2fs_k{m,z}alloc
Use f2fs_k{m,z}alloc as much as possible to increase fault injection
points.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:29 -08:00
Chao Yu
628b3d1438 f2fs: inject fault to kvmalloc
This patch supports to inject fault into kvmalloc/kvzalloc.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:29 -08:00
Chao Yu
acbf054d53 f2fs: inject fault to kzalloc
This patch introduces f2fs_kzalloc based on f2fs_kmalloc in order to
support error injection for kzalloc().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:29 -08:00
LiFan
979f492fe3 f2fs: remove a redundant conditional expression
Avoid checking is_inode repeatedly, and make the logic
a little bit clearer.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:29 -08:00
Hyunchul Lee
d5097be55c f2fs: apply write hints to select the type of segment for direct write
When blocks are allocated for direct write, select the type of
segment using the kiocb hint. But if an inode has FI_NO_ALLOC,
use the inode hint.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:29 -08:00
Eric Biggers
20bb2479be f2fs: switch to fscrypt_prepare_setattr()
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:29 -08:00
Eric Biggers
55899d7b49 f2fs: switch to fscrypt_prepare_lookup()
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:29 -08:00
Eric Biggers
2e45b07fda f2fs: switch to fscrypt_prepare_rename()
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:28 -08:00
Eric Biggers
b05157e772 f2fs: switch to fscrypt_prepare_link()
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:28 -08:00
Eric Biggers
2e168c82dc f2fs: switch to fscrypt_file_open()
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:28 -08:00
Elena Reshetova
6671726054 posix_acl: convert posix_acl.a_refcount from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable posix_acl.a_refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

**Important note for maintainers:

Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts.
The full comparison can be seen in
https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
in state to be merged to the documentation tree.
Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.
Please double check that you don't have some undocumented
memory guarantees for this variable usage.

For the posix_acl.a_refcount it might make a difference
in following places:
 - get_cached_acl(): increment in refcount_inc_not_zero() only
   guarantees control dependency on success vs. fully ordered
   atomic counterpart. However this operation is performed under
   rcu_read_lock(), so this should be fine.
 - posix_acl_release(): decrement in refcount_dec_and_test() only
   provides RELEASE ordering and control dependency on success
   vs. fully ordered atomic counterpart

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:28 -08:00
Zhikang Zhang
de8b10ac13 f2fs: remove repeated f2fs_bug_on
f2fs: remove repeated f2fs_bug_on which has already existed
      in function invalidate_blocks.

Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:28 -08:00
LiFan
736c0a7485 f2fs: remove an excess variable
Remove the variable page_idx which no one would miss.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:28 -08:00
Chao Yu
21020812c9 f2fs: fix lock dependency in between dio_rwsem & i_mmap_sem
test/generic/208 reports a potential deadlock as below:

Chain exists of:
  &mm->mmap_sem --> &fi->i_mmap_sem --> &fi->dio_rwsem[WRITE]

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&fi->dio_rwsem[WRITE]);
                               lock(&fi->i_mmap_sem);
                               lock(&fi->dio_rwsem[WRITE]);
  lock(&mm->mmap_sem);

This patch changes the lock dependency as below in fallocate() to
fix this issue:
- dio_rwsem
 - i_mmap_sem

Fixes: bb06664a53 ("f2fs: avoid race in between GC and block exchange")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:28 -08:00
Sheng Yong
e17d488bce f2fs: remove unused parameter
Commit d260081ccf ("f2fs: change recovery policy of xattr node block")
removes the use of blkaddr, which is no longer used. So remove the
parameter.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:27 -08:00
Sheng Yong
25006645d2 f2fs: still write data if preallocate only partial blocks
If there is not enough space left, f2fs_preallocate_blocks may only
preallocte partial blocks. As a result, the write operation fails
but i_blocks is not 0.  To avoid this, f2fs should write data in
non-preallocation way and write as many data as the size of i_blocks.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:27 -08:00
Sheng Yong
f6df8f234e f2fs: introduce sysfs readdir_ra to readahead inode block in readdir
This patch introduces a sysfs interface readdir_ra to enable/disable
readaheading inode block in f2fs_readdir. When readdir_ra is enabled,
it improves the performance of "readdir + stat".

For 300,000 files:
	time find /data/test > /dev/null
disable readdir_ra: 1m25.69s real  0m01.94s user  0m50.80s system
enable  readdir_ra: 0m18.55s real  0m00.44s user  0m15.39s system

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:27 -08:00
LiFan
5921aaa185 f2fs: fix concurrent problem for updating free bitmap
alloc_nid_failed and scan_nat_page can be called at the same time,
and we haven't protected add_free_nid and update_free_nid_bitmap
with the same nid_list_lock. That could lead to

Thread A				Thread B
- __build_free_nids
 - scan_nat_page
  - add_free_nid
					- alloc_nid_failed
					 - update_free_nid_bitmap
  - update_free_nid_bitmap

scan_nat_page will clear the free bitmap since the nid is PREALLOC_NID,
but alloc_nid_failed needs to set the free bitmap. This results in
free nid with free bitmap cleared.
This patch update the bitmap under the same nid_list_lock in add_free_nid.
And use __GFP_NOFAIL to make sure to update status of free nid correctly.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:27 -08:00
Chao Yu
2ab56a59ca f2fs: remove unneeded memory footprint accounting
We forgot to remov memory footprint accounting of per-cpu type
variables, fix it.

Fixes: 35782b233f ("f2fs: remove percpu_count due to performance regression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:27 -08:00
Yunlei He
66e8336137 f2fs: no need to read nat block if nat_block_bitmap is set
No need to read nat block if nat_block_bitmap is set.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:27 -08:00
Chao Yu
292c196a36 f2fs: reserve nid resource for quota sysfile
During mkfs, quota sysfiles have already occupied nid resource,
it needs to adjust remaining available nid count in kernel side.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-02 19:27:26 -08:00
Adam Borowski
91581e4c60 fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at
This link is replicated in most filesystems' config stanzas.  Referring
to an archived version of that site is pointless as it mostly deals with
patches; user documentation is available elsewhere.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Steve French <smfrench@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2018-01-01 12:45:37 -07:00
Linus Torvalds
1751e8a6cb Rename superblock flags (MS_xyz -> SB_xyz)
This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
            include/linux/fs.h include/uapi/linux/bfs_fs.h \
            security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
          DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
          POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
          I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
          ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-27 13:05:09 -08:00
Linus Torvalds
a02cd4229e f2fs-for-4.15-rc1
In this round, we introduce sysfile-based quota support which is required
 for Android by default. In addition, we allow that users are able to reserve
 some blocks in runtime to mitigate performance drops in low free space.
 
 Enhancement
 - assign proper data segments according to write_hints given by user
 - issue cache_flush on dirty devices only among multiple devices
 - exploit cp_error flag and add more faults to enhance fault injection test
 - conduct more readaheads during f2fs_readdir
 - add a range for discard commands
 
 Bug fix
 - fix zero stat->st_blocks when inline_data is set
 - drop crypto key and free stale memory pointer while evict_inode is failing
 - fix some corner cases in free space and segment management
 - fix wrong last_disk_size
 
 This series includes lots of clean-ups and code enhancement in terms of xattr
 operations, discard/flush command control. In addition, it adds versatile
 debugfs entries to monitor f2fs status.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAloNCPAACgkQQBSofoJI
 UNLYmg/8DbDp/mTXqJ0AURo84Z4OQUOTRxYkWazx4ct2WPZp2+5HCWDDoM8AAtUn
 1J6/t7cU3osjos+zWvpUREZq1SPbp5m0h818HBFFJ/YMBPXucdQcd6wpepniOR5J
 5uKauVd7jd2pbAAL7hKyr+iBSLrJl816wsq34Ml8y8zkDSJe4wO5YsGDqzqyKf4N
 8nxMavUgerb14I/qXPb3ljlYlfaNNRlCT649QGCG78gx5hPeiUtUJ2l5DKV2xPe7
 v+5lZO93FFwW1siGy+Atq+nqQJyUkeiOYGPR1NPx9tfmaPO58iOIXLirfblKASZY
 HXJigVf50fQQBtwdBFL8ICSop6zV6gCKkNGZCHLzcYFWWL2TQwCIP3/iJdj9Wy+j
 +YUYyN0dyl2mmNEDZjRNX1V+QBW1k+msmvBCb0fT1GJTQAyRfA4XfBDyg94cpWQ1
 9YivNywuzG8YtghY7gYU3lCfT2OG19nXCSdz4qYUb5SSwoeGtLahLxMV4mlil4Tg
 dOa8CPLFhJnCqB9ivI4L6SennBr+gNgL26SeZ3PF+B5KimYOTZxbenrll1kTi1xp
 uCU6UR1xJS0W7Cjk8sCIu5hXkJMJwPJ0hcVeTgsxMkujLGvSSRCGb2hmOeILfwRZ
 N4aGn+kVmwwgKaKjD/F4CY4b3yJLdTKMjjl74u5YaMQWe4Bq4qU=
 =c49T
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we introduce sysfile-based quota support which is
  required for Android by default. In addition, we allow that users are
  able to reserve some blocks in runtime to mitigate performance drops
  in low free space.

  Enhancements:
   - assign proper data segments according to write_hints given by user
   - issue cache_flush on dirty devices only among multiple devices
   - exploit cp_error flag and add more faults to enhance fault
     injection test
   - conduct more readaheads during f2fs_readdir
   - add a range for discard commands

  Bug fixes:
   - fix zero stat->st_blocks when inline_data is set
   - drop crypto key and free stale memory pointer while evict_inode is
     failing
   - fix some corner cases in free space and segment management
   - fix wrong last_disk_size

  This series includes lots of clean-ups and code enhancement in terms
  of xattr operations, discard/flush command control. In addition, it
  adds versatile debugfs entries to monitor f2fs status"

* tag 'f2fs-for-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (75 commits)
  f2fs: deny accessing encryption policy if encryption is off
  f2fs: inject fault in inc_valid_node_count
  f2fs: fix to clear FI_NO_PREALLOC
  f2fs: expose quota information in debugfs
  f2fs: separate nat entry mem alloc from nat_tree_lock
  f2fs: validate before set/clear free nat bitmap
  f2fs: avoid opened loop codes in __add_ino_entry
  f2fs: apply write hints to select the type of segments for buffered write
  f2fs: introduce scan_curseg_cache for cleanup
  f2fs: optimize the way of traversing free_nid_bitmap
  f2fs: keep scanning until enough free nids are acquired
  f2fs: trace checkpoint reason in fsync()
  f2fs: keep isize once block is reserved cross EOF
  f2fs: avoid race in between GC and block exchange
  f2fs: save a multiplication for last_nid calculation
  f2fs: fix summary info corruption
  f2fs: remove dead code in update_meta_page
  f2fs: remove unneeded semicolon
  f2fs: don't bother with inode->i_version
  f2fs: check curseg space before foreground GC
  ...
2017-11-16 12:10:21 -08:00
Mel Gorman
8667982014 mm, pagevec: remove cold parameter for pagevecs
Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot.  As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Jan Kara
67fd707f46 mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.  Just drop the argument.

Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara
8faab64229 f2fs: use find_get_pages_tag() for looking up single page
__get_first_dirty_index() wants to lookup only the first dirty page
after given index.  There's no point in using pagevec_lookup_tag() for
that.  Just use find_get_pages_tag() directly.

Link: http://lkml.kernel.org/r/20171009151359.31984-8-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara
028a63a6e3 f2fs: simplify page iteration loops
In several places we want to iterate over all tagged pages in a mapping.
However the code was apparently copied from places that iterate only
over a limited range and thus it checks for index <= end, optimizes the
case where we are coming close to range end which is all pointless when
end == ULONG_MAX.  So just remove this dead code.

[akpm@linux-foundation.org: fix warnings]
Link: http://lkml.kernel.org/r/20171009151359.31984-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Jan Kara
69c4f35d25 f2fs: use pagevec_lookup_range_tag()
We want only pages from given range in f2fs_write_cache_pages().  Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.

Link: http://lkml.kernel.org/r/20171009151359.31984-6-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Chao Yu
ead710b7d8 f2fs: deny accessing encryption policy if encryption is off
This patch adds missing feature check in encryption ioctl interface.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-15 08:30:19 -08:00
Linus Torvalds
32190f0afb fscrypt: lots of cleanups, mostly courtesy by Eric Biggers
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAloI8AUACgkQ8vlZVpUN
 gaMdjgf8CCW7UhPjoZYwF8sUNtAaX9+JZT1maOcXUhpJ3vRQiRn+AzRH6yBYMm79
 +NZBwVlk4dlEe55Wh4yFIStMAstqzCrke4C9CSbExjgHNsJdU4znyYuLRMbLfyO0
 6c4NObiAIKJdW1/te1aN90keGC6min8pBZot+FqZsRr+Kq2+IOtM43JAv7efOLev
 v3LCjUf9JKxatoB8tgw4AJRa1p18p7D2APWTG05VlFq63TjhVIYNvvwcQlizLwGY
 cuEq3X59FbFdX06fJnucujU3WP3ES4/3rhufBK4NNaec5e5dbnH2KlAx7J5SyMIZ
 0qUFB/dmXDSb3gsfScSGo1F71Ad0CA==
 =asAm
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Lots of cleanups, mostly courtesy by Eric Biggers"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
  fscrypt: lock mutex before checking for bounce page pool
  fscrypt: add a documentation file for filesystem-level encryption
  ext4: switch to fscrypt_prepare_setattr()
  ext4: switch to fscrypt_prepare_lookup()
  ext4: switch to fscrypt_prepare_rename()
  ext4: switch to fscrypt_prepare_link()
  ext4: switch to fscrypt_file_open()
  fscrypt: new helper function - fscrypt_prepare_setattr()
  fscrypt: new helper function - fscrypt_prepare_lookup()
  fscrypt: new helper function - fscrypt_prepare_rename()
  fscrypt: new helper function - fscrypt_prepare_link()
  fscrypt: new helper function - fscrypt_file_open()
  fscrypt: new helper function - fscrypt_require_key()
  fscrypt: remove unneeded empty fscrypt_operations structs
  fscrypt: remove ->is_encrypted()
  fscrypt: switch from ->is_encrypted() to IS_ENCRYPTED()
  fs, fscrypt: add an S_ENCRYPTED inode flag
  fscrypt: clean up include file mess
2017-11-14 11:35:15 -08:00
Chao Yu
812c60564c f2fs: inject fault in inc_valid_node_count
This patch adds missing fault injection in inc_valid_node_count.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-13 20:21:35 -08:00
Chao Yu
28cfafb738 f2fs: fix to clear FI_NO_PREALLOC
We need to clear FI_NO_PREALLOC flag in error path of f2fs_file_write_iter,
otherwise we will lose the chance to preallocate blocks in latter write()
at one time.

Fixes: dc91de78e5 ("f2fs: do not preallocate blocks which has wrong buffer")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-13 20:21:22 -08:00
Jaegeuk Kim
2c8a4a2823 f2fs: expose quota information in debugfs
This patch shows # of dirty pages and # of hidden quota files.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-13 18:29:01 -08:00
Yunlei He
12f9ef379a f2fs: separate nat entry mem alloc from nat_tree_lock
This patch splits memory allocation part in nat_entry to avoid lock contention.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-13 18:28:48 -08:00
LiFan
0dd99ca76f f2fs: validate before set/clear free nat bitmap
In flush_nat_entries, all dirty nats will be flushed and if
their new address isn't NULL_ADDR, their bitmaps will be updated,
the free_nid_count of the bitmaps will be increaced regardless
of whether the nats have already been occupied before.
This could lead to wrong free_nid_count.
So this patch checks the status of the bits beforeactually
set/clear them.

Fixes: 586d1492f3 ("f2fs: skip scanning free nid bitmap of full NAT blocks")
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-10 17:35:07 -08:00
Chao Yu
19526d74cf f2fs: avoid opened loop codes in __add_ino_entry
We will keep __add_ino_entry success all the time, for ENOMEM failure
case, we have already handled it by using  __GFP_NOFAIL flag, so we
don't have to use additional opened loop codes here, remove them.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-10 11:50:12 -08:00
Hyunchul Lee
4f0a03d34d f2fs: apply write hints to select the type of segments for buffered write
Write hints helps F2FS to determine which type of segments would be
selected for buffered write.

This patch implements the mapping from write hints to segment types
as shown below.

  hints               segment type
  -----               ------------
  WRITE_LIFE_SHORT    CURSEG_HOT_DATA
  WRITE_LIFE_EXTREME  CURSEG_COLD_DATA
  others              CURSEG_WARM_DATA

the F2FS poliy for hot/cold seperation has precedence over this hints.
And hints are not applied in in-place update.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-09 10:18:16 -08:00
Chao Yu
2fbaa25fde f2fs: introduce scan_curseg_cache for cleanup
Commit 4ac912427c ("f2fs: introduce free nid bitmap") copied codes
from __build_free_nids() into scan_free_nid_bits(), they are redundant,
introduce one common function scan_curseg_cache for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-09 10:12:26 -08:00
Fan Li
9745657449 f2fs: optimize the way of traversing free_nid_bitmap
We call scan_free_nid_bits only when there isn't many
free nids left, it means that marked bits in free_nid_bitmap
are supposed to be few, use find_next_bit_le is more
efficient in such case.
According to my tests, use find_next_bit_le instead of
test_bit_le will cut down the traversal time to one
third of its original.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-09 09:43:09 -08:00
Fan Li
74986213ad f2fs: keep scanning until enough free nids are acquired
In current version, after scan_free_nid_bits, the scan is over if
nid_cnt[FREE_NID] != 0. In most cases, there are still free nids in the
free list during the scan, and scan_free_nid_bits usually can't increase
nid_cnt[FREE_NID]. It causes that __build_free_nids is called many times
without solving the shortage of the free nids. This patch fixes that.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-09 09:41:08 -08:00
Chao Yu
a5fd505092 f2fs: trace checkpoint reason in fsync()
This patch slightly changes need_do_checkpoint to return the detail
info that indicates why we need do checkpoint, then caller could print
it with trace message.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-06 17:01:20 -08:00
Chao Yu
e8ed90a6d9 f2fs: keep isize once block is reserved cross EOF
Without FADVISE_KEEP_SIZE_BIT, we will try to recover file size
according to last non-hole block, so in fallocate(), we must set
FADVISE_KEEP_SIZE_BIT flag once we have preallocated block cross
EOF, instead of when all preallocation is success. Otherwise, file
size will be incorrect due to lack of this flag.

Simple testcase to reproduce this:

1. echo 2 > /sys/fs/f2fs/<device>/inject_type
2. echo 10 > /sys/fs/f2fs/<device>/inject_rate
3. run tests/generic/392
4. disable fault injection
5. do remount

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-06 17:00:09 -08:00
Chao Yu
bb06664a53 f2fs: avoid race in between GC and block exchange
During block exchange in {insert,collapse,move}_range, page-block mapping
is unstable due to mapping moving or recovery, so there should be no
concurrent cache read operation rely on such mapping, nor cache write
operation to mess up block exchange.

So this patch let background GC be aware of that.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:10 -08:00
Fan Li
f6986ede80 f2fs: save a multiplication for last_nid calculation
Use a slightly easier way to calculate last_nid.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:09 -08:00
Chao Yu
2b60311dd1 f2fs: fix summary info corruption
Sometimes, after running generic/270 of fstest, fsck reports summary
info and actual position of block address in direct node becoming
inconsistent.

The root cause is race in between __f2fs_replace_block and change_curseg
as below:

Thread A				Thread B
- __clone_blkaddrs
 - f2fs_replace_block
  - __f2fs_replace_block
   - segnoA = GET_SEGNO(sbi, blkaddrA);
   - type = se->type:=CURSEG_HOT_DATA
   - if (!IS_CURSEG(sbi, segnoA))
         type = CURSEG_WARM_DATA
					- allocate_data_block
					 - allocate_segment
					  - get_ssr_segment
					  - change_curseg(segnoA, CURSEG_HOT_DATA)
   - change_curseg(segnoA, CURSEG_WARM_DATA)
    - reset_curseg
     - __set_sit_entry_type
      - change se->type from CURSEG_HOT_DATA to CURSEG_WARM_DATA

So finally, hot curseg locates in segnoA, but type of segnoA becomes
CURSEG_WARM_DATA.

Then if we invoke __f2fs_replace_block(blkaddrB, blkaddrA, true, false),
as blkaddrA locates in segnoA, so we will move warm type curseg to segnoA,
then change its summary cache and writeback it to summary block.

But segnoA is used by hot type curseg too, once it moves or persist, it
will cover summary block content with inner old summary cache, result in
inconsistent status.

This patch tries to fix this issue by introduce global curseg lock to avoid
race in between __f2fs_replace_block and change_curseg.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:08 -08:00
Chao Yu
0537b81153 f2fs: remove dead code in update_meta_page
After commit a468f0ef51 ("f2fs: use crc and cp version to determine
roll-forward recovery"), last caller of update_meta_page passing @src
with NULL is gone, so remove related dead code there.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:07 -08:00
Chao Yu
dee668c143 f2fs: remove unneeded semicolon
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:06 -08:00
Jeff Layton
d1954ab4c9 f2fs: don't bother with inode->i_version
f2fs does not set the SB_I_VERSION flag, so the i_version will never
be incremented on write. It was recently changed to increment the
i_version on a quota write, which isn't necessary here.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:05 -08:00
Chao Yu
bf34c93d26 f2fs: check curseg space before foreground GC
When we are closing to trigger foreground GC, if there are only a few
of dirty metas, we can log these dirty metas in left space of opened
segments instead of triggering foreground GC.

With this patch, total count of foreground GC triggered by
test/generic/* of fstest suit reduce from 254 to 184.

So let's do the check before foreground GC anyway.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:04 -08:00
Chao Yu
3d26fa6be3 f2fs: use rw_semaphore to protect SIT cache
There are some cases user didn't update SIT cache under this lock,
so let's use rw_semaphore instead of mutex to enhance concurrently
accessing.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:03 -08:00
Jaegeuk Kim
ea6767337f f2fs: support quota sys files
This patch supports hidden quota files in the system, which will be used for
Android. It requires up-to-date f2fs-tools later than v1.9.0.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:02 -08:00
Jaegeuk Kim
234a968961 f2fs: add quota_ino feature infra
This patch adds quota_ino feature infra to be used for quota files.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:01 -08:00
Fan Li
37a0ab2a3b f2fs: optimize __update_nat_bits
Make three modification for __update_nat_bits:
1. Take the codes of dealing the nat with nid 0 out of the loop
    Such nat only needs to be dealt with once at beginning.
2. Use " nat_index == 0" instead of " start_nid == 0" to decide if it's the first nat block
    It's better that we don't assume @start_nid is the first nid of the nat block it's in.
3. Use " if (nat_blk->entries[i].block_addr != NULL_ADDR)" to explicitly comfirm the value of block_addr
    use constant to make sure the codes is right, even if the value of NULL_ADDR changes.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:00 -08:00
Yunlei He
f15194fcfa f2fs: modify for accurate fggc node io stat
modify for accurate fggc node io stat

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:59 -08:00
Yunlong Song
65f1b80b33 Revert "f2fs: handle dirty segments inside refresh_sit_entry"
This reverts commit 5e443818fa

The commit should be reverted because call sequence of below two parts
of code must be kept:
a. update sit information, it needs to be updated before segment
allocation since latter allocation may trigger SSR, and SSR allocation
needs latest valid block information of all segments.
b. update segment status, it needs to be updated after segment allocation
since we can skip updating current opened segment status.

Fixes: 5e443818fa ("f2fs: handle dirty segments inside refresh_sit_entry")
Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: remove refresh_sit_entry function]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:58 -08:00
Fan Li
a0761f63ea f2fs: add a function to move nid
This patch add a new function to move nid from one state to another.
Move operation is heavily used, by adding a new function for it
we can cut down some branches from several flow.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:58 -08:00
Chao Yu
a2a12b679f f2fs: export SSR allocation threshold
This patch exports min_ssr_segments threshold in sysfs to let user
control triggering SSR allocation flexibly.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:57 -08:00
Chao Yu
0ea805129d f2fs: give correct trimmed blocks in fstrim
We have supported to issue discard in specified range during fstrim,
it needs to return caller with successfully trimmed bytes in that
range instead of bytes of invalid blocks which are scanned in
checkpoint.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:56 -08:00
Chao Yu
d62fe97148 f2fs: support bio allocation error injection
This patch adds to support bio allocation error injection to simulate
out-of-memory test scenario.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:55 -08:00
Chao Yu
01eccef793 f2fs: support get_page error injection
This patch adds to support get_page error injection to simulate
out-of-memory test scenario.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:54 -08:00
Yunlong Song
80d4214501 f2fs: support soft block reservation
It supports to extend reserved_blocks sysfs interface to be soft
threshold, which allows user configure it exceeding current available
user space. This patch also introduces a new sysfs interface called
current_reserved_blocks, which shows the current blocks that have
already been reserved.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:52 -08:00
Jaegeuk Kim
bf9c142785 f2fs: handle error case when adding xattr entry
This patch fixes recovering incomplete xattr entries remaining in inline xattr
and xattr block, caused by any kind of errors.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:51 -08:00
Chao Yu
6afc662e68 f2fs: support flexible inline xattr size
Now, in product, more and more features based on file encryption were
introduced, their demand of xattr space is increasing, however, inline
xattr has fixed-size of 200 bytes, once inline xattr space is full, new
increased xattr data would occupy additional xattr block which may bring
us more space usage and performance regression during persisting.

In order to resolve above issue, it's better to expand inline xattr size
flexibly according to user's requirement.

So this patch introduces new filesystem feature 'flexible inline xattr',
and new mount option 'inline_xattr_size=%u', once mkfs enables the
feature, we can use the option to make f2fs supporting flexible inline
xattr size.

To support this feature, we add extra attribute i_inline_xattr_size in
inode layout, indicating that how many space inline xattr borrows from
block address mapping space in inode layout, by this, we can easily
locate and store flexible-sized inline xattr data in inode.

Inode disk layout:
  +----------------------+
  | .i_mode              |
  | ...                  |
  | .i_ext               |
  +----------------------+
  | .i_extra_isize       |
  | .i_inline_xattr_size |-----------+
  | ...                  |           |
  +----------------------+           |
  | .i_addr              |           |
  |  - block address or  |           |
  |  - inline data       |           |
  +----------------------+<---+      v
  |    inline xattr      |    +---inline xattr range
  +----------------------+<---+
  | .i_nid               |
  +----------------------+
  |   node_footer        |
  | (nid, ino, offset)   |
  +----------------------+

Note that, we have to cnosider backward compatibility which reserved
inline_data space, 200 bytes, all the time, reported by Sheng Yong.

Previous inline data or directory always reserved 200 bytes in inode layout,
even if inline_xattr is disabled. In order to keep inline_dentry's structure
for backward compatibility, we get the space back only from inline_data.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reported-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:50 -08:00
Jaegeuk Kim
b4b153f8c2 f2fs: show current cp state
This patch shows whether checkpoint met any error case.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:49 -08:00
Jaegeuk Kim
d8d1389ea1 f2fs: add missing quota_initialize
This patch adds to call quota_intialize in f2fs_set_acl, f2fs_unlink,
and f2fs_rename.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:48 -08:00
Jaegeuk Kim
8f1572f7ce f2fs: show # of dirty segments via sysfs
This patch adds one sysfs entry to show # of dirty segments which can be
used for gc timing by user.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:47 -08:00
Jaegeuk Kim
1f227a3e21 f2fs: stop all the operations by cp_error flag
This patch replaces to use cp_error flag instead of RDONLY for quota off.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:43 -08:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Colin Ian King
dca6951f5a f2fs: remove several redundant assignments
There are several assignments to variables that are redundant
as the values are never read when the variables are updated later
and so the redundant statements can be safely removed.

Cleans up clang warnings:
fs/f2fs/segment.c:923:19: warning: Value stored to 'p' during its initialization is never read
fs/f2fs/segment.c:2060:2: warning: Value stored to 'hint' is never read
fs/f2fs/segment.c:2353:2: warning: Value stored to 'start_block' is never read
fs/f2fs/segment.c:2354:2: warning: Value stored to 'end_block' is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:26 +02:00
Arnd Bergmann
6bccfa19bb f2fs: avoid using timespec
All uses of timespec are deprecated, and this one is not particularly
useful, as the documented method for converting seconds to jiffies
is to multiply by 'HZ'.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:25 +02:00
Chao Yu
7e515b31d4 f2fs: fix to correct no_fggc_candidate
There may be extreme case as below:

For one section contains one segment, and there are total 100 segments
with 10% over-privision ratio in f2fs partition, fggc_threshold will
be rounded down to 460 instead of 460.8 as below caclulation:

sbi->fggc_threshold = div_u64((u64)(main_count - ovp_count) *
			BLKS_PER_SEC(sbi), (main_count - resv_count));

If section usage is as:
60 segments which contain 460 valid blocks
40 segments which contain 462 valid blocks

As valid block number in all sections is large than fggc_threshold, so
none of them will be chosen as candidate due to incorrect fggc_threshold.

Let's just soften the term of choosing foreground GC candidates.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:24 +02:00
Jaegeuk Kim
6e5b5d41c9 Revert "f2fs: return wrong error number on f2fs_quota_write"
This reverts commit 4f31d26b0c.

It turns out that we need to report error number if nothing was written.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:23 +02:00
Jaegeuk Kim
9c77f754f8 f2fs: remove obsolete pointer for truncate_xattr_node
This patch removes obosolete parameter for truncate_xattr_node.

Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:22 +02:00
Jaegeuk Kim
4e46a023c5 f2fs: retry ENOMEM for quota_read|write
This gives another chance to read or write quota data.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:22 +02:00
Jaegeuk Kim
57864ae5ce f2fs: limit # of inmemory pages
If some abnormal users try lots of atomic write operations, f2fs is able to
produce pinned pages in the main memory which affects system performance.
This patch limits that as 20% over total memory size, and if f2fs reaches
to the limit, it will drop all the inmemory pages.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:21 +02:00
Chao Yu
ab383be510 f2fs: update ctx->pos correctly when hitting hole in directory
This patch fixes to update ctx->pos correctly when hitting hole in
directory.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:20 +02:00
Chao Yu
cb7a844865 f2fs: relocate readahead codes in readdir()
Previously, for large directory, we just do readahead only once in
readdir(), readdir()'s performance may drop when traversing latter
blocks. In order to avoid this, relocate readahead codes to covering
all traverse flow.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:19 +02:00
Chao Yu
4414dea8d3 f2fs: allow readdir() to be interrupted
This patch follows ext4 to allow readdir() in large empty directory to
be interrupted. Referenced commit of ext4: 1f60fbe727 ("ext4: allow
readdir()'s of large empty directories to be interrupted").

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:18 +02:00
Chao Yu
e97a3c4c6f f2fs: trace f2fs_readdir
This patch adds trace for f2fs_readdir.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:17 +02:00
Chao Yu
0c5e36db17 f2fs: trace f2fs_lookup
This patch adds trace for f2fs_lookup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:16 +02:00
Weichao Guo
48ab25f486 f2fs: skip searching non-exist range in truncate_hole
Let's skip entire non-exist area to speed up truncate_hole by
using get_next_page_offset.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:16 +02:00
Jaegeuk Kim
5b4267d195 f2fs: expose some sectors to user in inline data or dentry case
If there's some data written through inline data or dentry, we need to shouw
st_blocks. This fixes reporting zero blocks even though there is small written
data.

Cc: stable@vger.kernel.org
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: avoid link file for quotacheck]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:15 +02:00
Jaegeuk Kim
943973cd52 f2fs: avoid stale fi->gdirty_list pointer
When doing fault injection test, f2fs_evict_inode() didn't remove gdirty_list
which incurs a kernel panic due to wrong pointer access.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:14 +02:00
Jaegeuk Kim
204b4ae067 f2fs/crypto: drop crypto key at evict_inode only
This patch avoids dropping crypto key in f2fs_drop_inode, so we can guarantee
it happens only at evict_inode.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:13 +02:00
Chao Yu
a0d00fad35 f2fs: fix to avoid race when accessing last_disk_size
last_disk_size could be wrong due to concurrently updating, so using
i_sem semaphore to make last_disk_size updating exclusive to fix this
issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:12 +02:00
Thomas Meyer
ebf7c522fd f2fs: Fix bool initialization/comparison
Bool initializations should use true and false. Bool tests don't need
comparisons.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:11 +02:00
Chao Yu
cf5c759f92 f2fs: give up CP_TRIMMED_FLAG if it drops discards
In ->umount, once we drop remained discard entries, we should not
set CP_TRIMMED_FLAG with another checkpoint.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:10 +02:00
Chao Yu
2ec6f2ef79 f2fs: trace f2fs_remove_discard
This patch adds tracepoint to trace f2fs_remove_discard.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:09 +02:00
Chao Yu
33da62cf7a f2fs: reduce cmd_lock coverage in __issue_discard_cmd
__submit_discard_cmd may lead long latency due to exhaustion of I/O
request resource in block layer, so issuing all discard under cmd_lock
may lead to hangtask, in order to avoid that, let's reduce it's coverage.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:09 +02:00
Chao Yu
78997b569f f2fs: split discard policy
There are many different scenarios such as fstrim, umount, urgent or
background where we will issue discards, actually, they need use
different policy in aspect of io aware, discard granularity, delay
interval and so on. But now they just share one common discard policy,
so there will be race when changing policy in between these scenarios,
the interference of changing discard policy will be very serious.

This patch changes to split discard policy for different scenarios.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:08 +02:00
Chao Yu
ecc9aa00db f2fs: wrap discard policy
This patch wraps scattered optional parameters into discard policy as
below, later, with it we expect that we can adjust these parameters with
proper strategy in different scenario.

struct discard_policy {
	unsigned int min_interval;	/* used for candidates exist */
	unsigned int max_interval;	/* used for candidates not exist */
	unsigned int max_requests;	/* # of discards issued per round */
	unsigned int io_aware_gran;	/* minimum granularity discard not be aware of I/O */
	bool io_aware;			/* issue discard in idle time */
	bool sync;			/* submit discard with REQ_SYNC flag */
};

This patch doesn't change any logic of codes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:07 +02:00
Chao Yu
8412663d17 f2fs: support issuing/waiting discard in range
Fstrim intends to trim invalid blocks of filesystem only with specified
range and granularity, but actually, it will issue all previous cached
discard commands which may be out-of-range and be with unmatched
granularity, it's unneeded.

In order to fix above issues, this patch introduces new helps to support
to issue and wait discard in range and adds a new fstrim_list for tracking
in-flight discard from ->fstrim.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-26 10:44:06 +02:00
Eric Biggers
ffcc41829a fscrypt: remove unneeded empty fscrypt_operations structs
In the case where a filesystem has been configured without encryption
support, there is no longer any need to initialize ->s_cop at all, since
none of the methods are ever called.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-10-18 19:52:37 -04:00
Eric Biggers
f7293e48bb fscrypt: remove ->is_encrypted()
Now that all callers of fscrypt_operations.is_encrypted() have been
switched to IS_ENCRYPTED(), remove ->is_encrypted().

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-10-18 19:52:37 -04:00
Eric Biggers
2ee6a576be fs, fscrypt: add an S_ENCRYPTED inode flag
Introduce a flag S_ENCRYPTED which can be set in ->i_flags to indicate
that the inode is encrypted using the fscrypt (fs/crypto/) mechanism.

Checking this flag will give the same information that
inode->i_sb->s_cop->is_encrypted(inode) currently does, but will be more
efficient.  This will be useful for adding higher-level helper functions
for filesystems to use.  For example we'll be able to replace this:

	if (ext4_encrypted_inode(inode)) {
		ret = fscrypt_get_encryption_info(inode);
		if (ret)
			return ret;
		if (!fscrypt_has_encryption_key(inode))
			return -ENOKEY;
	}

with this:

	ret = fscrypt_require_key(inode);
	if (ret)
		return ret;

... since we'll be able to retain the fast path for unencrypted files as
a single flag check, using an inline function.  This wasn't possible
before because we'd have had to frequently call through the
->i_sb->s_cop->is_encrypted function pointer, even when the encryption
support was disabled or not being used.

Note: we don't define S_ENCRYPTED to 0 if CONFIG_FS_ENCRYPTION is
disabled because we want to continue to return an error if an encrypted
file is accessed without encryption support, rather than pretending that
it is unencrypted.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-10-18 19:52:36 -04:00
Dave Chinner
734f0d241d fscrypt: clean up include file mess
Filesystems have to include different header files based on whether they
are compiled with encryption support or not. That's nasty and messy.

Instead, rationalise the headers so we have a single include fscrypt.h
and let it decide what internal implementation to include based on the
__FS_HAS_ENCRYPTION define.  Filesystems set __FS_HAS_ENCRYPTION to 1
before including linux/fscrypt.h if they are built with encryption
support.  Otherwise, they must set __FS_HAS_ENCRYPTION to 0.

Add guards to prevent fscrypt_supp.h and fscrypt_notsupp.h from being
directly included by filesystems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[EB: use 1 and 0 rather than defined/undefined]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-10-18 19:52:36 -04:00
Chao Yu
1228b482c4 f2fs: fix to flush multiple device in checkpoint
If f2fs manages multiple devices, in checkpoint, we need to issue flush
in those devices which contain dirty data/node in their cache before
we write checkpoint region, otherwise, filesystem metadata could be
corrupted if hitting SPO after checkpoint.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-10 12:49:53 -07:00
Chao Yu
39d787bec4 f2fs: enhance multiple device flush
When multiple device feature is enabled, during ->fsync we will issue
flush in all devices to make sure node/data of the file being persisted
into storage. But some flushes of device could be unneeded as file's
data may be not writebacked into those devices. So this patch adds and
manage bitmap per inode in global cache to indicate which device is
dirty and it needs to issue flush during ->fsync, hence, we could improve
performance of fsync in scenario of multiple device.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-10 12:49:53 -07:00
Chao Yu
b77061bfcb f2fs: fix to show ino management cache size correctly
It needs to stat size of ino management cache with all type instead of
orphan ino type.

Fixes: 652be55162 ("f2fs: show # of orphan inodes")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-10 12:49:53 -07:00
Chao Yu
3f06252f7a f2fs: drop FI_UPDATE_WRITE tag after f2fs_issue_flush
If we failed to issue flush in ->fsync, we need to keep FI_UPDATE_WRITE
flag to make sure triggering flush in next ->fsync.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-10 12:49:53 -07:00
Chao Yu
9a4ffdf558 f2fs: obsolete ALLOC_NID_LIST list
As Fan Li reported, there is no user traversing nid_list[ALLOC_NID_LIST]
which is used for tracking preallocated nids. Let's drop it, and only
track preallocated nids in free_nid_root radix-tree.

Reported-by: Fan Li <fanofcode.li@samsung.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-10 12:49:53 -07:00
Weichao Guo
71ad682c1c f2fs: convert inline data for direct I/O & FI_NO_PREALLOC
In FI_NO_PREALLOC cases, direct I/O path may allocate blocks for an
inode but keep its inline data flag. This inconsistency may trigger
vfs clear_inode nrpages bug_on when evicting the inode. We should
convert inline data first in this case.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-10 12:49:52 -07:00
Hsiang Kao
71cb4afff8 f2fs: allow readpages with NULL file pointer
Keep in line with the other Linux file system implementations
since page_cache_sync_readahead supports NULL file pointer,
and thus we can readahead data by f2fs itself without file opening
(something like the btrfs behavior).

Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-10 12:49:52 -07:00
Chao Yu
14d8d5f7de f2fs: show flush list status in sysfs
This patch adds to show flush list status in sysfs.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-10 12:49:52 -07:00
Chao Yu
63840695f6 f2fs: introduce read_xattr_block
Commit ba38c27eb9 ("f2fs: enhance lookup xattr") introduces
lookup_all_xattrs duplicating from read_all_xattrs, which leaves
lots of similar codes in between them, so introduce new help
read_xattr_block to clean up redundant codes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-10 12:49:52 -07:00
Chao Yu
a5f433f741 f2fs: introduce read_inline_xattr
Commit ba38c27eb9 ("f2fs: enhance lookup xattr") introduces
lookup_all_xattrs duplicating from read_all_xattrs, which leaves
lots of similar codes in between them, so introduce new help
read_inline_xattr to clean up redundant codes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-10 12:49:52 -07:00
Chao Yu
c1fe3e9814 Revert "f2fs: reuse nids more aggressively"
Commit 2683446646 ("f2fs: reuse nids more aggressively") tries to
reuse nids as many as possilbe, in order to mitigate producing obsolete
node pages in page cache.

But acutally, before we reuse the nids and related node page cache,
we will always invalidate that node page, so there will be not any
obsolete node pages in cache.

Let's just revert previous commit, so that nm_i::next_scan_nid can be
increased ascendingly, making __build_free_nids traverses all NAT pages
more easily, finally, free nid bitmap cache can be enabled as soon as
possible.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-10 12:49:52 -07:00
Yunlong Song
91f4382b50 Revert "f2fs: node segment is prior to data segment selected victim"
This reverts commit b9cd20619e.

That patch causes much fewer node segments (which can be used for SSR)
than before, and in the corner case (e.g. create and delete *.txt files in
one same directory, there will be very few node segments but many data
segments), if the reserved free segments are all used up during gc, then
the write_checkpoint can still flush dentry pages to data ssr segments,
but will probably fail to flush node pages to node ssr segments, since
there are not enough node ssr segments left (the left ones are all
full).

So revert this patch to give a fair chance to let node segments remain
for SSR, which provides more robustness for corner cases.

Conflicts:
	fs/f2fs/gc.c

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-10 12:49:51 -07:00
Chao Yu
638164a271 f2fs: fix potential panic during fstrim
As Ju Hyung Park reported:

"When 'fstrim' is called for manual trim, a BUG() can be triggered
randomly with this patch.

I'm seeing this issue on both x86 Desktop and arm64 Android phone.

On x86 Desktop, this was caused during Ubuntu boot-up. I have a
cronjob installed which calls 'fstrim -v /' during boot. On arm64
Android, this was caused during GC looping with 1ms gc_min_sleep_time
& gc_max_sleep_time."

Root cause of this issue is that f2fs_wait_discard_bios can only be
used by f2fs_put_super, because during put_super there must be no
other referrers, so it can ignore discard entry's reference count
when removing the entry, otherwise in other caller we will hit bug_on
in __remove_discard_cmd as there may be other issuer added reference
count in discard entry.

Thread A				Thread B
					- issue_discard_thread
- f2fs_ioc_fitrim
 - f2fs_trim_fs
  - f2fs_wait_discard_bios
   - __issue_discard_cmd
    - __submit_discard_cmd
					 - __wait_discard_cmd
					  - dc->ref++
					  - __wait_one_discard_bio
   - __wait_discard_cmd
    - __remove_discard_cmd
     - f2fs_bug_on(sbi, dc->ref)

Fixes: 969d1b180d
Reported-by: Ju Hyung Park <qkrwngud825@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-03 08:06:05 -07:00
Linus Torvalds
6d8ef53e8b for-f2fs-4.14
In this round, we've mostly tuned f2fs to provide better user experience
 for Android. Especially, we've worked on atomic write feature again with
 SQLite community in order to support it officially. And we added or modified
 several facilities to analyze and enhance IO behaviors.
 
 Major changes include:
 - add app/fs io stat
 - add inode checksum feature
 - support project/journalled quota
 - enhance atomic write with new ioctl() which exposes feature set
 - enhance background gc/discard/fstrim flows with new gc_urgent mode
 - add F2FS_IOC_FS{GET,SET}XATTR
 - fix some quota flows
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlm4brsACgkQQBSofoJI
 UNK6dw/+Jd0j2whU5oxRcxZ6aL1Pj9fp2IdnGk1NbAI2mKAAlGknE/8CDS9OOMdO
 y8O0x3H271DXTMfHJAq2pAfJzcMhiT/Wmw2UsHvmHU0mPmfDcSBKEqPQj6Nbl483
 4s1dyt20InfHsVaKhUWAhov14bxLSiQTfeFH0SL2qv/NTp1Xlp6mwQvKCrmNNxud
 coUL45Zk5uVAVckR0hsyfqudvdXM1LTDG0Y6/j0IaFtO9HqyAEgkILiSqL65TpBV
 2OrXsTf0p2HN9g8vSUUouyD4Oj+q1OHt+VN7gw03xXm3TqAaqnkpIq/dtGLEPyM5
 HD6Q2nDHDTLeKO2Ibi9C0f+bph4UqrCq/eoAjG1sM+6Sm+Hyf193FLR/E2R9aj8w
 ++lCoHUSf/krrMs9d+vnNWaTsKszAbAQRLiZaSHi21+0lcDZtYejNsm52LpDMAfO
 jzz+TTOvXTSlHWSlt8DRKVolNhMRFy9OYIJ0schYYD6FJldARmBMfcZosrhL1Xoh
 oU/bBaXwMv1XOWAOGCQbGrqREiciqXbKDGPQJq65Zvn60U6YzZf04wDbm0zXku5E
 x7S8kPxz8c/010JHIxvULZRamlvXSjFevbAa+QtNsEhlj6DkDSdisMj+w7/jU4Yx
 uInHojIq7ARJO0SBIoYFkz3+/2w++McCK0b/gpx1WHsN8I013zs=
 =w4KH
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've mostly tuned f2fs to provide better user
  experience for Android. Especially, we've worked on atomic write
  feature again with SQLite community in order to support it officially.
  And we added or modified several facilities to analyze and enhance IO
  behaviors.

  Major changes include:
   - add app/fs io stat
   - add inode checksum feature
   - support project/journalled quota
   - enhance atomic write with new ioctl() which exposes feature set
   - enhance background gc/discard/fstrim flows with new gc_urgent mode
   - add F2FS_IOC_FS{GET,SET}XATTR
   - fix some quota flows"

* tag 'f2fs-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (63 commits)
  f2fs: hurry up to issue discard after io interruption
  f2fs: fix to show correct discard_granularity in sysfs
  f2fs: detect dirty inode in evict_inode
  f2fs: clear radix tree dirty tag of pages whose dirty flag is cleared
  f2fs: speed up gc_urgent mode with SSR
  f2fs: better to wait for fstrim completion
  f2fs: avoid race in between read xattr & write xattr
  f2fs: make get_lock_data_page to handle encrypted inode
  f2fs: use generic terms used for encrypted block management
  f2fs: introduce f2fs_encrypted_file for clean-up
  Revert "f2fs: add a new function get_ssr_cost"
  f2fs: constify super_operations
  f2fs: fix to wake up all sleeping flusher
  f2fs: avoid race in between atomic_read & atomic_inc
  f2fs: remove unneeded parameter of change_curseg
  f2fs: update i_flags correctly
  f2fs: don't check inode's checksum if it was dirtied or writebacked
  f2fs: don't need to update inode checksum for recovery
  f2fs: trigger fdatasync for non-atomic_write file
  f2fs: fix to avoid race in between aio and gc
  ...
2017-09-12 20:05:58 -07:00
Chao Yu
e6c6de18f0 f2fs: hurry up to issue discard after io interruption
Once we encounter I/O interruption during issuing discards, we will delay
long time before next round, but if system status is I/O idle during the
time, it may loses opportunity to issue discards. So this patch changes
to hurry up to issue discard after io interruption.

Besides, this patch also fixes to issue discards accurately with assigned
rate.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-12 10:02:55 -07:00
Chao Yu
80647e5f4c f2fs: fix to show correct discard_granularity in sysfs
Fix below incorrect display when reading discard_granularity sysfs node.

$ cat /sys/fs/f2fs/<device>/discard_granularity
$ 16
$ echo 32 > /sys/fs/f2fs/<device>/discard_granularity
$ cat /sys/fs/f2fs/<device>/discard_granularity
$ 16

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-12 10:02:47 -07:00
Chao Yu
ca7d802a7d f2fs: detect dirty inode in evict_inode
Add a bugon in f2fs_evict_inode to detect inconsistent status between
inode cache and related node page cache.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-12 10:02:39 -07:00
Daeho Jeong
0abd8e70d2 f2fs: clear radix tree dirty tag of pages whose dirty flag is cleared
On a senario like writing out the first dirty page of the inode
as the inline data, we only cleared dirty flags of the pages, but
didn't clear the dirty tags of those pages in the radix tree.

If we don't clear the dirty tags of the pages in the radix tree, the
inodes which contain the pages will be marked with I_DIRTY_PAGES again
and again, and writepages() for the inodes will be invoked in every
writeback period. As a result, nothing will be done in every
writepages() for the inodes and it will just consume CPU time
meaninglessly.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-11 21:32:38 -07:00
Jaegeuk Kim
b3a97a2a9a f2fs: speed up gc_urgent mode with SSR
This patch activates SSR in gc_urgent mode.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-11 17:22:18 -07:00
Jaegeuk Kim
1eb1ef4a8e f2fs: better to wait for fstrim completion
In android, we'd better wait for fstrim completion instead of issuing the
discard commands asynchronous.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-11 17:22:12 -07:00
Jérôme Glisse
2916ecc0f9 mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
Introduce a new migration mode that allow to offload the copy to a device
DMA engine.  This changes the workflow of migration and not all
address_space migratepage callback can support this.

This is intended to be use by migrate_vma() which itself is use for thing
like HMM (see include/linux/hmm.h).

No additional per-filesystem migratepage testing is needed.  I disables
MIGRATE_SYNC_NO_COPY in all problematic migratepage() callback and i
added comment in those to explain why (part of this patch).  The commit
message is unclear it should say that any callback that wish to support
this new mode need to be aware of the difference in the migration flow
from other mode.

Some of these callbacks do extra locking while copying (aio, zsmalloc,
balloon, ...) and for DMA to be effective you want to copy multiple
pages in one DMA operations.  But in the problematic case you can not
easily hold the extra lock accross multiple call to this callback.

Usual flow is:

For each page {
 1 - lock page
 2 - call migratepage() callback
 3 - (extra locking in some migratepage() callback)
 4 - migrate page state (freeze refcount, update page cache, buffer
     head, ...)
 5 - copy page
 6 - (unlock any extra lock of migratepage() callback)
 7 - return from migratepage() callback
 8 - unlock page
}

The new mode MIGRATE_SYNC_NO_COPY:
 1 - lock multiple pages
For each page {
 2 - call migratepage() callback
 3 - abort in all problematic migratepage() callback
 4 - migrate page state (freeze refcount, update page cache, buffer
     head, ...)
} // finished all calls to migratepage() callback
 5 - DMA copy multiple pages
 6 - unlock all the pages

To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a
new callback migratepages() (for instance) that deals with multiple
pages in one transaction.

Because the problematic cases are not important for current usage I did
not wanted to complexify this patchset even more for no good reason.

Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:46 -07:00
Yunlei He
27161f13e3 f2fs: avoid race in between read xattr & write xattr
Thread A:					Thread B:
-f2fs_getxattr
   -lookup_all_xattrs
      -xnid = F2FS_I(inode)->i_xattr_nid;
						-f2fs_setxattr
						    -__f2fs_setxattr
						        -write_all_xattrs
						            -truncate_xattr_node
							          ...  ...
						-write_checkpoint
								  ...  ...
						-alloc_nid   <- nid reuse
          -get_node_page
              -f2fs_bug_on  <- nid != node_footer->nid

It's need a rw_sem to avoid the race

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-07 20:57:20 -07:00
Jaegeuk Kim
13ba41e346 f2fs: make get_lock_data_page to handle encrypted inode
This patch refactors get_lock_data_page() to handle encryption case directly.
In order to do that, it introduces common f2fs_submit_page_read().

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-07 20:55:51 -07:00
Linus Torvalds
a0725ab0c7 Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the first pull request for 4.14, containing most of the code
  changes. It's a quiet series this round, which I think we needed after
  the churn of the last few series. This contains:

   - Fix for a registration race in loop, from Anton Volkov.

   - Overflow complaint fix from Arnd for DAC960.

   - Series of drbd changes from the usual suspects.

   - Conversion of the stec/skd driver to blk-mq. From Bart.

   - A few BFQ improvements/fixes from Paolo.

   - CFQ improvement from Ritesh, allowing idling for group idle.

   - A few fixes found by Dan's smatch, courtesy of Dan.

   - A warning fixup for a race between changing the IO scheduler and
     device remova. From David Jeffery.

   - A few nbd fixes from Josef.

   - Support for cgroup info in blktrace, from Shaohua.

   - Also from Shaohua, new features in the null_blk driver to allow it
     to actually hold data, among other things.

   - Various corner cases and error handling fixes from Weiping Zhang.

   - Improvements to the IO stats tracking for blk-mq from me. Can
     drastically improve performance for fast devices and/or big
     machines.

   - Series from Christoph removing bi_bdev as being needed for IO
     submission, in preparation for nvme multipathing code.

   - Series from Bart, including various cleanups and fixes for switch
     fall through case complaints"

* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
  kernfs: checking for IS_ERR() instead of NULL
  drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
  drbd: Fix allyesconfig build, fix recent commit
  drbd: switch from kmalloc() to kmalloc_array()
  drbd: abort drbd_start_resync if there is no connection
  drbd: move global variables to drbd namespace and make some static
  drbd: rename "usermode_helper" to "drbd_usermode_helper"
  drbd: fix race between handshake and admin disconnect/down
  drbd: fix potential deadlock when trying to detach during handshake
  drbd: A single dot should be put into a sequence.
  drbd: fix rmmod cleanup, remove _all_ debugfs entries
  drbd: Use setup_timer() instead of init_timer() to simplify the code.
  drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
  drbd: new disk-option disable-write-same
  drbd: Fix resource role for newly created resources in events2
  drbd: mark symbols static where possible
  drbd: Send P_NEG_ACK upon write error in protocol != C
  drbd: add explicit plugging when submitting batches
  drbd: change list_for_each_safe to while(list_first_entry_or_null)
  drbd: introduce drbd_recv_header_maybe_unplug
  ...
2017-09-07 11:59:42 -07:00
Linus Torvalds
ec3604c7a5 Writeback error handling fixes for v4.14
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZrTy3AAoJEAAOaEEZVoIVaucP/ApBAj2S5wzvlV1u6l8E6ae7
 ZeEEZfcWwzRYlKjZAkTWqj9XvGpDGO5gLq4wsZK2edFAq++/MJF8ZVtN4tdZ1kUZ
 DUvRodtVOrT08Kp9wZXGT7JOFrf6U/6gMcR6p0MuWnHndeKYvlpcFi9NPT4EC9/z
 Zm9V7gtlPdSOha7eaSjUS0+vLERkxqXLBW3Av9QUOBP/lbI3lqIroGKeHDYnVdya
 2P/k5EcRRJMyJP6TqyYxmmJl+UWjJFMLvnlUDBslHnD/u3mIUhw3JLHYBjn5dZRE
 Xjq56IDPoXDUvzlBhtn/Uqyx+/wtwsNsylpmKv6K5G1JfdeuSsPVsCey+A1cqV64
 LpE5896wf9TmnmI9LNyh6vDn925xPSGBiF45UEp5f9aO7jXeY0MaEZ8g+ENqFIDK
 v4gtZdS9FhYHV+/l4qEwYMKrqSbwKEs1r1FT+f4wnABby1ojfdA57ZPlp5PV2Vjp
 szTp88Zkb7cMvZwEnWwxWofcJNmgS7uNahvnQF3IJ4ITsioEkuyYR3K4ZQMaaaV9
 wCp6G0FhXZaK3OI7o9WiDwaO2elp9Hxc8bnqKpiBbHZkY0NLh7/++5VxpeNbTHFy
 AGijQiiKGNNyYqNj93wq9jpVdMNjB0pXrHRxfav8v7MtQ+WfbEoAENF4T7hN7iXn
 UuF6eSWEC5O1UCRUk1A+
 =LLY3
 -----END PGP SIGNATURE-----

Merge tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull writeback error handling updates from Jeff Layton:
 "This pile continues the work from last cycle on better tracking
  writeback errors. In v4.13 we added some basic errseq_t infrastructure
  and converted a few filesystems to use it.

  This set continues refining that infrastructure, adds documentation,
  and converts most of the other filesystems to use it. The main
  exception at this point is the NFS client"

* tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  ecryptfs: convert to file_write_and_wait in ->fsync
  mm: remove optimizations based on i_size in mapping writeback waits
  fs: convert a pile of fsync routines to errseq_t based reporting
  gfs2: convert to errseq_t based writeback error reporting for fsync
  fs: convert sync_file_range to use errseq_t based error-tracking
  mm: add file_fdatawait_range and file_write_and_wait
  fuse: convert to errseq_t based error tracking for fsync
  mm: consolidate dax / non-dax checks for writeback
  Documentation: add some docs for errseq_t
  errseq: rename __errseq_set to errseq_set
2017-09-06 14:11:03 -07:00
Jaegeuk Kim
d4c759ee5f f2fs: use generic terms used for encrypted block management
This patch renames functions regarding to buffer management via META_MAPPING
used for encrypted blocks especially. We can actually use them in generic way.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 20:21:48 -07:00
Jaegeuk Kim
1958593e4f f2fs: introduce f2fs_encrypted_file for clean-up
This patch replaces (f2fs_encrypted_inode() && S_ISREG()) with
f2fs_encrypted_file(), which gives no functional change.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 20:11:29 -07:00
Yunlong Song
2afce76a11 Revert "f2fs: add a new function get_ssr_cost"
This reverts commit b7b7c4cf1c.

se->ckpt_valid_blocks will never be smaller than se->valid_blocks, so just
remove get_ssr_cost.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:52:28 -07:00
Arvind Yadav
f62fc9f976 f2fs: constify super_operations
super_operations are not supposed to change at runtime.
"struct super_block" working with super_operations provided
by <linux/fs.h> work with const super_operations. So mark
the non-const structs as const

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:50:24 -07:00
Chao Yu
d3238691ed f2fs: fix to wake up all sleeping flusher
In scenario of remount_ro vs flush, after flush_thread exits in
->remount_fs, flusher will only clean up golbal issue_list, but
without waking up flushers waiting on that list, result in hang
related user threads.

In order to fix this issue, this patch enables the flusher to
take charge of issue_flush thread: executes merged flush command,
and wake up all sleeping flushers.

Fixes: 5eba8c5d1f ("f2fs: fix to access nullified flush_cmd_control pointer")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:50:23 -07:00
Chao Yu
edd748e6c8 f2fs: avoid race in between atomic_read & atomic_inc
Previously, we will miss merging flush command during fsync due to below
race condition:

Thread A 		Thread B		Thread C
- f2fs_issue_flush
 - atomic_read(&issing_flush)
			- f2fs_issue_flush
			 - atomic_read(&issing_flush)
						- f2fs_issue_flush
						 - atomic_read(&issing_flush)
  - atomic_inc(&issing_flush)
			  - atomic_inc(&issing_flush)
						  - atomic_inc(&issing_flush)
   - submit_flush_wait
			   - submit_flush_wait
						   - submit_flush_wait

It needs to use atomic_inc_return instead to avoid such race.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:50:22 -07:00
Chao Yu
025d63a486 f2fs: remove unneeded parameter of change_curseg
allocate_segment_by_default is the only caller of change_curseg passing
@reuse with 'false', but commit 763bfe1bc5 ("f2fs: remove reusing any
prefree segments") removes the calling, after that, @reuse in
change_curseg always be true, so, let's clean up the unneeded parameter.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:50:21 -07:00
Chao Yu
11f5020d2f f2fs: update i_flags correctly
f2fs enables hash-indexed directory by default, so we need to tag
FS_INDEX_FL in inode::i_flags during directory creataion, in order
to show correct status of inode in lsattr:

Before:
------------------- /mnt/f2fs/dir/
After:
-----------I------- /mnt/f2fs/dir/

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:50:21 -07:00
Jaegeuk Kim
ee60523499 f2fs: don't check inode's checksum if it was dirtied or writebacked
If another thread already made the page dirtied or writebacked, we must avoid
to verify checksum. If we got an error, we need to remove its uptodate as well.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:50:11 -07:00
Jaegeuk Kim
a298d57f96 f2fs: don't need to update inode checksum for recovery
This patch fixes "f2fs: support inode checksum".
The recovered inode page will be rewritten with valid checksum.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:49:37 -07:00
Chao Yu
774e1b78a0 f2fs: trigger fdatasync for non-atomic_write file
Sqlite only cares about synchronization of file data instead of other data
unrelated attribute of inode, so in commit flow, call fdatasync is enough.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:05:42 -07:00
Chao Yu
73ac2f4e82 f2fs: fix to avoid race in between aio and gc
We won't wait DIO synchronously when doing AIO, so there will be potential
IO reorder in between AIO and GC, which will cause data corruption.

This patch adds inode_dio_wait to serialize aio and data GC to avoid this
issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:05:42 -07:00
Jaegeuk Kim
01983c715a f2fs: wake up discard_thread iff there is a candidate
This patch fixes to avoid needless wake ups.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:05:33 -07:00
Jaegeuk Kim
adb6dc1971 f2fs: return error when accessing insane flie offset
If file offset is insane, we have to return error instead of kernel panic.

Reported-by: Eric Zhang <followme999@163.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:02:58 -07:00
Chao Yu
0adf6a1b79 f2fs: trigger normal fsync for non-atomic_write file
If file was not opened with atomic write mode, but user uses atomic write
ioctl to fsync datas, in the flow, we should not fsync that file with
atomic write mode.

Fixes: 608514deba ("f2fs: set fsync mark only for the last dnode")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:02:57 -07:00
Chao Yu
84a23fbe96 f2fs: clear FI_HOT_DATA correctly
This patch fixes to clear FI_HOT_DATA correctly in below path:
- error handling in f2fs_ioc_start_atomic_write
- after commit atomic write in f2fs_ioc_commit_atomic_write
- after drop atomic write in drop_inmem_pages

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:02:56 -07:00
Chao Yu
6f890df0a7 f2fs: fix out-of-order execution in f2fs_issue_flush
In f2fs_issue_flush, due to out-of-order execution of CPU, wake_up can
be called before we insert issue_list, result in long latency of
wait_for_completion. Fix this by adding smp_mb() to force the order of
related codes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:02:55 -07:00
Jaegeuk Kim
5f656541ff f2fs: issue discard commands if gc_urgent is set
It's time to issue all the discard commands, if user sets the idle time.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:02:47 -07:00
Christoph Hellwig
74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
Chao Yu
969d1b180d f2fs: introduce discard_granularity sysfs entry
Commit d618ebaf0a ("f2fs: enable small discard by default") enables
f2fs to issue 4K size discard in real-time discard mode. However, issuing
smaller discard may cost more lifetime but releasing less free space in
flash device. Since f2fs has ability of separating hot/cold data and
garbage collection, we can expect that small-sized invalid region would
expand soon with OPU, deletion or garbage collection on valid datas, so
it's better to delay or skip issuing smaller size discards, it could help
to reduce overmuch consumption of IO bandwidth and lifetime of flash
storage.

This patch makes f2fs selectng 64K size as its default minimal
granularity, and issue discard with the size which is not smaller than
minimal granularity. Also it exposes discard granularity as sysfs entry
for configuration in different scenario.

Jaegeuk Kim:
 We must issue all the accumulated discard commands when fstrim is called.
 So, I've added pend_list_tag[] to indicate whether we should issue the
 commands or not. If tag sets P_ACTIVE or P_TRIM, we have to issue them.
 P_TRIM is set once at a time, given fstrim trigger.
 In addition, issue_discard_thread is calling too much due to the number of
 discard commands remaining in the pending list. I added a timer to control
 it likewise gc_thread.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:07 -07:00
Yunlong Song
f24b150a63 f2fs: remove unused function overprovision_sections
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:07 -07:00
Jaegeuk Kim
125c9fb1cc f2fs: check hot_data for roll-forward recovery
We need to check HOT_DATA to truncate any previous data block when doing
roll-forward recovery.

Cc: <stable@vger.kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:06 -07:00
Chao Yu
c56f16dab0 f2fs: add tracepoint for f2fs_gc
This patch adds tracepoint for f2fs_gc.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:05 -07:00
Chao Yu
7f2b4e8ea5 f2fs: retry to revoke atomic commit in -ENOMEM case
During atomic committing, if we encounter -ENOMEM in revoke path, it's
better to give a chance to retry revoking.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:04 -07:00
Jaegeuk Kim
afd2b4da40 f2fs: let fill_super handle roll-forward errors
If we set CP_ERROR_FLAG in roll-forward error, f2fs is no longer to proceed
any IOs due to f2fs_cp_error(). But, for example, if some stale data is involved
on roll-forward process, we're able to get -ENOENT, getting fs stuck.
If we get any error, let fill_super set SBI_NEED_FSCK and try to recover back
to stable point.

Cc: <stable@vger.kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:03 -07:00
Qiuyang Sun
f2220c7f15 f2fs: merge equivalent flags F2FS_GET_BLOCK_[READ|DIO]
Currently, the two flags F2FS_GET_BLOCK_[READ|DIO] are totally equivalent
and can be used interchangably in all scenarios they are involved in.
Neither of the flags is referenced in f2fs_map_blocks(), making them both
the default case. To remove the ambiguity, this patch merges both flags
into F2FS_GET_BLOCK_DEFAULT, and introduces an enum for all distinct flags.

Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:02 -07:00
Chao Yu
4b2414d04e f2fs: support journalled quota
This patch supports to enable f2fs to accept quota information through
mount option:
- {usr,grp,prj}jquota=<quota file path>
- jqfmt=<quota type>

Then, in ->mount flow, we can recover quota file during log replaying,
by this, journelled quota can be supported.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: Fix wrong return values.]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:54:48 -07:00
Chao Yu
b8c502b81e f2fs: fix potential overflow when adjusting GC cycle
While comparing signed and unsigned variables, compiler will converts the
signed value to unsigned one, due to this reason, {in,de}crease_sleep_time
may return overflowed result.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-15 10:40:14 -07:00
Chao Yu
9a20d391cd f2fs: avoid unneeded sync on quota file
We only need to sync quota file with appointed quota type instead of all
types in f2fs_quota_{on,off}.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-15 10:40:13 -07:00
Jaegeuk Kim
d9872a698c f2fs: introduce gc_urgent mode for background GC
This patch adds a sysfs entry to control urgent mode for background GC.
If this is set, background GC thread conducts GC with gc_urgent_sleep_time
all the time.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-15 10:40:12 -07:00
Jaegeuk Kim
3537581a72 f2fs: use IPU for cold files
We expect cold files write data sequentially, but sometimes some of small data
can be updated, which incurs fragmentation.
Let's avoid that.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-15 10:40:11 -07:00
Yunlong Song
008396e1b0 f2fs: fix the size value in __check_sit_bitmap
The current size value is not correct and will miss bitmap check.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-15 10:40:10 -07:00
Chao Yu
b0af6d491a f2fs: add app/fs io stat
This patch enables inner app/fs io stats and introduces below virtual fs
nodes for exposing stats info:
/sys/fs/f2fs/<dev>/iostat_enable
/proc/fs/f2fs/<dev>/iostat_info

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix wrong stat assignment]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-09 21:43:58 -07:00
Yunlong Song
35ee82ca13 f2fs: do not change the valid_block value if cur_valid_map was wrongly set or cleared
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-09 17:45:23 -07:00
Yunlong Song
6415fedc57 f2fs: update cur_valid_map_mir together with cur_valid_map
When cur_valid_map passes the f2fs_test_and_set(,clear)_bit test,
cur_valid_map_mir update is skipped unlikely, so fix it. The fix
now changes the mirror check together with cur_valid_map all the
time.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: Fix unused variable and add unlikely for corner condition.]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-09 17:45:21 -07:00
Jaegeuk Kim
a36c106dff f2fs: use printk_ratelimited for f2fs_msg
This patch reduces contention of printks.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-03 19:09:52 -07:00
Jaegeuk Kim
bf9e697ecd f2fs: expose features to sysfs entry
This patch exposes what features are supported by current f2fs build to sysfs
entry via:

/sys/fs/f2fs/features/
/sys/fs/f2fs/dev/features

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-03 19:09:51 -07:00
Chao Yu
704956ecf5 f2fs: support inode checksum
This patch adds to support inode checksum in f2fs.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix verification flow]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-03 19:09:26 -07:00
Jaegeuk Kim
4f31d26b0c f2fs: return wrong error number on f2fs_quota_write
This must return size, not error number.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-03 19:05:05 -07:00
Yunlong Song
401db79f61 f2fs: provide f2fs_balance_fs to __write_node_page
Let node writeback also do f2fs_balance_fs to ensure there are always enough free
segments.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-03 19:05:04 -07:00
Jeff Layton
3b49c9a1e9 fs: convert a pile of fsync routines to errseq_t based reporting
This patch converts most of the in-kernel filesystems that do writeback
out of the pagecache to report errors using the errseq_t-based
infrastructure that was recently added. This allows them to report
errors once for each open file description.

Most filesystems have a fairly straightforward fsync operation. They
call filemap_write_and_wait_range to write back all of the data and
wait on it, and then (sometimes) sync out the metadata.

For those filesystems this is a straightforward conversion from calling
filemap_write_and_wait_range in their fsync operation to calling
file_write_and_wait_range.

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-08-01 08:39:29 -04:00
Chao Yu
ddc34e328d f2fs: introduce f2fs_statfs_project
This patch introduces f2fs_statfs_project, it enables to show usage
status of directory tree which is limited with project quota.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:36 -07:00
Chao Yu
2c1d030569 f2fs: support F2FS_IOC_FS{GET,SET}XATTR
This patch adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR ioctl interface
support for f2fs. The interface is kept consistent with the one
of ext4/xfs.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:35 -07:00
Jaegeuk Kim
b6a245eb34 f2fs: don't need to wait for node writes for atomic write
We have a node chain to serialize node block writes, so if any IOs for
node block writes are reordered, we'll get broken node chain. IOWs,
roll-forward recovery will see all or none node blocks given fsync
mark.

E.g.,
Node chain consists of:
 N1 -> N2 -> N3 -> NFSYNC -> N1' -> N2' -> N'FSYNC

Reordered to:
1) N1 -> N2 -> N3 -> N2' -> NFSYNC -> N'FSYNC -> power-cut
2) N1 -> N2 -> N3 -> N1' -> NFSYNC -> power-cut
3) N1 -> N2 -> NFSYNC -> N1' -> N'FSYNC -> N3 -> power-cut
4) N1 -> NFSYNC -> N1' -> N2' -> N'FSYNC -> N3 -> power-cut

Roll-forward recovery can proceed to:
1) N1 -> N2 -> N3 -> NFSYNC -> X
2) N1 -> N2 -> N3 -> NFSYNC -> N1' -> X
3) N1 -> N2 -> N3 -> FSYNC -> N1' -> X
4) N1 -> X

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:34 -07:00
Jaegeuk Kim
dc6b205510 f2fs: avoid naming confusion of sysfs init
This patch changes the function names of sysfs init to follow ext4.

f2fs_init_sysfs <-> f2fs_register_sysfs
f2fs_exit_sysfs <-> f2fs_unregister_sysfs

Suggested-by: Chao Yu <yuchao0@huawei.com>
Reivewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:33 -07:00
Chao Yu
5c57132eaf f2fs: support project quota
This patch adds to support plain project quota.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:32 -07:00
Chao Yu
a6d3a479ae f2fs: record quota during dot{,dot} recovery
In ->lookup(), we will have a try to recover dot or dotdot for
corrupted directory, once disk quota is on, if it allocates new
block during dotdot recovery, we need to record disk quota info
for the allocation, so this patch fixes this issue by adding
missing dquot_initialize() in __recover_dot_dentries.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:31 -07:00
Chao Yu
7a2af766af f2fs: enhance on-disk inode structure scalability
This patch add new flag F2FS_EXTRA_ATTR storing in inode.i_inline
to indicate that on-disk structure of current inode is extended.

In order to extend, we changed the inode structure a bit:

Original one:

struct f2fs_inode {
	...
	struct f2fs_extent i_ext;
	__le32 i_addr[DEF_ADDRS_PER_INODE];
	__le32 i_nid[DEF_NIDS_PER_INODE];
}

Extended one:

struct f2fs_inode {
        ...
        struct f2fs_extent i_ext;
	union {
		struct {
			__le16 i_extra_isize;
			__le16 i_padding;
			__le32 i_extra_end[0];
		};
		__le32 i_addr[DEF_ADDRS_PER_INODE];
	};
        __le32 i_nid[DEF_NIDS_PER_INODE];
}

Once F2FS_EXTRA_ATTR is set, we will steal four bytes in the head of
i_addr field for storing i_extra_isize and i_padding. with i_extra_isize,
we can calculate actual size of reserved space in i_addr, available
attribute fields included in total extra attribute fields for current
inode can be described as below:

  +--------------------+
  | .i_mode            |
  | ...                |
  | .i_ext             |
  +--------------------+
  | .i_extra_isize     |-----+
  | .i_padding         |     |
  | .i_prjid           |     |
  | .i_atime_extra     |     |
  | .i_ctime_extra     |     |
  | .i_mtime_extra     |<----+
  | .i_inode_cs        |<----- store blkaddr/inline from here
  | .i_xattr_cs        |
  | ...                |
  +--------------------+
  |                    |
  |    block address   |
  |                    |
  +--------------------+
  | .i_nid             |
  +--------------------+
  |   node_footer      |
  | (nid, ino, offset) |
  +--------------------+

Hence, with this patch, we would enhance scalability of f2fs inode for
storing more newly added attribute.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:30 -07:00
Chao Yu
f247037120 f2fs: make max inline size changeable
This patch tries to make below macros calculating max inline size,
inline dentry field size considerring reserving size-changeable
space:
- MAX_INLINE_DATA
- NR_INLINE_DENTRY
- INLINE_DENTRY_BITMAP_SIZE
- INLINE_RESERVED_SIZE

Then, when inline_{data,dentry} options is enabled, it allows us to
reserve inline space with different size flexibly for adding newly
introduced inode attribute.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:29 -07:00
Jaegeuk Kim
e65ef20781 f2fs: add ioctl to expose current features
This patch adds an ioctl to provide feature information to user.
For exapmle, SQLite can use this ioctl to detect whether f2fs support atomic
write or not.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:28 -07:00
Chao Yu
dc6febb6bc f2fs: make background threads of f2fs being aware of freezing
When ->freeze_fs is called from lvm for doing snapshot, it needs to
make sure there will be no more changes in filesystem's data, however,
previously, background threads like GC thread wasn't aware of freezing,
so in environment with active background threads, data of snapshot
becomes unstable.

This patch fixes this issue by adding sb_{start,end}_intwrite in
below background threads:
- GC thread
- flush thread
- discard thread

Note that, don't use sb_start_intwrite() in gc_thread_func() due to:

generic/241 reports below bug:

 ======================================================
 WARNING: possible circular locking dependency detected
 4.13.0-rc1+ #32 Tainted: G           O
 ------------------------------------------------------
 f2fs_gc-250:0/22186 is trying to acquire lock:
  (&sbi->gc_mutex){+.+...}, at: [<f8fa7f0b>] f2fs_sync_fs+0x7b/0x1b0 [f2fs]

 but task is already holding lock:
  (sb_internal#2){++++.-}, at: [<f8fb5609>] gc_thread_func+0x159/0x4a0 [f2fs]

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #2 (sb_internal#2){++++.-}:
        __lock_acquire+0x405/0x7b0
        lock_acquire+0xae/0x220
        __sb_start_write+0x11d/0x1f0
        f2fs_evict_inode+0x2d6/0x4e0 [f2fs]
        evict+0xa8/0x170
        iput+0x1fb/0x2c0
        f2fs_sync_inode_meta+0x3f/0xf0 [f2fs]
        write_checkpoint+0x1b1/0x750 [f2fs]
        f2fs_sync_fs+0x85/0x1b0 [f2fs]
        f2fs_do_sync_file.isra.24+0x137/0xa30 [f2fs]
        f2fs_sync_file+0x34/0x40 [f2fs]
        vfs_fsync_range+0x4a/0xa0
        do_fsync+0x3c/0x60
        SyS_fdatasync+0x15/0x20
        do_fast_syscall_32+0xa1/0x1b0
        entry_SYSENTER_32+0x4c/0x7b

 -> #1 (&sbi->cp_mutex){+.+...}:
        __lock_acquire+0x405/0x7b0
        lock_acquire+0xae/0x220
        __mutex_lock+0x4f/0x830
        mutex_lock_nested+0x25/0x30
        write_checkpoint+0x2f/0x750 [f2fs]
        f2fs_sync_fs+0x85/0x1b0 [f2fs]
        sync_filesystem+0x67/0x80
        generic_shutdown_super+0x27/0x100
        kill_block_super+0x22/0x50
        kill_f2fs_super+0x3a/0x40 [f2fs]
        deactivate_locked_super+0x3d/0x70
        deactivate_super+0x40/0x60
        cleanup_mnt+0x39/0x70
        __cleanup_mnt+0x10/0x20
        task_work_run+0x69/0x80
        exit_to_usermode_loop+0x57/0x92
        do_fast_syscall_32+0x18c/0x1b0
        entry_SYSENTER_32+0x4c/0x7b

 -> #0 (&sbi->gc_mutex){+.+...}:
        validate_chain.isra.36+0xc50/0xdb0
        __lock_acquire+0x405/0x7b0
        lock_acquire+0xae/0x220
        __mutex_lock+0x4f/0x830
        mutex_lock_nested+0x25/0x30
        f2fs_sync_fs+0x7b/0x1b0 [f2fs]
        f2fs_balance_fs_bg+0xb9/0x200 [f2fs]
        gc_thread_func+0x302/0x4a0 [f2fs]
        kthread+0xe9/0x120
        ret_from_fork+0x19/0x24

 other info that might help us debug this:

 Chain exists of:
   &sbi->gc_mutex --> &sbi->cp_mutex --> sb_internal#2

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(sb_internal#2);
                                lock(&sbi->cp_mutex);
                                lock(sb_internal#2);
   lock(&sbi->gc_mutex);

  *** DEADLOCK ***

 1 lock held by f2fs_gc-250:0/22186:
  #0:  (sb_internal#2){++++.-}, at: [<f8fb5609>] gc_thread_func+0x159/0x4a0 [f2fs]

 stack backtrace:
 CPU: 2 PID: 22186 Comm: f2fs_gc-250:0 Tainted: G           O    4.13.0-rc1+ #32
 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 Call Trace:
  dump_stack+0x5f/0x92
  print_circular_bug+0x1b3/0x1bd
  validate_chain.isra.36+0xc50/0xdb0
  ? __this_cpu_preempt_check+0xf/0x20
  __lock_acquire+0x405/0x7b0
  lock_acquire+0xae/0x220
  ? f2fs_sync_fs+0x7b/0x1b0 [f2fs]
  __mutex_lock+0x4f/0x830
  ? f2fs_sync_fs+0x7b/0x1b0 [f2fs]
  mutex_lock_nested+0x25/0x30
  ? f2fs_sync_fs+0x7b/0x1b0 [f2fs]
  f2fs_sync_fs+0x7b/0x1b0 [f2fs]
  f2fs_balance_fs_bg+0xb9/0x200 [f2fs]
  gc_thread_func+0x302/0x4a0 [f2fs]
  ? preempt_schedule_common+0x2f/0x4d
  ? f2fs_gc+0x540/0x540 [f2fs]
  kthread+0xe9/0x120
  ? f2fs_gc+0x540/0x540 [f2fs]
  ? kthread_create_on_node+0x30/0x30
  ret_from_fork+0x19/0x24

The deadlock occurs in below condition:
GC Thread			Thread B
- sb_start_intwrite
				- f2fs_sync_file
				 - f2fs_sync_fs
				  - mutex_lock(&sbi->gc_mutex)
				   - write_checkpoint
				    - block_operations
				     - f2fs_sync_inode_meta
				      - iput
				       - sb_start_intwrite
 - mutex_lock(&sbi->gc_mutex)

Fix this by altering sb_start_intwrite to sb_start_write_trylock.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:47:27 -07:00
Jaegeuk Kim
7a10f0177e f2fs: don't give partially written atomic data from process crash
This patch resolves the below scenario.

== Process 1 ==     == Process 2 ==
open(w)             open(rw)
begin
write(new_#1)
process_crash
  f_op->flush
  locks_remove_posix
  f_op>release
                    read (new_#1)

In order to avoid corrupted database caused by new_#1, we must do roll-back
at process_crash time. In order to check that, this patch keeps task which
triggers transaction begin, and does roll-back in f_op->flush before removing
file locks.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-28 17:49:01 -07:00
Jaegeuk Kim
640cc18982 f2fs: give a try to do atomic write in -ENOMEM case
It'd be better to retry writing atomic pages when we get -ENOMEM.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-28 17:49:00 -07:00
Ernesto A. Fernández
14af20fcb1 f2fs: preserve i_mode if __f2fs_set_acl() fails
When changing a file's acl mask, __f2fs_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.

Prevent this by only changing the inode mode after the acl has been set.

Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-28 17:48:54 -07:00
Yunlei He
8790568255 f2fs: alloc new nids for xattr block in recovery
recovery file A:			recovery file B:
	-get_dnode_of_data
		-alloc_nid
					-recover_xattr_data
						-set_node_addr(sbi, &ni, NEW_ADDR, false);
							--->bug_on for nid has been used by file A

In recovery process, new allocated node blocks may "reuse" xattr block
nids, this patch alloc new nids for xattr blocks in recovery process to
avoid this problem.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-26 19:34:30 -07:00
Chao Yu
76a9dd85d4 f2fs: spread struct f2fs_dentry_ptr for inline path
Use f2fs_dentry_ptr structure to indicate inline dentry structure as
much as possible, so we can wrap inline dentry with size-fixed fields
to the one with size-changeable fields. With this change, we can
handle size-changeable inline dentry more easily.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-26 19:34:30 -07:00
Yunlei He
5f4ce6abc2 f2fs: remove unused input parameter
This patch remove unused input parameter in function
new_node_page.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Yong Sheng <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-26 19:34:30 -07:00
Jaegeuk Kim
4db08d016c f2fs: avoid cpu lockup
Before retrying to flush data or dentry pages, we need to release cpu in order
to prevent watchdog.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-17 19:23:18 -07:00
Jaegeuk Kim
299aa41a45 f2fs: include seq_file.h for sysfs.c
This patch includes seq_file.h to avoid compile error.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-17 19:23:12 -07:00
Jaegeuk Kim
c925dc162f f2fs: Don't clear SGID when inheriting ACLs
This patch copies commit b7f8a09f80:
"btrfs: Don't clear SGID when inheriting ACLs" written by Jan.

Fixes: 073931017b
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-15 21:10:23 -07:00
Luis Henriques
5ffff2854a f2fs: remove extra inode_unlock() in error path
This commit removes an extra inode_unlock() that is being done in function
f2fs_ioc_setflags error path.  While there, get rid of a useless 'out'
label as well.

Fixes: 0abd675e97 ("f2fs: support plain user/group quota")
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-15 21:10:23 -07:00
Linus Torvalds
5cdd4c0468 for-f2fs-4.13
In this round, we've added new features such as disk quota and statx, and
 modified internal bio management flow to merge more IOs depending on block
 types. We've also made internal threads freezeable for Android battery life.
 In addition to them, there are some patches to avoid lock contention as well
 as a couple of deadlock conditions.
 
 = Enhancement
 - support usrquota, grpquota, and statx
 - manage DATA/NODE typed bios separately to serialize more IOs
 - modify f2fs_lock_op/wio_mutex to avoid lock contention
 - prevent lock contention in migratepage
 
 = Bug fix
 - miss to load written inode flag
 - fix worst case victim selection in GC
 - freezeable GC and discard threads for Android battery life
 - sanitize f2fs metadata to deal with security hole
 - clean up sysfs-related code and docs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAllj6fMACgkQQBSofoJI
 UNJ6Ng/+PqdGV/b6KroYIXI/scFx/1t87/0W+rY9tyLr1jX7nIHn9KLPjeDdvdlk
 5vEeZ/dGfW8wSI+ESzscvKberG2QlOPwJRyTB4jWR+bLatwzg7YjEblz+RX4/wfJ
 jKjnR7M//gRdhHdqA0xXrqguAjPbcEDK2RiVbhioMjWbZ/77j0IjcRokjMYdEf0m
 cJc2oMXFtlo+DJ1h9/8BmwQPTI9FfVdgbkPFTTJzV0ydQnBdxcAigrzwYZhPOVv0
 n2M1dKOiQewB4OADMuepZLFqJheItlgG9wlvEjGq7zTd5epHXRIqhM6h9GikQVb9
 YKAkajlKfWcwEXaEcVXtsMHC9x69Yf8xxOSQ1VrhypSUNbaynC9LDsErJx6yrF3P
 XC5baiqXsd/btg7tfrHJjk3gI+ck97d6TrTfUVR91X+1Tpkz7cyB226WxFKbyOG3
 EYCFVMbrIN2CaHHt1xWIT2zCfX5w9ycp8kFjY6jPi0OOZrKXpFw+1AwwTu9kn4xJ
 iuUc8pmc0/FyPqokmLef4Qp/RRM83+f+nzW/y//lkEf3nMn6qlHzNI1RAxXnBvGV
 DMXzuJDcJcHGcSDr7mWyKkm6gYcak/E4DdQLQqJ6VCt6KCdCEXP/XDlig5ey5ODY
 uGEr1QhXIpiYAON45HUi3gmytB3J3ZdzzpsG1PEco4+hjSuFhyE=
 =N4GZ
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've added new features such as disk quota and statx,
  and modified internal bio management flow to merge more IOs depending
  on block types. We've also made internal threads freezeable for
  Android battery life. In addition to them, there are some patches to
  avoid lock contention as well as a couple of deadlock conditions.

  Enhancements:
   - support usrquota, grpquota, and statx
   - manage DATA/NODE typed bios separately to serialize more IOs
   - modify f2fs_lock_op/wio_mutex to avoid lock contention
   - prevent lock contention in migratepage

  Bug fixes:
   - fix missing load of written inode flag
   - fix worst case victim selection in GC
   - freezeable GC and discard threads for Android battery life
   - sanitize f2fs metadata to deal with security hole
   - clean up sysfs-related code and docs"

* tag 'for-f2fs-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (59 commits)
  f2fs: support plain user/group quota
  f2fs: avoid deadlock caused by lock order of page and lock_op
  f2fs: use spin_{,un}lock_irq{save,restore}
  f2fs: relax migratepage for atomic written page
  f2fs: don't count inode block in in-memory inode.i_blocks
  Revert "f2fs: fix to clean previous mount option when remount_fs"
  f2fs: do not set LOST_PINO for renamed dir
  f2fs: do not set LOST_PINO for newly created dir
  f2fs: skip ->writepages for {mete,node}_inode during recovery
  f2fs: introduce __check_sit_bitmap
  f2fs: stop gc/discard thread in prior during umount
  f2fs: introduce reserved_blocks in sysfs
  f2fs: avoid redundant f2fs_flush after remount
  f2fs: report # of free inodes more precisely
  f2fs: add ioctl to do gc with target block address
  f2fs: don't need to check encrypted inode for partial truncation
  f2fs: measure inode.i_blocks as generic filesystem
  f2fs: set CP_TRIMMED_FLAG correctly
  f2fs: require key for truncate(2) of encrypted file
  f2fs: move sysfs code from super.c to fs/f2fs/sysfs.c
  ...
2017-07-10 14:29:45 -07:00
Chao Yu
0abd675e97 f2fs: support plain user/group quota
This patch adds to support plain user/group quota.

Change Note by Jaegeuk Kim.

- Use f2fs page cache for quota files in order to consider garbage collection.
  so, quota files are not tolerable for sudden power-cuts, so user needs to do
  quotacheck.

- setattr() calls dquot_transfer which will transfer inode->i_blocks.
  We can't reclaim that during f2fs_evict_inode(). So, we need to count
  node blocks as well in order to match i_blocks with dquot's space.

  Note that, Chao wrote a patch to count inode->i_blocks without inode block.
  (f2fs: don't count inode block in in-memory inode.i_blocks)

- in f2fs_remount, we need to make RW in prior to dquot_resume.

- handle fault_injection case during f2fs_quota_off_umount

- TODO: Project quota

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-08 23:12:27 -07:00
Jaegeuk Kim
d29460e5cf f2fs: avoid deadlock caused by lock order of page and lock_op
- punch_hole
 - fill_zero
  - f2fs_lock_op
  - get_new_data_page
   - lock_page

- f2fs_write_data_pages
 - lock_page
 - do_write_data_page
  - f2fs_lock_op

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 11:07:59 -07:00
Chao Yu
d1aa245354 f2fs: use spin_{,un}lock_irq{save,restore}
generic/361 reports below warning, this is because: once, there is
someone entering into critical region of sbi.cp_lock, if write_end_io.
f2fs_stop_checkpoint is invoked from an triggered IRQ, we will encounter
deadlock.

So this patch changes to use spin_{,un}lock_irq{save,restore} to create
critical region without IRQ enabled to avoid potential deadlock.

 irq event stamp: 83391573
 loop: Write error at byte offset 438729728, length 1024.
 hardirqs last  enabled at (83391573): [<c1809752>] restore_all+0xf/0x65
 hardirqs last disabled at (83391572): [<c1809eac>] reschedule_interrupt+0x30/0x3c
 loop: Write error at byte offset 438860288, length 1536.
 softirqs last  enabled at (83389244): [<c180cc4e>] __do_softirq+0x1ae/0x476
 softirqs last disabled at (83389237): [<c101ca7c>] do_softirq_own_stack+0x2c/0x40
 loop: Write error at byte offset 438990848, length 2048.
 ================================
 WARNING: inconsistent lock state
 4.12.0-rc2+ #30 Tainted: G           O
 --------------------------------
 inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
 xfs_io/7959 [HC1[1]:SC0[0]:HE0:SE1] takes:
  (&(&sbi->cp_lock)->rlock){?.+...}, at: [<f96f96cc>] f2fs_stop_checkpoint+0x1c/0x50 [f2fs]
 {HARDIRQ-ON-W} state was registered at:
   __lock_acquire+0x527/0x7b0
   lock_acquire+0xae/0x220
   _raw_spin_lock+0x42/0x50
   do_checkpoint+0x165/0x9e0 [f2fs]
   write_checkpoint+0x33f/0x740 [f2fs]
   __f2fs_sync_fs+0x92/0x1f0 [f2fs]
   f2fs_sync_fs+0x12/0x20 [f2fs]
   sync_filesystem+0x67/0x80
   generic_shutdown_super+0x27/0x100
   kill_block_super+0x22/0x50
   kill_f2fs_super+0x3a/0x40 [f2fs]
   deactivate_locked_super+0x3d/0x70
   deactivate_super+0x40/0x60
   cleanup_mnt+0x39/0x70
   __cleanup_mnt+0x10/0x20
   task_work_run+0x69/0x80
   exit_to_usermode_loop+0x57/0x85
   do_fast_syscall_32+0x18c/0x1b0
   entry_SYSENTER_32+0x4c/0x7b
 irq event stamp: 1957420
 hardirqs last  enabled at (1957419): [<c1808f37>] _raw_spin_unlock_irq+0x27/0x50
 hardirqs last disabled at (1957420): [<c1809f9c>] call_function_single_interrupt+0x30/0x3c
 softirqs last  enabled at (1953784): [<c180cc4e>] __do_softirq+0x1ae/0x476
 softirqs last disabled at (1953773): [<c101ca7c>] do_softirq_own_stack+0x2c/0x40

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&(&sbi->cp_lock)->rlock);
   <Interrupt>
     lock(&(&sbi->cp_lock)->rlock);

  *** DEADLOCK ***

 2 locks held by xfs_io/7959:
  #0:  (sb_writers#13){.+.+.+}, at: [<c11fd7ca>] vfs_write+0x16a/0x190
  #1:  (&sb->s_type->i_mutex_key#16){+.+.+.}, at: [<f96e33f5>] f2fs_file_write_iter+0x25/0x140 [f2fs]

 stack backtrace:
 CPU: 2 PID: 7959 Comm: xfs_io Tainted: G           O    4.12.0-rc2+ #30
 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 Call Trace:
  dump_stack+0x5f/0x92
  print_usage_bug+0x1d3/0x1dd
  ? check_usage_backwards+0xe0/0xe0
  mark_lock+0x23d/0x280
  __lock_acquire+0x699/0x7b0
  ? __this_cpu_preempt_check+0xf/0x20
  ? trace_hardirqs_off_caller+0x91/0xe0
  lock_acquire+0xae/0x220
  ? f2fs_stop_checkpoint+0x1c/0x50 [f2fs]
  _raw_spin_lock+0x42/0x50
  ? f2fs_stop_checkpoint+0x1c/0x50 [f2fs]
  f2fs_stop_checkpoint+0x1c/0x50 [f2fs]
  f2fs_write_end_io+0x147/0x150 [f2fs]
  bio_endio+0x7a/0x1e0
  blk_update_request+0xad/0x410
  blk_mq_end_request+0x16/0x60
  lo_complete_rq+0x3c/0x70
  __blk_mq_complete_request_remote+0x11/0x20
  flush_smp_call_function_queue+0x6d/0x120
  ? debug_smp_processor_id+0x12/0x20
  generic_smp_call_function_single_interrupt+0x12/0x30
  smp_call_function_single_interrupt+0x25/0x40
  call_function_single_interrupt+0x37/0x3c
 EIP: _raw_spin_unlock_irq+0x2d/0x50
 EFLAGS: 00000296 CPU: 2
 EAX: 00000001 EBX: d2ccc51c ECX: 00000001 EDX: c1aacebd
 ESI: 00000000 EDI: 00000000 EBP: c96c9d1c ESP: c96c9d18
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
  ? inherit_task_group.isra.98.part.99+0x6b/0xb0
  __add_to_page_cache_locked+0x1d4/0x290
  add_to_page_cache_lru+0x38/0xb0
  pagecache_get_page+0x8e/0x200
  f2fs_write_begin+0x96/0xf00 [f2fs]
  ? trace_hardirqs_on_caller+0xdd/0x1c0
  ? current_time+0x17/0x50
  ? trace_hardirqs_on+0xb/0x10
  generic_perform_write+0xa9/0x170
  __generic_file_write_iter+0x1a2/0x1f0
  ? f2fs_preallocate_blocks+0x137/0x160 [f2fs]
  f2fs_file_write_iter+0x6e/0x140 [f2fs]
  ? __lock_acquire+0x429/0x7b0
  __vfs_write+0xc1/0x140
  vfs_write+0x9b/0x190
  SyS_pwrite64+0x63/0xa0
  do_fast_syscall_32+0xa1/0x1b0
  entry_SYSENTER_32+0x4c/0x7b
 EIP: 0xb7786c61
 EFLAGS: 00000293 CPU: 2
 EAX: ffffffda EBX: 00000003 ECX: 08416000 EDX: 00001000
 ESI: 18b24000 EDI: 00000000 EBP: 00000003 ESP: bf9b36b0
  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b

Fixes: aaec2b1d18 ("f2fs: introduce cp_lock to protect updating of ckpt_flags")
Cc: stable@vger.kernel.org
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:49 -07:00
Jaegeuk Kim
ff1048e7df f2fs: relax migratepage for atomic written page
In order to avoid lock contention for atomic written pages, we'd better give
EBUSY in f2fs_migrate_page when mode is asynchronous. We expect it will be
released soon as transaction commits.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:48 -07:00
Chao Yu
000519f278 f2fs: don't count inode block in in-memory inode.i_blocks
Previously, we count all inode consumed blocks including inode block,
xattr block, index block, data block into i_blocks, for other generic
filesystems, they won't count inode block into i_blocks, so for
userspace applications or quota system, they may detect incorrect block
count according to i_blocks value in inode.

This patch changes to count all blocks into inode.i_blocks excluding
inode block, for on-disk i_blocks, we keep counting inode block for
backward compatibility.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:47 -07:00
Chao Yu
6ac851ba89 Revert "f2fs: fix to clean previous mount option when remount_fs"
Don't clear old mount option before parse new option during ->remount_fs
like other generic filesystems.

This reverts commit 26666c8a43.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:46 -07:00
Sheng Yong
b855bf0e16 f2fs: do not set LOST_PINO for renamed dir
After renaming a directory, fsck could detect unmatched pino. The scenario
can be reproduced as the following:

	$ mkdir /bar/subbar /foo
	$ rename /bar/subbar /foo

Then fsck will report:
[ASSERT] (__chk_dots_dentries:1182)  --> Bad inode number[0x3] for '..', parent parent ino is [0x4]

Rename sets LOST_PINO for old_inode. However, the flag cannot be cleared,
since dir is written back with CP. So, let's get rid of LOST_PINO for a
renamed dir and fix the pino directly at the end of rename.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:45 -07:00
Sheng Yong
d58dfb7505 f2fs: do not set LOST_PINO for newly created dir
Since directories will be written back with checkpoint and fsync a
directory will always write CP, there is no need to set LOST_PINO
after creating a directory.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:45 -07:00
Chao Yu
0771fcc71c f2fs: skip ->writepages for {mete,node}_inode during recovery
Skip ->writepages in prior to ->writepage for {meta,node}_inode during
recovery, hence unneeded loop in ->writepages can be avoided.

Moreover, check SBI_POR_DOING earlier while writebacking pages.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:44 -07:00
Chao Yu
6915ea9d8d f2fs: introduce __check_sit_bitmap
After we introduce discard thread, discard command can be issued
concurrently with data allocating, this patch adds new function to
heck sit bitmap to ensure that userdata was invalid in which on-going
discard command covered.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:43 -07:00
Chao Yu
cce1325247 f2fs: stop gc/discard thread in prior during umount
This patch resolves kernel panic for xfstests/081, caused by recent f2fs_bug_on

 f2fs: add f2fs_bug_on in __remove_discard_cmd

For fixing, we will stop gc/discard thread in prior in ->kill_sb in order to
avoid referring and releasing race among them.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:42 -07:00
Chao Yu
daeb433e42 f2fs: introduce reserved_blocks in sysfs
In this patch, we add a new sysfs interface, with it, we can control
number of reserved blocks in system which could not be used by user,
it enable f2fs to let user to configure for adjusting over-provision
ratio dynamically instead of changing it by mkfs.

So we can expect it will help to reserve more free space for relieving
GC in both filesystem and flash device.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:41 -07:00
Yunlong Song
d871cd046f f2fs: avoid redundant f2fs_flush after remount
create_flush_cmd_control will create redundant issue_flush_thread after each
remount with flush_merge option.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:41 -07:00
Jaegeuk Kim
0cc091d0c8 f2fs: report # of free inodes more precisely
If the partition is small, we don't need to report total # of inodes including
hidden free nodes.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:40 -07:00
Jaegeuk Kim
34dc77ad74 f2fs: add ioctl to do gc with target block address
This patch adds f2fs_ioc_gc_range() to move blocks located in the given
range.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:50 -07:00
Jaegeuk Kim
a9bcf9bcd0 f2fs: don't need to check encrypted inode for partial truncation
The cache_only is always false, if inode is encrypted.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:49 -07:00
Chao Yu
0eb0adadf2 f2fs: measure inode.i_blocks as generic filesystem
Both in memory or on disk, generic filesystems record i_blocks with
512bytes sized sector count, also VFS sub module such as disk quota
follows this rule, but f2fs records it with 4096bytes sized block
count, this difference leads to that once we use dquota's function
which inc/dec iblocks, it will make i_blocks of f2fs being inconsistent
between in memory and on disk.

In order to resolve this issue, this patch changes to make in-memory
i_blocks of f2fs recording sector count instead of block count,
meanwhile leaving on-disk i_blocks recording block count.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:48 -07:00
Chao Yu
663f387b71 f2fs: set CP_TRIMMED_FLAG correctly
Don't set CP_TRIMMED_FLAG for non-zoned block device or discard
unsupported device, it can avoid to trigger unneeded checkpoint for
that kind of device.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:47 -07:00
Eric Biggers
67773a1fbd f2fs: require key for truncate(2) of encrypted file
Currently, filesystems allow truncate(2) on an encrypted file without
the encryption key.  However, it's impossible to correctly handle the
case where the size being truncated to is not a multiple of the
filesystem block size, because that would require decrypting the final
block, zeroing the part beyond i_size, then encrypting the block.

As other modifications to encrypted file contents are prohibited without
the key, just prohibit truncate(2) as well, making it fail with ENOKEY.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:46 -07:00
Chao Yu
8ceffcb29e f2fs: move sysfs code from super.c to fs/f2fs/sysfs.c
Codes related to sysfs and procfs are dispersive and mixed with sb
related codes, but actually these codes are independent from others,
so split them from super.c, and reorgnize and manger them in sysfs.c.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:45 -07:00
Chao Yu
a398101aa1 f2fs: clean up sysfs codes
Just cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:45 -07:00
Chao Yu
1727f31721 f2fs: fix wrong error number of fill_super
This patch fixes incorrect error number in error path of fill_super.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:43 -07:00
Chao Yu
44529f8975 f2fs: fix to show injection rate in ->show_options
If fault injection functionality is enabled, show additional injection
rate in ->show_options.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:41 -07:00
Christophe JAILLET
b63def9112 f2fs: Fix a return value in case of error in 'f2fs_fill_super'
err must be set to -ENOMEM, otherwise we return 0.

Fixes: a912b54d3a ("f2fs: split bio cache")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:40 -07:00
Tiezhu Yang
a005774c8d f2fs: use proper variable name
It is better to use variable name "inline_dentry" instead of "dentry_blk"
when data type is "struct f2fs_inline_dentry". This patch has no functional
changes, just to make code more readable especially when call the function
make_dentry_ptr_inline() and f2fs_convert_inline_dir().

Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:40 -07:00
Chao Yu
1f258ec13b f2fs: fix to avoid panic when encountering corrupt node
With fault_injection option, generic/361 of fstests will complain us
with below message:

Call Trace:
 get_node_page+0x12/0x20 [f2fs]
 f2fs_iget+0x92/0x7d0 [f2fs]
 f2fs_fill_super+0x10fb/0x15e0 [f2fs]
 mount_bdev+0x184/0x1c0
 f2fs_mount+0x15/0x20 [f2fs]
 mount_fs+0x39/0x150
 vfs_kern_mount+0x67/0x110
 do_mount+0x1bb/0xc70
 SyS_mount+0x83/0xd0
 do_syscall_64+0x6e/0x160
 entry_SYSCALL64_slow_path+0x25/0x25

Since mkfs loop device in f2fs partition can be failed silently due to
checkpoint error injection, so root inode page can be corrupted, in order
to avoid needless panic, in get_node_page, it's better to leave message
and return error to caller, and let fsck repaire it later.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:39 -07:00
Chao Yu
febeca6d37 f2fs: don't track newly allocated nat entry in list
We will never persist newly allocated nat entries during checkpoint(), so
we don't need to track such nat entries in nat dirty list in order to
avoid:
- more latency during traversing dirty list;
- sorting nat sets incorrectly due to recording wrong entry_cnt in nat
entry set.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:38 -07:00
Chao Yu
d9703d9097 f2fs: add f2fs_bug_on in __remove_discard_cmd
Recently, discard related codes have changed a lot, so add f2fs_bug_on to
detect potential bug.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:37 -07:00
Chao Yu
2a510c005c f2fs: introduce __wait_one_discard_bio
In order to avoid copied codes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:36 -07:00
Qiuyang Sun
5a3a2d83cd f2fs: dax: fix races between page faults and truncating pages
Currently in F2FS, page faults and operations that truncate the pagecahe
or data blocks, are completely unsynchronized. This can result in page
fault faulting in a page into a range that we are changing after
truncating, and thus we can end up with a page mapped to disk blocks that
will be shortly freed. Filesystem corruption will shortly follow.

This patch fixes the problem by creating new rw semaphore i_mmap_sem in
f2fs_inode_info and grab it for functions removing blocks from extent tree
and for read over page faults. The mechanism is similar to that in ext4.

Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:35 -07:00
Fan Li
72fdbe2efe f2fs: simplify the way of calulating next nat address
The index of segment which the next nat block is in has only one different
bit than the current one, so to get the next nat address, we can simply
alter that one bit.

Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:34 -07:00
Jin Qian
21d3f8e1c3 f2fs: sanity check size of nat and sit cache
Make sure number of entires doesn't exceed max journal size.

Cc: stable@vger.kernel.org
Signed-off-by: Jin Qian <jinqian@android.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:34 -07:00
Yunlei He
d4fdf8ba0e f2fs: fix a panic caused by NULL flush_cmd_control
Mount fs with option noflush_merge, boot failed for illegal address
fcc in function f2fs_issue_flush:

        if (!test_opt(sbi, FLUSH_MERGE)) {
                ret = submit_flush_wait(sbi);
                atomic_inc(&fcc->issued_flush);   ->  Here, fcc illegal
                return ret;
        }

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:33 -07:00
Zhang Shengju
68390dd9bd f2fs: remove the unnecessary cast for PTR_ERR
It's not necessary to specify 'int' casting for PTR_ERR.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:32 -07:00
Jaegeuk Kim
d8c4256c17 f2fs: remove false-positive bug_on
For example,

f2fs_create
 - new_node_page is failed
 - handle_failed_inode
  - skip to add it into orphan list, since ni.blk_addr == NULL_ADDR
   : set_inode_flag(inode, FI_FREE_NID)

f2fs_evict_inode
 - EIO due to fault injection
 - f2fs_bug_on() is triggered

So, we don't need to call f2fs_bug_on in this case.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:31 -07:00
Damien Le Moal
acfd2810c7 f2fs: Do not issue small discards in LFS mode
clear_prefree_segments() issues small discards after discarding full
segments. These small discards may not be section aligned, so not zone
aligned on a zoned block device, causing __f2fs_iissue_discard_zone() to fail.
Fix this by not issuing small discards for a volume mounted with the BLKZONED
feature enabled.

Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:10:42 -07:00
Linus Torvalds
c6b1e36c8f Merge branch 'for-4.13/block' of git://git.kernel.dk/linux-block
Pull core block/IO updates from Jens Axboe:
 "This is the main pull request for the block layer for 4.13. Not a huge
  round in terms of features, but there's a lot of churn related to some
  core cleanups.

  Note this depends on the UUID tree pull request, that Christoph
  already sent out.

  This pull request contains:

   - A series from Christoph, unifying the error/stats codes in the
     block layer. We now use blk_status_t everywhere, instead of using
     different schemes for different places.

   - Also from Christoph, some cleanups around request allocation and IO
     scheduler interactions in blk-mq.

   - And yet another series from Christoph, cleaning up how we handle
     and do bounce buffering in the block layer.

   - A blk-mq debugfs series from Bart, further improving on the support
     we have for exporting internal information to aid debugging IO
     hangs or stalls.

   - Also from Bart, a series that cleans up the request initialization
     differences across types of devices.

   - A series from Goldwyn Rodrigues, allowing the block layer to return
     failure if we will block and the user asked for non-blocking.

   - Patch from Hannes for supporting setting loop devices block size to
     that of the underlying device.

   - Two series of patches from Javier, fixing various issues with
     lightnvm, particular around pblk.

   - A series from me, adding support for write hints. This comes with
     NVMe support as well, so applications can help guide data placement
     on flash to improve performance, latencies, and write
     amplification.

   - A series from Ming, improving and hardening blk-mq support for
     stopping/starting and quiescing hardware queues.

   - Two pull requests for NVMe updates. Nothing major on the feature
     side, but lots of cleanups and bug fixes. From the usual crew.

   - A series from Neil Brown, greatly improving the bio rescue set
     support. Most notably, this kills the bio rescue work queues, if we
     don't really need them.

   - Lots of other little bug fixes that are all over the place"

* 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
  lightnvm: pblk: set line bitmap check under debug
  lightnvm: pblk: verify that cache read is still valid
  lightnvm: pblk: add initialization check
  lightnvm: pblk: remove target using async. I/Os
  lightnvm: pblk: use vmalloc for GC data buffer
  lightnvm: pblk: use right metadata buffer for recovery
  lightnvm: pblk: schedule if data is not ready
  lightnvm: pblk: remove unused return variable
  lightnvm: pblk: fix double-free on pblk init
  lightnvm: pblk: fix bad le64 assignations
  nvme: Makefile: remove dead build rule
  blk-mq: map all HWQ also in hyperthreaded system
  nvmet-rdma: register ib_client to not deadlock in device removal
  nvme_fc: fix error recovery on link down.
  nvmet_fc: fix crashes on bad opcodes
  nvme_fc: Fix crash when nvme controller connection fails.
  nvme_fc: replace ioabort msleep loop with completion
  nvme_fc: fix double calls to nvme_cleanup_cmd()
  nvme-fabrics: verify that a controller returns the correct NQN
  nvme: simplify nvme_dev_attrs_are_visible
  ...
2017-07-03 10:34:51 -07:00
Linus Torvalds
81e3e04489 UUID/GUID updates:
- introduce the new uuid_t/guid_t types that are going to replace
    the somewhat confusing uuid_be/uuid_le types and make the terminology
    fit the various specs, as well as the userspace libuuid library.
    (me, based on a previous version from Amir)
  - consolidated generic uuid/guid helper functions lifted from XFS
    and libnvdimm (Amir and me)
  - conversions to the new types and helpers (Amir, Andy and me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAllZfmILHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMvyg/9EvWHOOsSdeDykCK3KdH2uIqnxwpl+m7ljccaGJIc
 MmaH0KnsP9p/Cuw5hESh2tYlmCYN7pmYziNXpf/LRS65/HpEYbs4oMqo8UQsN0UM
 2IXHfXY0HnCoG5OixH8RNbFTkxuGphsTY8meaiDr6aAmqChDQI2yGgQLo3WM2/Qe
 R9N1KoBWH/bqY6dHv+urlFwtsREm2fBH+8ovVma3TO73uZCzJGLJBWy3anmZN+08
 uYfdbLSyRN0T8rqemVdzsZ2SrpHYkIsYGUZV43F581vp8e/3OKMoMxpWRRd9fEsa
 MXmoaHcLJoBsyVSFR9lcx3axKrhAgBPZljASbbA0h49JneWXrzghnKBQZG2SnEdA
 ktHQ2sE4Yb5TZSvvWEKMQa3kXhEfIbTwgvbHpcDr5BUZX8WvEw2Zq8e7+Mi4+KJw
 QkvFC1S96tRYO2bxdJX638uSesGUhSidb+hJ/edaOCB/GK+sLhUdDTJgwDpUGmyA
 xVXTF51ramRS2vhlbzN79x9g33igIoNnG4/PV0FPvpCTSqxkHmPc5mK6Vals1lqt
 cW6XfUjSQECq5nmTBtYDTbA/T+8HhBgSQnrrvmferjJzZUFGr/7MXl+Evz2x4CjX
 OBQoAMu241w6Vp3zoXqxzv+muZ/NLar52M/zbi9TUjE0GvvRNkHvgCC4NmpIlWYJ
 Sxg=
 =J/4P
 -----END PGP SIGNATURE-----

Merge tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid

Pull uuid subsystem from Christoph Hellwig:
 "This is the new uuid subsystem, in which Amir, Andy and I have started
  consolidating our uuid/guid helpers and improving the types used for
  them. Note that various other subsystems have pulled in this tree, so
  I'd like it to go in early.

  UUID/GUID summary:

   - introduce the new uuid_t/guid_t types that are going to replace the
     somewhat confusing uuid_be/uuid_le types and make the terminology
     fit the various specs, as well as the userspace libuuid library.
     (me, based on a previous version from Amir)

   - consolidated generic uuid/guid helper functions lifted from XFS and
     libnvdimm (Amir and me)

   - conversions to the new types and helpers (Amir, Andy and me)"

* tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
  ACPI: hns_dsaf_acpi_dsm_guid can be static
  mmc: sdhci-pci: make guid intel_dsm_guid static
  uuid: Take const on input of uuid_is_null() and guid_is_null()
  thermal: int340x_thermal: fix compile after the UUID API switch
  thermal: int340x_thermal: Switch to use new generic UUID API
  acpi: always include uuid.h
  ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
  ACPI / extlog: Switch to use new generic UUID API
  ACPI / bus: Switch to use new generic UUID API
  ACPI / APEI: Switch to use new generic UUID API
  acpi, nfit: Switch to use new generic UUID API
  MAINTAINERS: add uuid entry
  tmpfs: generate random sb->s_uuid
  scsi_debug: switch to uuid_t
  nvme: switch to uuid_t
  sysctl: switch to use uuid_t
  partitions/ldm: switch to use uuid_t
  overlayfs: use uuid_t instead of uuid_be
  fs: switch ->s_uuid to uuid_t
  ima/policy: switch to use uuid_t
  ...
2017-07-03 09:55:26 -07:00
Christoph Hellwig
fdd050b5b3 Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into nvme-base 2017-06-13 11:45:14 +02:00
Christoph Hellwig
4e4cbee93d block: switch bios to blk_status_t
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
David Miller
d41519a69b crypto: Work around deallocated stack frame reference gcc bug on sparc.
On sparc, if we have an alloca() like situation, as is the case with
SHASH_DESC_ON_STACK(), we can end up referencing deallocated stack
memory.  The result can be that the value is clobbered if a trap
or interrupt arrives at just the right instruction.

It only occurs if the function ends returning a value from that
alloca() area and that value can be placed into the return value
register using a single instruction.

For example, in lib/libcrc32c.c:crc32c() we end up with a return
sequence like:

        return  %i7+8
         lduw   [%o5+16], %o0   ! MEM[(u32 *)__shash_desc.1_10 + 16B],

%o5 holds the base of the on-stack area allocated for the shash
descriptor.  But the return released the stack frame and the
register window.

So if an intererupt arrives between 'return' and 'lduw', then
the value read at %o5+16 can be corrupted.

Add a data compiler barrier to work around this problem.  This is
exactly what the gcc fix will end up doing as well, and it absolutely
should not change the code generated for other cpus (unless gcc
on them has the same bug :-)

With crucial insight from Eric Sandeen.

Cc: <stable@vger.kernel.org>
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-06-08 17:36:03 +08:00
Christoph Hellwig
85787090a2 fs: switch ->s_uuid to uuid_t
For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers.  More to come..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:12 +02:00
Eric Biggers
aaebdee8b8 f2fs: don't bother checking for encryption key in ->write_iter()
Since only an open file can be written to, and we only allow open()ing
an encrypted file when its key is available, there is no need to check
for the key again before permitting each ->write_iter().

This code was also broken in that it wouldn't actually have failed if
the key was in fact unavailable.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:11:08 -07:00
Eric Biggers
b82a6ea6ec f2fs: don't bother checking for encryption key in ->mmap()
Since only an open file can be mmap'ed, and we only allow open()ing an
encrypted file when its key is available, there is no need to check for
the key again before permitting each mmap().

This f2fs copy of this code was also broken in that it wouldn't actually
have failed if the key was in fact unavailable.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:10:36 -07:00
Chao Yu
6afae6336a f2fs: wait discard IO completion without cmd_lock held
Wait discard IO completion outside cmd_lock to avoid long latency
of holding cmd_lock in IO busy scenario.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:10:03 -07:00
Chao Yu
e31b982157 f2fs: wake up all waiters in f2fs_submit_discard_endio
There could be more than one waiter waiting discard IO completion, so we
need use complete_all() instead of complete() in f2fs_submit_discard_endio
to avoid hungtask.

Fixes: 	ec9895add2 ("f2fs: don't hold cmd_lock during waiting discard
command")
Cc: <stable@vger.kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:09:54 -07:00
Chao Yu
04dfc23006 f2fs: show more info if fail to issue discard
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:09:45 -07:00
Chao Yu
fb830fc5cf f2fs: introduce io_list for serialize data/node IOs
Serialize data/node IOs by using fifo list instead of mutex lock,
it will help to enhance concurrency of f2fs, meanwhile keeping LFS
IO semantics.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:09:03 -07:00
Chao Yu
e41e6d75e5 f2fs: split wio_mutex
Split wio_mutex to adjust different temperature bio cache.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:23 -07:00
Yunlei He
963932a93c f2fs: combine huge num of discard rb tree consistence checks
Came across a hungtask caused by huge number of rb tree traversing
during adding discard addrs in cp. This patch combine these consistence
checks and move it to discard thread.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:19 -07:00
Yunlei He
dad48e7312 f2fs: fix a bug caused by NULL extent tree
Thread A:					Thread B:

-f2fs_remount
    -sbi->mount_opt.opt = 0;
						<--- -f2fs_iget
						         -do_read_inode
							     -f2fs_init_extent_tree
							         -F2FS_I(inode)->extent_tree is NULL
        -default_options && parse_options
	    -remount return
						<---  -f2fs_map_blocks
						          -f2fs_lookup_extent_tree
                                                              -f2fs_bug_on(sbi, !et);

The same problem with f2fs_new_inode.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:18 -07:00
Jaegeuk Kim
1d7be27082 f2fs: try to freeze in gc and discard threads
This allows to freeze gc and discard threads.

Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:18 -07:00
Yunlei He
b7b7c4cf1c f2fs: add a new function get_ssr_cost
This patch add a new method get_ssr_cost to select
SSR segment more accurately.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:17 -07:00
Hou Pengyang
bd80a4b981 f2fs: declare load_free_nid_bitmap static
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:16 -07:00
Jaegeuk Kim
cc15620bc8 f2fs: avoid f2fs_lock_op for IPU writes
Currently, if we do get_node_of_data before f2fs_lock_op, there may be dead lock
as follows, where process A would be in infinite loop, and B will NOT be awaked.

Process A(cp):            Process B:
f2fs_lock_all(sbi)
                        get_dnode_of_data <---- lock dn.node_page
flush_nodes             f2fs_lock_op

So, this patch adds f2fs_trylock_op to avoid f2fs_lock_op done by IPU.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:15 -07:00
Jaegeuk Kim
a912b54d3a f2fs: split bio cache
Split DATA/NODE type bio cache according to different temperature,
so write IOs with the same temperature can be merged in corresponding
bio cache as much as possible, otherwise, different temperature write
IOs submitting into one bio cache will always cause split of bio.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:39 -07:00
Jaegeuk Kim
81377bd628 f2fs: use fio instead of multiple parameters
This patch just changes using fio instead of parameters.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:38 -07:00
Jaegeuk Kim
b9109b0e49 f2fs: remove unnecessary read cases in merged IO flow
Merged IO flow doesn't need to care about read IOs.

f2fs_submit_merged_bio -> f2fs_submit_merged_write
f2fs_submit_merged_bios -> f2fs_submit_merged_writes
f2fs_submit_merged_bio_cond -> f2fs_submit_merged_write_cond

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:37 -07:00
Jaegeuk Kim
1919ffc0d7 f2fs: use f2fs_submit_page_bio for ra_meta_pages
This patch avoids to use f2fs_submit_merged_bio for read, which was the only
read case.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:36 -07:00
Weichao Guo
e5dbd9563e f2fs: make sure f2fs_gc returns consistent errno
By default, f2fs_gc returns -EINVAL in general error cases, e.g., no victim
was selected. However, the default errno may be overwritten in two cases:
gc_more and BG_GC -> FG_GC. We should return consistent errno in such cases.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:35 -07:00
Chao Yu
1c6d8ee4b8 f2fs: support statx
Last kernel has already support new syscall statx() in commit a528d35e8b
("statx: Add a system call to make enhanced file info available"), with
this interface we can show more file info including file creation and some
attribute flags to user.

This patch tries to support this functionality.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:34 -07:00
Jaegeuk Kim
93607124c5 f2fs: load inode's flag from disk
This patch fixes missing inode flag loaded from disk, reported by Tom.

[tom@localhost ~]$ sudo mount /dev/loop0 /mnt/
[tom@localhost ~]$ sudo chown tom:tom /mnt/
[tom@localhost ~]$ touch /mnt/testfile
[tom@localhost ~]$ sudo chattr +i /mnt/testfile
[tom@localhost ~]$ echo test > /mnt/testfile
bash: /mnt/testfile: Operation not permitted
[tom@localhost ~]$ rm /mnt/testfile
rm: cannot remove '/mnt/testfile': Operation not permitted
[tom@localhost ~]$ sudo umount /mnt/
[tom@localhost ~]$ sudo mount /dev/loop0 /mnt/
[tom@localhost ~]$ lsattr /mnt/testfile
----i-------------- /mnt/testfile
[tom@localhost ~]$ echo test > /mnt/testfile
[tom@localhost ~]$ rm /mnt/testfile
[tom@localhost ~]$ sudo umount /mnt/

Cc: stable@vger.kernel.org
Reported-by: Tom Yan <tom.ty89@outlook.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:31 -07:00
Jin Qian
15d3042a93 f2fs: sanity check checkpoint segno and blkoff
Make sure segno and blkoff read from raw image are valid.

Cc: stable@vger.kernel.org
Signed-off-by: Jin Qian <jinqian@google.com>
[Jaegeuk Kim: adjust minor coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-16 13:29:39 -07:00
Linus Torvalds
bf5f89463f Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - the rest of MM

 - various misc things

 - procfs updates

 - lib/ updates

 - checkpatch updates

 - kdump/kexec updates

 - add kvmalloc helpers, use them

 - time helper updates for Y2038 issues. We're almost ready to remove
   current_fs_time() but that awaits a btrfs merge.

 - add tracepoints to DAX

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
  drivers/staging/ccree/ssi_hash.c: fix build with gcc-4.4.4
  selftests/vm: add a test for virtual address range mapping
  dax: add tracepoint to dax_insert_mapping()
  dax: add tracepoint to dax_writeback_one()
  dax: add tracepoints to dax_writeback_mapping_range()
  dax: add tracepoints to dax_load_hole()
  dax: add tracepoints to dax_pfn_mkwrite()
  dax: add tracepoints to dax_iomap_pte_fault()
  mtd: nand: nandsim: convert to memalloc_noreclaim_*()
  treewide: convert PF_MEMALLOC manipulations to new helpers
  mm: introduce memalloc_noreclaim_{save,restore}
  mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
  mm/huge_memory.c: deposit a pgtable for DAX PMD faults when required
  mm/huge_memory.c: use zap_deposited_table() more
  time: delete CURRENT_TIME_SEC and CURRENT_TIME
  gfs2: replace CURRENT_TIME with current_time
  apparmorfs: replace CURRENT_TIME with current_time()
  lustre: replace CURRENT_TIME macro
  fs: ubifs: replace CURRENT_TIME_SEC with current_time
  fs: ufs: use ktime_get_real_ts64() for birthtime
  ...
2017-05-08 18:17:56 -07:00
Deepa Dinamani
48fbfe50f1 fs: f2fs: use ktime_get_real_seconds for sit_info times
CURRENT_TIME_SEC is not y2038 safe.

Replace use of CURRENT_TIME_SEC with ktime_get_real_seconds in segment
timestamps used by GC algorithm including the segment mtime timestamps.

Link: http://lkml.kernel.org/r/1491613030-11599-2-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Michal Hocko
a7c3e901a4 mm: introduce kv[mz]alloc helpers
Patch series "kvmalloc", v5.

There are many open coded kmalloc with vmalloc fallback instances in the
tree.  Most of them are not careful enough or simply do not care about
the underlying semantic of the kmalloc/page allocator which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds b) the page allocator can
invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers, I could find, have been converted to use the helper
instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.

[1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

This patch (of 9):

Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code.  Yet we do not have any common helper
for that and so users have invented their own helpers.  Some of them are
really creative when doing so.  Let's just add kv[mz]alloc and make sure
it is implemented properly.  This implementation makes sure to not make
a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
to not warn about allocation failures.  This also rules out the OOM
killer as the vmalloc is a more approapriate fallback than a disruptive
user visible action.

This patch also changes some existing users and removes helpers which
are specific for them.  In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL).  Those need to be
fixed separately.

While we are at it, document that __vmalloc{_node} about unsupported gfp
mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers.  Existing ones would have to die
slowly.

[sfr@canb.auug.org.au: f2fs fixup]
  Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>	[ext4 part]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Linus Torvalds
70ef8f0d37 for-f2fs-4.12
In this round, we've focused on enhancing performance with regards to block
 allocation, GC, and discard/in-place-update IO controls. There are a bunch
 of clean-ups as well as minor bug fixes.
 
 = Enhancement
 - disable heap-based allocation by default
 - issue small-sized discard commands by default
 - change the policy of data hotness for logging
 - distinguish IOs in terms of size and wbc type
 - start SSR earlier to avoid foreground GC
 - enhance data structures managing discard commands
 - enhance in-place update flow
 - add some more fault injection routines
 - secure one more xattr entry
 
 = Bug fix
 - calculate victim cost for GC correctly
 - remain correct victim segment number for GC
 - race condition in nid allocator and initializer
 - stale pointer produced by atomic_writes
 - fix missing REQ_SYNC for flush commands
 - handle missing errors in more corner cases
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJZEKXrAAoJEEAUqH6CSFDSJJ8P/1Zy0NS9TM/PFtT7Sevb6vgC
 LcKLtX1bVhUuX9wAt5Q6BZ9927tCQPt5vLEYUxtniqEQaC0fsJAMbRYot+gR/dvN
 4bGgv1TeVST5pKbmctzhAL30PvZ1w4QS6dLvPMm2sPQSrPKGUGt0J8wPiHHZuvH4
 pygKzDxbrIJTeMhLm9tgFg7dWTJXV3VDb57WpA1AM1LAFVsIPF4vZnryLv3GsRmY
 eGRxgZEtt/90hCRbEcPirPZrtpv/O5f12K4Vp/NPw+4XGMEk+nTYndq6rlUWVNjg
 iPEDuxONyk/yb274SqB6sbNDuxHOqn7stGJepdUpSbprIsLZ0RmMaYWjSNsLU3Vh
 p4fAzRqvfSqAHCt0FEL/vT8M9ST5xQRVr9P/l0kDK5Ww95RROd05bEaGm/sKc7NB
 PHiWUoMIFFmuVsoCi6sM0AKps53ZGON8GEUyVKyM7NWTw1oWLPWifGMthEkysmwm
 08SdU5+XqbCeyMPAA2GURqMA5A8ssuA8+F0Citf4JPckQHPPj5pAydmx2wVlfBlc
 /bneR7T/8OsUbxgG8JSbdHUiPcjb20F0GTxSOTXiV/AaZAMCtyETnw64K2V6E0n7
 uraKcYYhypyphCj/IYc4vnQ3dCu3U2/NvTYEVX8DBvboN38/JVqmNWgQx9g+tLzj
 +r5s7PqTDuXv5Cfzc5NC
 =SBUb
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've focused on enhancing performance with regards to
  block allocation, GC, and discard/in-place-update IO controls. There
  are a bunch of clean-ups as well as minor bug fixes.

  Enhancements:
   - disable heap-based allocation by default
   - issue small-sized discard commands by default
   - change the policy of data hotness for logging
   - distinguish IOs in terms of size and wbc type
   - start SSR earlier to avoid foreground GC
   - enhance data structures managing discard commands
   - enhance in-place update flow
   - add some more fault injection routines
   - secure one more xattr entry

  Bug fixes:
   - calculate victim cost for GC correctly
   - remain correct victim segment number for GC
   - race condition in nid allocator and initializer
   - stale pointer produced by atomic_writes
   - fix missing REQ_SYNC for flush commands
   - handle missing errors in more corner cases"

* tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (111 commits)
  f2fs: fix a mount fail for wrong next_scan_nid
  f2fs: enhance scalability of trace macro
  f2fs: relocate inode_{,un}lock in F2FS_IOC_SETFLAGS
  f2fs: Make flush bios explicitely sync
  f2fs: show available_nids in f2fs/status
  f2fs: flush dirty nats periodically
  f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard
  f2fs: allow cpc->reason to indicate more than one reason
  f2fs: release cp and dnode lock before IPU
  f2fs: shrink size of struct discard_cmd
  f2fs: don't hold cmd_lock during waiting discard command
  f2fs: nullify fio->encrypted_page for each writes
  f2fs: sanity check segment count
  f2fs: introduce valid_ipu_blkaddr to clean up
  f2fs: lookup extent cache first under IPU scenario
  f2fs: reconstruct code to write a data page
  f2fs: introduce __wait_discard_cmd
  f2fs: introduce __issue_discard_cmd
  f2fs: enable small discard by default
  f2fs: delay awaking discard thread
  ...
2017-05-08 12:24:17 -07:00
Eric Biggers
1f73d49177 f2fs: switch to using fscrypt_match_name()
Switch f2fs directory searches to use the fscrypt_match_name() helper
function.  There should be no functional change.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:39 -04:00
Eric Biggers
6b06cdee81 fscrypt: avoid collisions when presenting long encrypted filenames
When accessing an encrypted directory without the key, userspace must
operate on filenames derived from the ciphertext names, which contain
arbitrary bytes.  Since we must support filenames as long as NAME_MAX,
we can't always just base64-encode the ciphertext, since that may make
it too long.  Currently, this is solved by presenting long names in an
abbreviated form containing any needed filesystem-specific hashes (e.g.
to identify a directory block), then the last 16 bytes of ciphertext.
This needs to be sufficient to identify the actual name on lookup.

However, there is a bug.  It seems to have been assumed that due to the
use of a CBC (ciphertext block chaining)-based encryption mode, the last
16 bytes (i.e. the AES block size) of ciphertext would depend on the
full plaintext, preventing collisions.  However, we actually use CBC
with ciphertext stealing (CTS), which handles the last two blocks
specially, causing them to appear "flipped".  Thus, it's actually the
second-to-last block which depends on the full plaintext.

This caused long filenames that differ only near the end of their
plaintexts to, when observed without the key, point to the wrong inode
and be undeletable.  For example, with ext4:

    # echo pass | e4crypt add_key -p 16 edir/
    # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
    # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
    100000
    # sync
    # echo 3 > /proc/sys/vm/drop_caches
    # keyctl new_session
    # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
    2004
    # rm -rf edir/
    rm: cannot remove 'edir/_A7nNFi3rhkEQlJ6P,hdzluhODKOeWx5V': Structure needs cleaning
    ...

To fix this, when presenting long encrypted filenames, encode the
second-to-last block of ciphertext rather than the last 16 bytes.

Although it would be nice to solve this without depending on a specific
encryption mode, that would mean doing a cryptographic hash like SHA-256
which would be much less efficient.  This way is sufficient for now, and
it's still compatible with encryption modes like HEH which are strong
pseudorandom permutations.  Also, changing the presented names is still
allowed at any time because they are only provided to allow applications
to do things like delete encrypted directories.  They're not designed to
be used to persistently identify files --- which would be hard to do
anyway, given that they're encrypted after all.

For ease of backports, this patch only makes the minimal fix to both
ext4 and f2fs.  It leaves ubifs as-is, since ubifs doesn't compare the
ciphertext block yet.  Follow-on patches will clean things up properly
and make the filesystems use a shared helper function.

Fixes: 5de0b4d0cd ("ext4 crypto: simplify and speed up filename encryption")
Reported-by: Gwendal Grignou <gwendal@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:36 -04:00
Jaegeuk Kim
6332cd32c8 f2fs: check entire encrypted bigname when finding a dentry
If user has no key under an encrypted dir, fscrypt gives digested dentries.
Previously, when looking up a dentry, f2fs only checks its hash value with
first 4 bytes of the digested dentry, which didn't handle hash collisions fully.
This patch enhances to check entire dentry bytes likewise ext4.

Eric reported how to reproduce this issue by:

 # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
 # find edir -type f | xargs stat -c %i | sort | uniq | wc -l
100000
 # sync
 # echo 3 > /proc/sys/vm/drop_caches
 # keyctl new_session
 # find edir -type f | xargs stat -c %i | sort | uniq | wc -l
99999

Cc: <stable@vger.kernel.org>
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(fixed f2fs_dentry_hash() to work even when the hash is 0)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:35 -04:00
Eric Biggers
faac7fd97e f2fs: sync f2fs_lookup() with ext4_lookup()
As for ext4, now that fscrypt_has_permitted_context() correctly handles
the case where we have the key for the parent directory but not the
child, f2fs_lookup() no longer has to work around it.  Also add the same
warning message that ext4 uses.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:34 -04:00
Yunlei He
e9cdd30770 f2fs: fix a mount fail for wrong next_scan_nid
-write_checkpoint
   -do_checkpoint
      -next_free_nid    <--- something wrong with next free nid

-f2fs_fill_super
   -build_node_manager
      -build_free_nids
          -get_current_nat_page
             -__get_meta_page   <--- attempt to access beyond end of device

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 19:00:30 -07:00
Chao Yu
a72d4b97bb f2fs: relocate inode_{,un}lock in F2FS_IOC_SETFLAGS
This patch expands cover region of inode->i_rwsem to keep setting flag
atomically.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 14:30:19 -07:00
Jan Kara
3adc5fcb7e f2fs: Make flush bios explicitely sync
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions.

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

Fixes: b685d3d65a
Cc: stable@vger.kernel.org # 4.9+
CC: Jaegeuk Kim <jaegeuk@kernel.org>
CC: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 14:30:18 -07:00
Jaegeuk Kim
5b0ef73c9d f2fs: show available_nids in f2fs/status
This patch adds an entry in f2fs/status to show # of available nids.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:57 -07:00
Jaegeuk Kim
1c0f4bf5c3 f2fs: flush dirty nats periodically
This patch flushes dirty nats in order to acquire available nids by writing
checkpoint. Otherwise, we can have no chance to get freed nids.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:56 -07:00
Chao Yu
1f43e2ad7b f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard
Introduce CP_TRIMMED_FLAG to indicate all invalid block were trimmed
before umount, so once we do mount with image which contain the flag,
we don't record invalid blocks as undiscard one, when fstrim is being
triggered, we can avoid issuing redundant discard commands.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:56 -07:00
Chao Yu
c473f1a965 f2fs: allow cpc->reason to indicate more than one reason
Change to use different bits of cpc->reason to indicate different status,
so cpc->reason can indicate more than one reason.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:55 -07:00
Hou Pengyang
279d6df20c f2fs: release cp and dnode lock before IPU
We don't need to rewrite the page under cp_rwsem and dnode locks.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:54 -07:00
Chao Yu
9a744b92da f2fs: shrink size of struct discard_cmd
In order to shrink size of struct discard_cmd, change variable type of
@state in struct discard_cmd from int to unsigned char.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:51 -07:00
Chao Yu
ec9895add2 f2fs: don't hold cmd_lock during waiting discard command
Previously, with protection of cmd_lock, we will wait for end io of
discard command which potentially may lead long latency, making worse
concurrency.

So, in this patch, we try to add reference into discard entry to prevent
the entry being released by other thread, then we can avoid holding
global cmd_lock during waiting discard to finish.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:50 -07:00
Jaegeuk Kim
4d97807813 f2fs: nullify fio->encrypted_page for each writes
This makes sure each write request has nullified encrypted_page pointer.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:49 -07:00
Jin Qian
b9dd46188e f2fs: sanity check segment count
F2FS uses 4 bytes to represent block address. As a result, supported
size of disk is 16 TB and it equals to 16 * 1024 * 1024 / 2 segments.

Signed-off-by: Jin Qian <jinqian@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:48 -07:00
Jaegeuk Kim
a817737e87 f2fs: introduce valid_ipu_blkaddr to clean up
This patch introduces valid_ipu_blkaddr to clean up checking block address for
inplace-update.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:48 -07:00
Hou Pengyang
e959c8f543 f2fs: lookup extent cache first under IPU scenario
If a page is cold, NOT atomit written and need_ipu now, there is
a high probability that IPU should be adapted. For IPU, we try to
check extent tree to get the block index first, instead of reading
the dnode page, where may lead to an useless dnode IO, since no need to
update the dnode index for IPU.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:47 -07:00
Hou Pengyang
7eab0c0df8 f2fs: reconstruct code to write a data page
This patch introduces encrypt_one_page which encrypts one data page before
submit_bio, and change the use of need_inplace_update.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:46 -07:00
Chao Yu
63a94fa1d7 f2fs: introduce __wait_discard_cmd
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:45 -07:00
Chao Yu
bd5b07383a f2fs: introduce __issue_discard_cmd
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:44 -07:00
Chao Yu
d618ebaf0a f2fs: enable small discard by default
This patch start to enable 4K granularity small discard by default
when realtime discard is on, so, in seriously fragmented space,
small size discard can be issued in time to avoid useless storage
space occupying of invalid filesystem's data, then performance of
flash storage can be recovered.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25 14:18:45 -07:00
Chao Yu
34e159da41 f2fs: delay awaking discard thread
It's better to delay awaking discard thread while queuing discard commands
in checkpoint, it will help to give more chances for merging big and small
discard.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25 14:18:44 -07:00
Yunlei He
66a82d1fc7 f2fs: seperate read nat page from nat_tree_lock
This patch seperate nat page read io from nat_tree_lock.

-lock_page
	-get_node_info()
		-current_nat_addr

			......           	->       write_checkpoint

			-get_meta_page

Because we lock node page, we can make sure no other threads
modify this nid concurrently. So we just obtain current_nat_addr
under nat_tree_lock, node info is always same in both nat pack.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25 14:16:39 -07:00
Sheng Yong
d3bb910c15 f2fs: fix multiple f2fs_add_link() having same name for inline dentry
Commit 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having
same name) does not cover the scenario where inline dentry is enabled.
In that case, F2FS_I(dir)->task will be NULL, and __f2fs_add_link will
lookup dentries one more time.

This patch fixes it by moving the assigment of current task to a upper
level to cover both normal and inline dentry.

Cc: <stable@vger.kernel.org>
Fixes: 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having same name)
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25 14:16:31 -07:00
Hou Pengyang
4086d3f61b f2fs: skip encrypted inode in ASYNC IPU policy
Async request may be throttled in block layer, so page for async may keep WRITE_BACK
for a long time.

For encrytped inode, we need wait on page writeback no matter if the device supports
BDI_CAP_STABLE_WRITES. This may result in a higher waiting page writeback time for
async encrypted inode page.

This patch skips IPU for encrypted inode's updating write.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 13:13:24 -07:00
Jaegeuk Kim
a788189305 f2fs: fix out-of free segments
This patch also reverts d0db7703ac ("f2fs: do SSR in higher priority").

This patch fixes out of free segments caused by many small file creation by
1) mkfs -s 1 2G
2) mount
3) untar
 - preoduce 60000 small files burstly
4) sync
 - flush node pages
 - flush imeta

Here, when we do f2fs_balance_fs, we missed # of imeta blocks, resulting in
skipping to check has_not_enough_free_secs.

Another test is done by
1) mkfs -s 12 2G
2) mount
3) untar
 - preoduce 60000 small files burstly
4) sync
 - flush node pages
 - flush imeta

In this case, this patch also fixes wrong block allocation under large section
size.

Reported-by: William Brana <wbrana@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 13:13:23 -07:00
Arnd Bergmann
d66450e773 f2fs: improve definition of statistic macros
With a recent addition of f2fs_lookup_extent_tree(), we get a warning about
the use of empty macros:

fs/f2fs/extent_cache.c: In function 'f2fs_lookup_extent_tree':
fs/f2fs/extent_cache.c:358:32: error: suggest braces around empty body in an 'else' statement [-Werror=empty-body]
   stat_inc_rbtree_node_hit(sbi);

A good way to avoid the warning and make the code more robust is to define
all no-op macros as 'do { } while (0)'.

Fixes: 54c2258cd6 ("f2fs: extract rb-tree operation infrastructure")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reivewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 13:13:22 -07:00
Jaegeuk Kim
d579324998 f2fs: assign allocation hint for warm/cold data
This patch gives slower device region to warm/cold data area more eagerly.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 13:06:53 -07:00
Jaegeuk Kim
d07efb5077 f2fs: fix _IOW usage
This patch fixes wrong _IOW usage.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 12:55:45 -07:00
Jaegeuk Kim
e066b83c9b f2fs: add ioctl to flush data from faster device to cold area
This patch adds an ioctl to flush data in faster device to cold area. User can
give device number and number of segments to move. It doesn't move it if there
is only one device.

The parameter looks like:

struct f2fs_flush_device {
	u32 dev_num;		/* device number to flush */
	u32 segments;		/* # of segments to flush */
};

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 12:55:41 -07:00
Hou Pengyang
04485987f0 f2fs: introduce async IPU policy
This patch introduces an ASYNC IPU policy.

Under senario of large # of async updating(e.g. log writing in Android),
disk would be seriously fragmented, and higher frequent gc would be triggered.

This patch uses IPU to rewrite the async update writting, since async is
NOT sensitive to io latency.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
2017-04-19 11:00:46 -07:00
Chao Yu
d84d1cbdec f2fs: add undiscard blocks stat
This patch adds to account undiscard blocks.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
2017-04-19 11:00:45 -07:00
Chao Yu
001c584cca f2fs: unlock cp_rwsem early for IPU writes
For IPU writes, there won't be any udpates in dnode page since we
will reuse old block address instead of allocating new one, so we
don't need to lock cp_rwsem during IPU IO submitting.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
2017-04-19 11:00:44 -07:00
Chao Yu
df0f6b44dd f2fs: introduce __check_rb_tree_consistence
Introduce __check_rb_tree_consistence to check consistence of rb-tree
based discard cache in runtime.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-19 11:00:44 -07:00
Chao Yu
0243a5f9da f2fs: trace __submit_discard_cmd
Add an even class f2fs_discard for introducing f2fs_queue_discard, then
use f2fs_{queue,issue}_discard to trace __{queue,submit}_discard_cmd.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-19 11:00:43 -07:00
Chao Yu
ba48a33ef6 f2fs: in prior to issue big discard
Keep issuing big size discard in prior instead of the one with random
size, so that we expect that it will help to:
- be quick to recycle unused large space in flash storage device.
- give a chance for
  a) wait to merge small piece discards into bigger one, or
  b) avoid issuing discards while they have being reallocated by SSR.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-19 11:00:42 -07:00
Chao Yu
46f84c2c05 f2fs: clean up discard_cmd_control structure
Avoid long variable name in discard_cmd_control structure, no logic
change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-19 11:00:41 -07:00
Chao Yu
004b686218 f2fs: use rb-tree to track pending discard commands
Introduce rb-tree based discard cache infrastructure to speed up lookup and
merge operation of discard entry.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: initialize dc to avoid build warning]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-19 11:00:40 -07:00
Jaegeuk Kim
d40d30c5aa f2fs: avoid dirty node pages in check_only recovery
In the check_only mode, we should not make any dirty node pages. Otherwise,
we can get this panic:

F2FS-fs (nvme0n1p1): Need to recover fsync data
------------[ cut here ]------------
kernel BUG at fs/f2fs/node.c:2204!
CPU: 7 PID: 19923 Comm: mount Tainted: G           OE   4.9.8 #2
RIP: 0010:[<ffffffffc0979c0b>]  [<ffffffffc0979c0b>] flush_nat_entries+0x43b/0x7d0 [f2fs]
Call Trace:
 [<ffffffffc096ddaa>] ? __f2fs_submit_merged_bio+0x5a/0xd0 [f2fs]
 [<ffffffffc096ddaa>] ? __f2fs_submit_merged_bio+0x5a/0xd0 [f2fs]
 [<ffffffffc096dddb>] ? __f2fs_submit_merged_bio+0x8b/0xd0 [f2fs]
 [<ffffffff860e450f>] ? up_write+0x1f/0x40
 [<ffffffffc096dddb>] ? __f2fs_submit_merged_bio+0x8b/0xd0 [f2fs]
 [<ffffffffc0969f04>] write_checkpoint+0x2f4/0xf20 [f2fs]
 [<ffffffff860e938d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffc0960bc9>] ? f2fs_sync_fs+0x79/0x190 [f2fs]
 [<ffffffffc0960bc9>] ? f2fs_sync_fs+0x79/0x190 [f2fs]
 [<ffffffffc0960bd5>] f2fs_sync_fs+0x85/0x190 [f2fs]
 [<ffffffffc097b6de>] f2fs_balance_fs_bg+0x7e/0x1c0 [f2fs]
 [<ffffffffc0977b64>] f2fs_write_node_pages+0x34/0x350 [f2fs]
 [<ffffffff860e5f42>] ? __lock_is_held+0x52/0x70
 [<ffffffff861d9b31>] do_writepages+0x21/0x30
 [<ffffffff86298ce1>] __writeback_single_inode+0x61/0x760
 [<ffffffff86909127>] ? _raw_spin_unlock+0x27/0x40
 [<ffffffff8629a735>] writeback_single_inode+0xd5/0x190
 [<ffffffff8629a889>] write_inode_now+0x99/0xc0
 [<ffffffff86283876>] iput+0x1f6/0x2c0
 [<ffffffffc0964b52>] f2fs_fill_super+0xc32/0x10c0 [f2fs]
 [<ffffffff86266462>] mount_bdev+0x182/0x1b0
 [<ffffffffc0963f20>] ? f2fs_commit_super+0x100/0x100 [f2fs]
 [<ffffffffc0960da5>] f2fs_mount+0x15/0x20 [f2fs]
 [<ffffffff86266e08>] mount_fs+0x38/0x170
 [<ffffffff86288bab>] vfs_kern_mount+0x6b/0x160
 [<ffffffff8628bcfe>] do_mount+0x1be/0xd60

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-18 13:37:49 -07:00
Jaegeuk Kim
d29fd17218 f2fs: fix not to set fsync/dentry mark
Otherwise, we can see stale fsync/dentry mark given by previous calls, resulting
in giving up roll-forward recovery due to wrong dentry mark.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-12 12:57:09 -07:00
Jaegeuk Kim
6c3acd9757 f2fs: allocate hot_data for atomic writes
We'd better allocate atomic writes to hot_data zone.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-12 12:57:08 -07:00
Jaegeuk Kim
3097388354 f2fs: give time to flush dirty pages for checkpoint
If all the threads are waiting for checkpoint, we have no chance to flush
required dirty pages.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-12 12:57:07 -07:00
Jaegeuk Kim
9bb02c3627 f2fs: fix fs corruption due to zero inode page
This patch fixes the following scenario.

- f2fs_create/f2fs_mkdir             - write_checkpoint
 - f2fs_mark_inode_dirty_sync         - block_operations
                                       - f2fs_lock_all
                                       - f2fs_sync_inode_meta
                                        - f2fs_unlock_all
                                        - sync_inode_metadata
 - f2fs_lock_op
                                         - f2fs_write_inode
                                          - update_inode_page
                                           - get_node_page
                                             return -ENOENT
 - new_inode_page
  - fill_node_footer
 - f2fs_mark_inode_dirty_sync
 - ...
 - f2fs_unlock_op
                                          - f2fs_inode_synced
                                       - f2fs_lock_all
                                       - do_checkpoint

In this checkpoint, we can get an inode page which contains zeros having valid
node footer only.

Cc: <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-12 12:57:06 -07:00
Chao Yu
a54455f5ee f2fs: shrink blk plug region
Don't use blk plug covering area where there won't be any IOs being issued.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-12 12:57:05 -07:00
Chao Yu
54c2258cd6 f2fs: extract rb-tree operation infrastructure
rb-tree lookup/update functions are deeply coupled into extent cache
codes, it's very hard to reuse these basic functions, this patch
extracts common rb-tree operation infrastructure for latter reusing.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-11 15:13:52 -07:00
Jaegeuk Kim
8fd5a37efa f2fs: avoid frequent checkpoint during f2fs_gc
Now we're doing SSR aggressively more than ever before, so once we reach to
the reserved_segment, f2fs_balance_fs will call f2fs_gc, which triggers
checkpoint everytime. We actually must avoid that.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-11 15:12:39 -07:00
Jaegeuk Kim
4ddb1a4d4d f2fs: clean up some macros in terms of GET_SEGNO
This patch cleans several macros by introducing:
- BLKS_PER_SEC
- GET_SEC_FROM_SEG
- GET_SEG_FROM_SEC
- GET_ZONE_FROM_SEC
- GET_ZONE_FROM_SEG

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:13 -07:00
Jaegeuk Kim
302bd34882 f2fs: clean up get_valid_blocks with consistent parameter
This patch cleans up get_valid_blocks, which has no functional change.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:12 -07:00
Jaegeuk Kim
63fcf8e8d6 f2fs: use segment number for get_valid_blocks
This patch fixes to submit a segment number for get_valid_blocks.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:11 -07:00
Tomohiro Kusumi
68afcf2d38 f2fs: guard macro variables with braces
Add braces around variables used within macros for those make sense
to do it. Many of the macros in f2fs already do this. What this commit
doesn't do is anything that changes line# as a result of adding braces,
which usually affects the binary via __LINE__.

Confirmed no diff in fs/f2fs/f2fs.ko before/after this commit on x86_64,
to make sure this has no functional change as well as there's been no
unexpected side effect due to callers' arithmetics within the existing
code.

Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:10 -07:00
Tomohiro Kusumi
771a9a7177 f2fs: fix comment on f2fs_flush_merged_bios() after 86531d6b
Callers are to unlock the page on failure after 86531d6b.

Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:10 -07:00
Chao Yu
fa64a0036c f2fs: prevent waiter encountering incorrect discard states
In f2fs_submit_discard_endio, we will wake up waiter before setting
discard command states, so waiter may use incorrect states. Change
the order between complete() and states setting to fix this issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:09 -07:00
Chao Yu
d431413f00 f2fs: introduce f2fs_wait_discard_bios
Split f2fs_wait_discard_bios from f2fs_wait_discard_bio, just for cleanup,
no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:08 -07:00
Chao Yu
22d375dd9c f2fs: split discard_cmd_list
Split discard_cmd_list to discard_{pend,wait}_list, so while sending/waiting
discard command, we can avoid traversing unneeded entries in original list.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:07 -07:00
Jaegeuk Kim
c6f82fe90d Revert "f2fs: put allocate_segment after refresh_sit_entry"
This reverts commit 3436c4bdb3.

This makes a leak to register dirty segments. I reproduced the issue by
modified postmark which injects a lot of file create/delete/update and
finally triggers huge number of SSR allocations.

Cc: <stable@vger.kernel.org> # v4.10+
[Jaegeuk Kim: Change missing incorrect comment]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:06 -07:00
Tomohiro Kusumi
64c24ecb3c f2fs: split make_dentry_ptr() into block and inline versions
Since callers statically know which type to use, make_dentry_ptr()
can simply be splitted into two inline functions. This way, the code
has less inlined, fewer arguments, and no cast.

Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:08 -07:00
Jaegeuk Kim
d1b3e72d54 f2fs: submit bio of in-place-update pages
This patch tries to split in-place-update bios from sequential bios.

Suggested-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:07 -07:00
Kaixu Xia
fc2e2875d5 f2fs: remove the redundant variable definition
The variable 'i' has been defined before, so here we can
use it directly.

Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:07 -07:00
Jaegeuk Kim
687de7f101 f2fs: avoid IO split due to mixed WB_SYNC_ALL and WB_SYNC_NONE
If two threads try to flush dirty pages in different inodes respectively,
f2fs_write_data_pages() will produce WRITE and WRITE_SYNC one at a time,
resulting in a lot of 4KB seperated IOs.

So, this patch gives higher priority to WB_SYNC_ALL IOs and gathers write
IOs with a big WRITE_SYNC'ed bio.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:06 -07:00
Jaegeuk Kim
ef095d19e8 f2fs: write small sized IO to hot log
It would better split small and large IOs separately in order to get more
consecutive big writes.

The default threshold is set to 64KB, but configurable by sysfs/min_hot_blocks.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:05 -07:00
Chao Yu
a7eeb82385 f2fs: use bitmap in discard_entry
This patch changes to use bitmap instead of extent in struct discard_entry
to indicate discard range in one segment, for fragmented space, this
implementation can save memory footprint.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:04 -07:00
Chao Yu
f099405fc8 f2fs: clean up destroy_discard_cmd_control
Remove unneeded parameter and simply change flow in
destroy_discard_cmd_control.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:04 -07:00
Chao Yu
5f32366a29 f2fs: count discard command entry
Adds to count discard command entry and show the number in debugfs,
also fix to add cost of discard command cache into total comsumed
memory footprint.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:03 -07:00
Chao Yu
8b8dd65f72 f2fs: show issued flush/discard count
Show historical count of flush command and discard command.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:02 -07:00
Jaegeuk Kim
c13ff37e35 f2fs: relax node version check for victim data in gc
- has_not_enough_free_secs
node_secs: 0  dent_secs: 0  freed:0  free_segments:103  reserved:104

          - f2fs_gc
             - get_victim_by_default
alloc_mode 0, gc_mode 1, max_search 2672, offset 4654, ofs_unit 1

                - do_garbage_collect
start_segno 3976, end_segno 3977   type 0

                  - is_alive
nid 22797, blkaddr 2131882, ofs_in_node 0, version 0x8/0x0

                   - gc_data_segment 766, segno 3976, block 512/426 not alive

So, this patch fixes subtle corrupted case where node version does not match
to summary version which results in infinite loop by gc.

Reported-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-29 17:34:40 -07:00
Jaegeuk Kim
796dbbfe4e f2fs: start SSR much eariler to avoid FG_GC
This patch initiates SSR much eariler, resulting in less FG_GC.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-29 17:34:39 -07:00
Jaegeuk Kim
7a20b8a61e f2fs: allocate node and hot data in the beginning of partition
In order to give more spatial locality, this patch changes the block allocation
policy which assigns beginning of partition for small and hot data/node blocks.
In order to do this, we set noheap allocation by default and introduce another
mount option, heap, to reset it back.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-29 17:34:37 -07:00
Jaegeuk Kim
c541a51b8c f2fs: fix wrong max cost initialization
This patch fixes missing increased max cost caused by a patch that we increased
cose of data segments in greedy algorithm.

Cc: <stable@vger.kernel.org> # v4.10+
Fixes: b9cd20619 "f2fs: node segment is prior to data segment selected victim"
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-28 15:49:27 -07:00
Yunlei He
59c9081bc8 f2fs: allow write page cache when writting cp
This patch allow write data to normal file when writting
new checkpoint.

We relax three limitations for write_begin path:
1. data allocation
2. node allocation
3. variables in checkpoint

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-25 00:19:37 -07:00
Chao Yu
22588f8773 f2fs: don't reserve additional space in xattr block
In this patch, we change xattr block disk layout as below:

Before:
			xattr node block layout
+---------------------------------------------+---------------+-------------+
|           node block xattr entries          |   reserved    | node footer |
|                  4068 Bytes                 |    4 Bytes    |  24 Bytes   |

			In memory layout
+--------------------+---------------------------------+--------------------+
|    inline xattr    |     node block xattr entries    |      reserved      |
|     200 Bytes      |         4068 Bytes              |      4 Bytes       |

After:
			xattr node block layout
+-------------------------------------------------------------+-------------+
|                  node block xattr entries                   | node footer |
|                         4072 Bytes                          |  24 Bytes   |

			In memory layout
+--------------------+---------------------------------+--------------------+
|    inline xattr    |     node block xattr entries    |      reserved      |
|     200 Bytes      |         4072 Bytes              |      4 Bytes       |

With this change, we don't need to reserve additional space in node block,
just keep reserved space in logical in-memory layout. So that it would help
to enlarge valid free space of xattr node block.

As tested, generic/026 shows max stored xattr entires number increases from
531 to 532 when inline_xattr option is enabled.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-24 15:10:53 -04:00
Chao Yu
89e9eabd7d f2fs: clean up xattr operation
1. don't allocate redundant memory in read_all_xattrs.
2. introduce RESERVED_XATTR_SIZE for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-24 15:10:52 -04:00
Chao Yu
99f4b917b0 f2fs: don't track volatile file in dirty inode list
Don't track volatile file in dirty inode list, otherwise with data_flush
option, background thread will entry into endless loop for flushing
journal file's pages.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-24 15:10:51 -04:00
Chao Yu
648d50ba12 f2fs: show the max number of volatile operations
This patch adds to show the max number of volatile operations which are
conducting concurrently.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-24 15:10:50 -04:00
Chao Yu
30a61ddf81 f2fs: fix race condition in between free nid allocator/initializer
In below concurrent case, allocated nid can be loaded into free nid cache
and be allocated again.

Thread A				Thread B
- f2fs_create
 - f2fs_new_inode
  - alloc_nid
   - __insert_nid_to_list(ALLOC_NID_LIST)
					- f2fs_balance_fs_bg
					 - build_free_nids
					  - __build_free_nids
					   - scan_nat_page
					    - add_free_nid
					     - __lookup_nat_cache
 - f2fs_add_link
  - init_inode_metadata
   - new_inode_page
    - new_node_page
     - set_node_addr
 - alloc_nid_done
  - __remove_nid_from_list(ALLOC_NID_LIST)
					     - __insert_nid_to_list(FREE_NID_LIST)

This patch makes nat cache lookup and free nid list operation being atomical
to avoid this race condition.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-24 15:10:50 -04:00
Yunlei He
5f4c3dec22 f2fs: use set_page_private marcro in f2fs_trace_pid
Use set_page_private marcro instead of operte page struct
directly

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-24 15:10:49 -04:00
Chao Yu
9897159a7b f2fs: fix recording invalid last_victim
When doing garbage collection, we try to record segment offset which
locates at next one of last victim, using it as the start offset in
next searching.

But in some corner cases, recorded offset may cross the end of main
segment area, it will cause incorrectly searching in dirty_segmap
bitmap. This patch adds modular operation to avoid this issue.

Reported-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-24 15:10:48 -04:00
Kinglong Mee
8f73cbb7d4 f2fs: more reasonable mem_size calculating of ino_entry
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:37 -04:00
Kinglong Mee
70874fb34f f2fs: calculate the f2fs_stat_info into base_mem
The memory size of f2fs_stat_info also should be calculated.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:36 -04:00
Kinglong Mee
684ca7e55d f2fs: avoid stat_inc_atomic_write for non-atomic file
After filemap_write_and_wait_range fail, the FI_ATOMIC_FILE flags is removed,
so that f2fs should not increase the stat of atomic_write.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:36 -04:00
Kinglong Mee
c6f89dfd52 f2fs: sanity check of crc_offset from raw checkpoint
The crc_offset towards or beyond the end of block is wrong,
sanity check it.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:34 -04:00
Kinglong Mee
d03ba4cc3f f2fs: cleanup the disk level filename updating
As discuss with Jaegeuk and Chao,
"Once checkpoint is done, f2fs doesn't need to update there-in filename at all."

The disk-level filename is used only one case,
1. create a file A under a dir
2. sync A
3. godown
4. umount
5. mount (roll_forward)

Only the rename/cross_rename changes the filename, if it happens,
a. between step 1 and 2, the sync A will caused checkpoint, so that,
   the roll_forward at step 5 never happens.
b. after step 2, the roll_forward happens, file A will roll forward
   to the result as after step 1.

So that, any updating the disk filename is useless, just cleanup it.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:33 -04:00
Chao Yu
346fe752c4 f2fs: cover update_free_nid_bitmap with nid_list_lock
free_nid_bitmap and free_nid_count in update_free_nid_bitmap should be
updated atomically, use nid_list_lock cover them to avoid race in
concurrent scenario.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:32 -04:00
Kinglong Mee
a83d50bc16 f2fs: fix bad prefetchw of NULL page
For f2fs_read_data_pages, the f2fs_mpage_readpages gets "page == NULL",
so that, the prefetchw(&page->flags) is operated on NULL.

Fixes: f1e8866016 ("f2fs: expose f2fs_mpage_readpages")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:31 -04:00
Kinglong Mee
bd4667cb4b f2fs: clear FI_DATA_EXIST flag in truncate_inline_inode
Clear FI_DATA_EXIST flag atomically in truncate_inline_inode, and
the return value from truncate_inline_inode isn't used, remove it.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:31 -04:00
Kinglong Mee
d756386172 f2fs: move mnt_want_write_file after arguments checking
It's needless of mnt_want_write_file for arguments checking.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:30 -04:00
Kinglong Mee
46e82fb1b5 f2fs: check new size by inode_newsize_ok in f2fs_insert_range
The inode_newsize_ok is better than only checking the maxbytes,
eg. the rlimit etc.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:29 -04:00
Kinglong Mee
3cecfa5f67 f2fs: avoid copy date to user-space if move file range fail
If move file range return error, the data copied to user-space is duplicate.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:28 -04:00
Kinglong Mee
087d3d8bae f2fs: drop duplicate new_size assign in f2fs_zero_range
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:28 -04:00
Fan Li
8a6aa32502 f2fs: adjust the way of calculating nat block
use a slightly simpler expression to calculate nat block with nid.

Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:27 -04:00
Jaegeuk Kim
14b44d238c f2fs: add fault injection on f2fs_truncate
Inject a fault during f2fs_truncate().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:26 -04:00
Sheng Yong
1941d7bcb4 f2fs: check range before defragment
This patch checks the parameter range passed by ioctl to void that range
exceeds the max_file_blocks limit.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:25 -04:00
Sheng Yong
b0beab5016 f2fs: use parameter max_items instead of PIDVEC_SIZE
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:24 -04:00
Yunlei He
3d6a650feb f2fs: add a punch discard command function
This patch add a function to punch discard command if one segment
reuse before discard. Split this segment from multi-segments discard
range, and discard the left bigger range.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:23 -04:00
Jaegeuk Kim
c81abe34fe f2fs: allocate a bio for discarding when actually issuing it
Let's allocate a bio when issuing discard commands later.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:23 -04:00
Yunlei He
a29d0e0bc0 f2fs: skip writeback meta pages if cp_mutex acquire failed
Skip writeback meta pages if cp_mutex lock acquire failed, cp will
flush dirty pages instead.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:22 -04:00
Jaegeuk Kim
5ce4738a02 f2fs: show more precise message on orphan recovery failure
This case is not caused by fsck.f2fs. User needs to retry mount.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:21 -04:00
Kinglong Mee
fc7d5bc427 f2fs: remove dead macro PGOFS_OF_NEXT_DNODE
Fixes: 3cf4574705 ("f2fs: introduce get_next_page_offset to speed up SEEK_DATA")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:21 -04:00
Kinglong Mee
0b28b71e29 f2fs: drop duplicate radix tree lookup of nat_entry_set
The nat entry is listed from the set list for freeing,
it's duplicate to do radix tree lookup again.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
[Jaegeuk Kim: remove unnecessary f2fs_bug_on]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:20 -04:00
Kinglong Mee
20fda56b01 f2fs: make sure trace all f2fs_issue_flush
The root device's issue flush trace is missing,
add it and tracing the result from submit.

Fixes d50aaeec90 ("f2fs: show actual device info in tracepoints")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:19 -04:00
Chao Yu
8ff0971f15 f2fs: don't allow volatile writes for non-regular file
Now f2fs only supports volatile writes for journal db regular file.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:18 -04:00
Jaegeuk Kim
e811898c97 f2fs: don't allow atomic writes for not regular files
The atomic writes only supports regular files for database.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:17 -04:00
Jaegeuk Kim
8c242db9b8 f2fs: fix stale ATOMIC_WRITTEN_PAGE private pointer
When I forced to enable atomic operations intentionally, I could hit the below
panic, since we didn't clear page->private in f2fs_invalidate_page called by
file truncation.

The panic occurs due to NULL mapping having page->private.

BUG: unable to handle kernel paging request at ffffffffffffffff
IP: drop_buffers+0x38/0xe0
PGD 5d00c067
PUD 5d00e067
PMD 0
CPU: 3 PID: 1648 Comm: fsstress Tainted: G      D    OE   4.10.0+ #5
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: ffff9151952863c0 task.stack: ffffaaec40db4000
RIP: 0010:drop_buffers+0x38/0xe0
RSP: 0018:ffffaaec40db74c8 EFLAGS: 00010292
Call Trace:
 ? page_referenced+0x8b/0x170
 try_to_free_buffers+0xc5/0xe0
 try_to_release_page+0x49/0x50
 shrink_page_list+0x8bc/0x9f0
 shrink_inactive_list+0x1dd/0x500
 ? shrink_active_list+0x2c0/0x430
 shrink_node_memcg+0x5eb/0x7c0
 shrink_node+0xe1/0x320
 do_try_to_free_pages+0xef/0x2e0
 try_to_free_pages+0xe9/0x190
 __alloc_pages_slowpath+0x390/0xe70
 __alloc_pages_nodemask+0x291/0x2b0
 alloc_pages_current+0x95/0x140
 __page_cache_alloc+0xc4/0xe0
 pagecache_get_page+0xab/0x2a0
 grab_cache_page_write_begin+0x20/0x40
 get_read_data_page+0x2e6/0x4c0 [f2fs]
 ? f2fs_mark_inode_dirty_sync+0x16/0x30 [f2fs]
 ? truncate_data_blocks_range+0x238/0x2b0 [f2fs]
 get_lock_data_page+0x30/0x190 [f2fs]
 __exchange_data_block+0xaaf/0xf40 [f2fs]
 f2fs_fallocate+0x418/0xd00 [f2fs]
 vfs_fallocate+0x157/0x220
 SyS_fallocate+0x48/0x80

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Chao Yu: use INMEM_INVALIDATE for better tracing]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 22:34:10 -04:00
Jaegeuk Kim
aa51d08a11 f2fs: build stat_info before orphan inode recovery
f2fs_sync_fs() -> write_checkpoint() calls stat_inc_cp_count(sbi->stat_info),
which needs stat_info allocation.
Otherwise, we can hit:

[254042.598623]  ? count_shadow_nodes+0xa0/0xa0
[254042.598633]  f2fs_sync_fs+0x65/0xd0 [f2fs]
[254042.598645]  f2fs_balance_fs_bg+0xe4/0x1c0 [f2fs]
[254042.598657]  f2fs_write_node_pages+0x34/0x1a0 [f2fs]
[254042.598664]  ? pagevec_lookup_entries+0x1e/0x30
[254042.598673]  do_writepages+0x1e/0x30
[254042.598682]  __writeback_single_inode+0x45/0x330
[254042.598688]  writeback_single_inode+0xd7/0x190
[254042.598694]  write_inode_now+0x86/0xa0
[254042.598699]  iput+0x122/0x200
[254042.598709]  f2fs_fill_super+0xd4a/0x14d0 [f2fs]
[254042.598717]  mount_bdev+0x184/0x1c0
[254042.598934]  ? f2fs_commit_super+0x100/0x100 [f2fs]
[254042.599142]  f2fs_mount+0x15/0x20 [f2fs]
[254042.599349]  mount_fs+0x39/0x160
[254042.599554]  ? __alloc_percpu+0x15/0x20
[254042.599759]  vfs_kern_mount+0x67/0x110
[254042.599972]  do_mount+0x1bb/0xc80
[254042.600175]  ? memdup_user+0x42/0x60
[254042.600380]  SyS_mount+0x83/0xd0
[254042.600583]  entry_SYSCALL_64_fastpath+0x1e/0xad

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 16:52:16 -04:00
Kinglong Mee
10a875f82b f2fs: fix the fault of calculating blkstart twice
When the zone type is BLK_ZONE_TYPE_CONVENTIONAL, the blkstart is
calculated twice.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 16:52:16 -04:00
Kinglong Mee
e2f0e962ac f2fs: fix the fault of checking F2FS_LINK_MAX for rename inode
The parent directory's nlink will change, not the inode.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 16:52:16 -04:00
Jaegeuk Kim
4f1bca9f0d f2fs: don't allow to get pino when filename is encrypted
After renaming an encrypted file, we have no way to get its
encrypted filename from its dentry.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 16:52:16 -04:00
Jaegeuk Kim
8c1b3c0fb6 f2fs: fix wrong error injection for evict_inode
The previous one was not a proper location to inject an error, since there
is no point to get errors. Instead, we can emulate EIO during truncation,
and the below logic should handle it correctly.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 16:52:16 -04:00
Kinglong Mee
10047f537c f2fs: le32_to_cpu for ckpt->cp_pack_total_block_count
Fixes: 22ad0b6ab4 ("f2fs: add bitmaps for empty or full NAT blocks")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 16:52:16 -04:00
Jaegeuk Kim
b71deadbc4 f2fs: le16_to_cpu for xattr->e_value_size
This patch fixes missing le16 conversion, reported by kbuild test robot.

Fixes: 5f35a2cd5 ("f2fs: Don't update the xattr data that same as the exist")
Reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 16:52:16 -04:00
Jaegeuk Kim
4f295443bf f2fs: don't need to invalidate wrong node page
If f2fs_new_inode() is failed, the bad inode will invalidate 0'th node page
during f2fs_evict_inode(), which doesn't need to do.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 16:52:16 -04:00
Yunlei He
a78aaa2c3c f2fs: fix an error return value in truncate_partial_data_page
This patch fix a error return value in truncate_partial_data_page

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-21 16:52:16 -04:00
Chao Yu
7041d5d286 f2fs: combine nat_bits and free_nid_bitmap cache
Both nat_bits cache and free_nid_bitmap cache provide same functionality
as a intermediate cache between free nid cache and disk, but with
different granularity of indicating free nid range, and different
persistence policy. nat_bits cache provides better persistence ability,
and free_nid_bitmap provides better granularity.

In this patch we combine advantage of both caches, so finally policy of
the intermediate cache would be:
- init: load free nid status from nat_bits into free_nid_bitmap
- lookup: scan free_nid_bitmap before load NAT blocks
- update: update free_nid_bitmap in real-time
- persistence: udpate and persist nat_bits in checkpoint

This patch also resolves performance regression reported by lkp-robot.

commit:
  4ac912427c ("f2fs: introduce free nid bitmap")
  d00030cf9cd0bb96fdccc41e33d3c91dcbb672ba ("f2fs: use __set{__clear}_bit_le")
  1382c0f3f9d3f936c8bc42ed1591cf7a593ef9f7 ("f2fs: combine nat_bits and free_nid_bitmap cache")

4ac912427c d00030cf9cd0bb96fdccc41e33 1382c0f3f9d3f936c8bc42ed15
---------------- -------------------------- --------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
     77863 ±  0%      +2.1%      79485 ±  1%     +50.8%     117404 ±  0%  aim7.jobs-per-min
    231.63 ±  0%      -2.0%     227.01 ±  1%     -33.6%     153.80 ±  0%  aim7.time.elapsed_time
    231.63 ±  0%      -2.0%     227.01 ±  1%     -33.6%     153.80 ±  0%  aim7.time.elapsed_time.max
    896604 ±  0%      -0.8%     889221 ±  3%     -20.2%     715260 ±  1%  aim7.time.involuntary_context_switches
      2394 ±  1%      +4.6%       2503 ±  1%      +3.7%       2481 ±  2%  aim7.time.maximum_resident_set_size
      6240 ±  0%      -1.5%       6145 ±  1%     -14.1%       5360 ±  1%  aim7.time.system_time
   1111357 ±  3%      +1.9%    1132509 ±  2%      -6.2%    1041932 ±  2%  aim7.time.voluntary_context_switches
...

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Tested-by: Xiaolong Ye <xiaolong.ye@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-20 10:00:18 -04:00
Chao Yu
586d1492f3 f2fs: skip scanning free nid bitmap of full NAT blocks
This patch adds to account free nids for each NAT blocks, and while
scanning all free nid bitmap, do check count and skip lookuping in
full NAT block.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-20 10:00:17 -04:00
Jaegeuk Kim
23380b8568 f2fs: use __set{__clear}_bit_le
This patch uses __set{__clear}_bit_le for highter speed.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-20 10:00:16 -04:00
Jaegeuk Kim
9f7e4a2c49 f2fs: declare static functions
This is to avoid build warning reported by kbuild test robot.

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-20 10:00:15 -04:00
Jaegeuk Kim
720037f939 f2fs: don't overwrite node block by SSR
This patch fixes that SSR can overwrite previous warm node block consisting of
a node chain since the last checkpoint.

Fixes: 5b6c6be2d8 ("f2fs: use SSR for warm node as well")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-03-20 10:00:14 -04:00
Linus Torvalds
590dce2d49 Merge branch 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs 'statx()' update from Al Viro.

This adds the new extended stat() interface that internally subsumes our
previous stat interfaces, and allows user mode to specify in more detail
what kind of information it wants.

It also allows for some explicit synchronization information to be
passed to the filesystem, which can be relevant for network filesystems:
is the cached value ok, or do you need open/close consistency, or what?

From David Howells.

Andreas Dilger points out that the first version of the extended statx
interface was posted June 29, 2010:

    https://www.spinics.net/lists/linux-fsdevel/msg33831.html

* 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  statx: Add a system call to make enhanced file info available
2017-03-03 11:38:56 -08:00
David Howells
a528d35e8b statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.

The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode.  This change is propagated to the vfs_getattr*()
function.

Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.

========
OVERVIEW
========

The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.

A number of requests were gathered for features to be included.  The
following have been included:

 (1) Make the fields a consistent size on all arches and make them large.

 (2) Spare space, request flags and information flags are provided for
     future expansion.

 (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
     __s64).

 (4) Creation time: The SMB protocol carries the creation time, which could
     be exported by Samba, which will in turn help CIFS make use of
     FS-Cache as that can be used for coherency data (stx_btime).

     This is also specified in NFSv4 as a recommended attribute and could
     be exported by NFSD [Steve French].

 (5) Lightweight stat: Ask for just those details of interest, and allow a
     netfs (such as NFS) to approximate anything not of interest, possibly
     without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
     Dilger] (AT_STATX_DONT_SYNC).

 (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
     its cached attributes are up to date [Trond Myklebust]
     (AT_STATX_FORCE_SYNC).

And the following have been left out for future extension:

 (7) Data version number: Could be used by userspace NFS servers [Aneesh
     Kumar].

     Can also be used to modify fill_post_wcc() in NFSD which retrieves
     i_version directly, but has just called vfs_getattr().  It could get
     it from the kstat struct if it used vfs_xgetattr() instead.

     (There's disagreement on the exact semantics of a single field, since
     not all filesystems do this the same way).

 (8) BSD stat compatibility: Including more fields from the BSD stat such
     as creation time (st_btime) and inode generation number (st_gen)
     [Jeremy Allison, Bernd Schubert].

 (9) Inode generation number: Useful for FUSE and userspace NFS servers
     [Bernd Schubert].

     (This was asked for but later deemed unnecessary with the
     open-by-handle capability available and caused disagreement as to
     whether it's a security hole or not).

(10) Extra coherency data may be useful in making backups [Andreas Dilger].

     (No particular data were offered, but things like last backup
     timestamp, the data version number and the DOS archive bit would come
     into this category).

(11) Allow the filesystem to indicate what it can/cannot provide: A
     filesystem can now say it doesn't support a standard stat feature if
     that isn't available, so if, for instance, inode numbers or UIDs don't
     exist or are fabricated locally...

     (This requires a separate system call - I have an fsinfo() call idea
     for this).

(12) Store a 16-byte volume ID in the superblock that can be returned in
     struct xstat [Steve French].

     (Deferred to fsinfo).

(13) Include granularity fields in the time data to indicate the
     granularity of each of the times (NFSv4 time_delta) [Steve French].

     (Deferred to fsinfo).

(14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
     Note that the Linux IOC flags are a mess and filesystems such as Ext4
     define flags that aren't in linux/fs.h, so translation in the kernel
     may be a necessity (or, possibly, we provide the filesystem type too).

     (Some attributes are made available in stx_attributes, but the general
     feeling was that the IOC flags were to ext[234]-specific and shouldn't
     be exposed through statx this way).

(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
     Michael Kerrisk].

     (Deferred, probably to fsinfo.  Finding out if there's an ACL or
     seclabal might require extra filesystem operations).

(16) Femtosecond-resolution timestamps [Dave Chinner].

     (A __reserved field has been left in the statx_timestamp struct for
     this - if there proves to be a need).

(17) A set multiple attributes syscall to go with this.

===============
NEW SYSTEM CALL
===============

The new system call is:

	int ret = statx(int dfd,
			const char *filename,
			unsigned int flags,
			unsigned int mask,
			struct statx *buffer);

The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat().  There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.

Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):

 (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
     respect.

 (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
     its attributes with the server - which might require data writeback to
     occur to get the timestamps correct.

 (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
     network filesystem.  The resulting values should be considered
     approximate.

mask is a bitmask indicating the fields in struct statx that are of
interest to the caller.  The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat().  It should be noted that asking for
more information may entail extra I/O operations.

buffer points to the destination for the data.  This must be 256 bytes in
size.

======================
MAIN ATTRIBUTES RECORD
======================

The following structures are defined in which to return the main attribute
set:

	struct statx_timestamp {
		__s64	tv_sec;
		__s32	tv_nsec;
		__s32	__reserved;
	};

	struct statx {
		__u32	stx_mask;
		__u32	stx_blksize;
		__u64	stx_attributes;
		__u32	stx_nlink;
		__u32	stx_uid;
		__u32	stx_gid;
		__u16	stx_mode;
		__u16	__spare0[1];
		__u64	stx_ino;
		__u64	stx_size;
		__u64	stx_blocks;
		__u64	__spare1[1];
		struct statx_timestamp	stx_atime;
		struct statx_timestamp	stx_btime;
		struct statx_timestamp	stx_ctime;
		struct statx_timestamp	stx_mtime;
		__u32	stx_rdev_major;
		__u32	stx_rdev_minor;
		__u32	stx_dev_major;
		__u32	stx_dev_minor;
		__u64	__spare2[14];
	};

The defined bits in request_mask and stx_mask are:

	STATX_TYPE		Want/got stx_mode & S_IFMT
	STATX_MODE		Want/got stx_mode & ~S_IFMT
	STATX_NLINK		Want/got stx_nlink
	STATX_UID		Want/got stx_uid
	STATX_GID		Want/got stx_gid
	STATX_ATIME		Want/got stx_atime{,_ns}
	STATX_MTIME		Want/got stx_mtime{,_ns}
	STATX_CTIME		Want/got stx_ctime{,_ns}
	STATX_INO		Want/got stx_ino
	STATX_SIZE		Want/got stx_size
	STATX_BLOCKS		Want/got stx_blocks
	STATX_BASIC_STATS	[The stuff in the normal stat struct]
	STATX_BTIME		Want/got stx_btime{,_ns}
	STATX_ALL		[All currently available stuff]

stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.

Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution.  Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.

The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does.  The following
attributes map to FS_*_FL flags and are the same numerical value:

	STATX_ATTR_COMPRESSED		File is compressed by the fs
	STATX_ATTR_IMMUTABLE		File is marked immutable
	STATX_ATTR_APPEND		File is append-only
	STATX_ATTR_NODUMP		File is not to be dumped
	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs

Within the kernel, the supported flags are listed by:

	KSTAT_ATTR_FS_IOC_FLAGS

[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]

New flags include:

	STATX_ATTR_AUTOMOUNT		Object is an automount trigger

These are for the use of GUI tools that might want to mark files specially,
depending on what they are.

Fields in struct statx come in a number of classes:

 (0) stx_dev_*, stx_blksize.

     These are local system information and are always available.

 (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
     stx_size, stx_blocks.

     These will be returned whether the caller asks for them or not.  The
     corresponding bits in stx_mask will be set to indicate whether they
     actually have valid values.

     If the caller didn't ask for them, then they may be approximated.  For
     example, NFS won't waste any time updating them from the server,
     unless as a byproduct of updating something requested.

     If the values don't actually exist for the underlying object (such as
     UID or GID on a DOS file), then the bit won't be set in the stx_mask,
     even if the caller asked for the value.  In such a case, the returned
     value will be a fabrication.

     Note that there are instances where the type might not be valid, for
     instance Windows reparse points.

 (2) stx_rdev_*.

     This will be set only if stx_mode indicates we're looking at a
     blockdev or a chardev, otherwise will be 0.

 (3) stx_btime.

     Similar to (1), except this will be set to 0 if it doesn't exist.

=======
TESTING
=======

The following test program can be used to test the statx system call:

	samples/statx/test-statx.c

Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.

Here's some example output.  Firstly, an NFS directory that crosses to
another FSID.  Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.

	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:26           Inode: 1703937     Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000
	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

Secondly, the result of automounting on that directory.

	[root@andromeda ~]# /tmp/test-statx /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:27           Inode: 2           Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-03-02 20:51:15 -05:00
Ingo Molnar
174cd4b1e5 sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Linus Torvalds
25c4e6c3f0 for-f2fs-4.11
This round introduces several interesting features such as on-disk NAT bitmaps,
 IO alignment, and a discard thread. And it includes a couple of major bug fixes
 as below.
 
 == Enhancement ==
 - introduce on-disk bitmaps to avoid scanning NAT blocks when getting free nids
 - support IO alignment to prepare open-channel SSD integration in future
 - introduce a discard thread to avoid long latency during checkpoint and fstrim
 - use SSR for warm node and enable inline_xattr by default
 - introduce in-memory bitmaps to check FS consistency for debugging
 - improve write_begin by avoiding needless read IO
 
 == Bug fix ==
 - fix broken zone_reset behavior for SMR drive
 - fix wrong victim selection policy during GC
 - fix missing behavior when preparing discard commands
 - fix bugs in atomic write support and fiemap
 - workaround to handle multiple f2fs_add_link calls having same name
 
 And it includes a bunch of clean-up patches as well.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYtmdOAAoJEEAUqH6CSFDSs0UP/AzngT37xVIhVBD13J9oHIuv
 rFA/eHVGRJmU1xc4SG1bghKm45xq8rwUX7irarfvLLc5aL+6VPGSdaRBykUr4A5N
 MN/bgK//EPp7If8EF+8PpY+9x7g67i0mtz5iD8dDrK+bUKV/IDKV1LWw5pR3g/g6
 RwMH0dUVOiD/HJ5iFp1ykTdVPe4vFY013uVmyPxUq+nCBlqlQm1nOvrGjF/HeYyX
 kqcD2LEc79GPfS5ebQIKfCfLE0rsWVnnS6YaqlDNCD5/oRim71CUtA4MPTYv29vp
 R/SebWlayEm+u68+uQUu6AyIk/1IdP0+AtRuQd/VxuteoyXmkTMHER662DqN4F8J
 npPdNrbNdlzwuAP77avy+hplqbD19yUa7o7Fl1No5rfheT3CiNTSj2uoriyEAffH
 1AM6tES7S7n5ttrXOr9iOxrK0u/vuaf7fbKVtK+RI09hwzdvyGB5HUdQB0iP/XR+
 obw8dru79ISMVZ9YuDhSfjI5ohAcfthfuqgjUt2RAfDv19IRsg5eayAp3T6nUfEX
 AGQbV/52dkO9svZztMbcBW95zmqkE0cMeX66KIMCPXNuDiE474t8k115K6kHpFwP
 e4Kx+mTSNhR1LEAaVdmCjbLb0gVrumVHTdjaZopnxTFmE70u/M6h1vY90m1LkReF
 ZDK5mhfMmGzU4wkvbgP8
 =tw8c
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "This round introduces several interesting features such as on-disk NAT
  bitmaps, IO alignment, and a discard thread. And it includes a couple
  of major bug fixes as below.

  Enhancements:

   - introduce on-disk bitmaps to avoid scanning NAT blocks when getting
     free nids

   - support IO alignment to prepare open-channel SSD integration in
     future

   - introduce a discard thread to avoid long latency during checkpoint
     and fstrim

   - use SSR for warm node and enable inline_xattr by default

   - introduce in-memory bitmaps to check FS consistency for debugging

   - improve write_begin by avoiding needless read IO

  Bug fixes:

   - fix broken zone_reset behavior for SMR drive

   - fix wrong victim selection policy during GC

   - fix missing behavior when preparing discard commands

   - fix bugs in atomic write support and fiemap

   - workaround to handle multiple f2fs_add_link calls having same name

  ... and it includes a bunch of clean-up patches as well"

* tag 'for-f2fs-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (97 commits)
  f2fs: avoid to flush nat journal entries
  f2fs: avoid to issue redundant discard commands
  f2fs: fix a plint compile warning
  f2fs: add f2fs_drop_inode tracepoint
  f2fs: Fix zoned block device support
  f2fs: remove redundant set_page_dirty()
  f2fs: fix to enlarge size of write_io_dummy mempool
  f2fs: fix memory leak of write_io_dummy mempool during umount
  f2fs: fix to update F2FS_{CP_}WB_DATA count correctly
  f2fs: use MAX_FREE_NIDS for the free nids target
  f2fs: introduce free nid bitmap
  f2fs: new helper cur_cp_crc() getting crc in f2fs_checkpoint
  f2fs: update the comment of default nr_pages to skipping
  f2fs: drop the duplicate pval in f2fs_getxattr
  f2fs: Don't update the xattr data that same as the exist
  f2fs: kill __is_extent_same
  f2fs: avoid bggc->fggc when enough free segments are avaliable after cp
  f2fs: select target segment with closer temperature in SSR mode
  f2fs: show simple call stack in fault injection message
  f2fs: no need lock_op in f2fs_write_inline_data
  ...
2017-03-01 15:55:04 -08:00
Jaegeuk Kim
900f736251 f2fs: avoid to flush nat journal entries
This patch adds a missing condition which flushes nat journal entries
unnecessarily introduced by:

    f2fs: add bitmaps for empty or full NAT blocks

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-28 09:57:13 -08:00
Jaegeuk Kim
8b107f5b97 f2fs: avoid to issue redundant discard commands
If segs_per_sec is over 1 like under SMR, previously f2fs issues discard
commands redundantly on the same section, since we didn't move end position
for the previous discard command.

E.g.,

                       start  end
                         |    |
      prefree_bitmap = [01111100111100]

And, after issue discard for this section,
                             end      start
                              |        |
      prefree_bitmap = [01111100111100]

Select this section again by searching from (end + 1),
                             start  end
                                |   |
      prefree_bitmap = [01111100111100]

Fixes: 36abef4e79 ("f2fs: introduce mode=lfs mount option")
Cc: <stable@vger.kernel.org>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27 13:44:08 -08:00
Hou Pengyang
37e79cd31c f2fs: fix a plint compile warning
fix such pclint warning:
...
Loss of precision (arg. no. 2) (unsigned long long to unsigned int))

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27 10:51:21 -08:00
Hou Pengyang
b8d96a30b6 f2fs: add f2fs_drop_inode tracepoint
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27 10:51:20 -08:00
Masato Suzuki
7bb3a371d1 f2fs: Fix zoned block device support
The introduction of the multi-device feature partially broke the support
for zoned block devices. In the function f2fs_scan_devices, sbi->devs
allocation and initialization is skipped in the case of a single device
mount. This result in no device information structure being allocated
for the device. This is fine if the device is a regular device, but in
the case of a zoned block device, the device zone type array is not
initialized, which causes the function __f2fs_issue_discard_zone to fail
as get_blkz_type is unable to determine the zone type of a section.

Fix this by always allocating and initializing the sbi->devs device
information array even in the case of a single device if that device is
zoned. For this particular case, make sure to obtain a reference on the
single device so that the call to blkdev_put() in destroy_device_list
operates as expected.

Fixes: 3c62be17d4 ("f2fs: support multiple devices")
Cc: <stable@vger.kernel.org> # v4.10
Signed-off-by: Masato Suzuki <masato.suzuki@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27 10:40:12 -08:00
Yunlei He
4fcf589ad4 f2fs: remove redundant set_page_dirty()
This patch remove redundant set_page_dirty in truncate_blocks

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27 10:40:11 -08:00
Chao Yu
a3ebfe4fd8 f2fs: fix to enlarge size of write_io_dummy mempool
It needs to double cache size of write_io_dummy mempool, otherwise we may
run out of cache in scenraio of Data/Node IOs were issued concurrently.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27 10:40:10 -08:00
Chao Yu
b6895e8f99 f2fs: fix memory leak of write_io_dummy mempool during umount
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27 10:40:09 -08:00
Chao Yu
540faedb00 f2fs: fix to update F2FS_{CP_}WB_DATA count correctly
We should only account F2FS_{CP_}WB_DATA IOs for write path, fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27 10:40:08 -08:00
Kinglong Mee
f0cdbfe6ef f2fs: use MAX_FREE_NIDS for the free nids target
F2FS has define MAX_FREE_NIDS for maximum of cached free nids target.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-27 10:07:48 -08:00