This patch shows the fault injection mount option in
f2fs_show_options().
Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We use following method to calculate size with current page index:
size = index << PAGE_SHIFT
If type of index has only 32-bits size, left shifting will incur overflow,
which makes result incorrect.
So let's cast index with 64-bits type to avoid such issue.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, in getxattr we will load all entries both in inline xattr and
xattr node block, and then do the lookup in all entries, but our lookup
flow shows low efficiency, since if we can lookup and hit in inline xattr
of inode page cache first, we don't need to load and lookup xattr node
block, which can obviously save cpu time and IO latency.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: initialize NULL to avoid warning]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
A dead loop can be triggered in f2fs_fiemap() using the test case
as below:
...
fd = open();
fallocate(fd, 0, 0, 4294967296);
ioctl(fd, FS_IOC_FIEMAP, fiemap_buf);
...
It's caused by an overflow in __get_data_block():
...
bh->b_size = map.m_len << inode->i_blkbits;
...
map.m_len is an unsigned int, and bh->b_size is a size_t which is 64 bits
on 64 bits archtecture, type conversion from an unsigned int to a size_t
will result in an overflow.
In the above-mentioned case, bh->b_size will be zero, and f2fs_fiemap()
will call get_data_block() at block 0 again an again.
Fix this by adding a force conversion before left shift.
Signed-off-by: Wei Fang <fangwei1@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Sheng Yong reports needless preallocation if write(small_buffer, large_size)
is called.
In that case, f2fs preallocates large_size, but vfs returns early due to
small_buffer size. Let's detect it before preallocation phase in f2fs.
Reported-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds stat information for flush and discard commands.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a kernel thread to issue discard commands.
It proposes three states, D_PREP, D_SUBMIT, and D_DONE to identify current
bio status.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds discard_cmd_control with the existing discarding controls.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch simply cleans up the names for flush/discard commands.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a mirror for sit version bitmap, and use it to detect
in-memory bitmap corruption which may be caused by bit-transition of
cache or memory overflow.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a mirror for nat version bitmap, and use it to detect
in-memory bitmap corruption which may be caused by bit-transition of
cache or memory overflow.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a mirror for valid block bitmap, and use it to detect
in-memory bitmap corruption which may be caused by bit-transition of
cache or memory overflow.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a new flag to indicate inode status of doing atomic
write committing, so that, we can keep atomic write status for inode
during atomic committing, then we can skip GCing pages of atomic write inode,
that avoids random GCed datas being mixed with current transaction, so
isolation of transaction can be kept.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is no candidate to submit discard command during f2fs_trim_fs, let's
return without checkpoint.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The f2fs_trim_fs() doesn't need to do checkpoint if there are newly allocated
data blocks only which didn't change the critical checkpoint data such as nat
and sit entries.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlirP6wACgkQ8vlZVpUN
gaMwpQgApR67CxzlstxYjZpWPAqC8McJ2FBDX+mCOle5Vkc1WQDklwr0oCfQThTj
eDSFRhNfIvyPh0DJ589PxBCsWOqN5h6Si7hD5ZinomVNI+IL89OytaU5EV2OpWaW
iKdJgO9Tm8U7LuY6FOIoVdX57kUXVdkWoj61rC056B1SNhnNiVeofi7lYDM8Ix4q
IGSQ9W24iQKmCk4hCwgObhJBRK9RnlOH0GLUmpMaS+jnfnj/uePwdxWEFsPuCOob
8acAJ49lr55kjIw79E0BAyWxhEZ2aiArHk8PaWynT/DyNq3ftcapPlpftoeba8vo
glBJRX70QxPvt0iHEp0ykfExkhWhFA==
=Joki
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypt updates from Ted Ts'o:
"Various cleanups for the file system encryption feature"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
fscrypt: constify struct fscrypt_operations
fscrypt: properly declare on-stack completion
fscrypt: split supp and notsupp declarations into their own headers
fscrypt: remove redundant assignment of res
fscrypt: make fscrypt_operations.key_prefix a string
fscrypt: remove unused 'mode' member of fscrypt_ctx
ext4: don't allow encrypted operations without keys
fscrypt: make test_dummy_encryption require a keyring key
fscrypt: factor out bio specific functions
fscrypt: pass up error codes from ->get_context()
fscrypt: remove user-triggerable warning messages
fscrypt: use EEXIST when file already uses different policy
fscrypt: use ENOTDIR when setting encryption policy on nondirectory
fscrypt: use ENOKEY when file cannot be created w/o key
Previously, each filesystem configured without encryption support would
define all the public fscrypt functions to their notsupp_* stubs. This
list of #defines had to be updated in every filesystem whenever a change
was made to the public fscrypt functions. To make things more
maintainable now that we have three filesystems using fscrypt, split the
old header fscrypto.h into several new headers. fscrypt_supp.h contains
the real declarations and is included by filesystems when configured
with encryption support, whereas fscrypt_notsupp.h contains the inline
stubs and is included by filesystems when configured without encryption
support. fscrypt_common.h contains common declarations needed by both.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch relaxes async discard commands to avoid waiting its end_io during
checkpoint.
Instead of waiting them during checkpoint, it will be done when actually reusing
them.
Test on initial partition of nvme drive.
# time fstrim /mnt/test
Before : 6.158s
After : 4.822s
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
A test program gets the SEEK_DATA with two values between
a new created file and the exist file on f2fs filesystem.
F2FS filesystem, (the first "test1" is a new file)
SEEK_DATA size != 0 (offset = 8192)
SEEK_DATA size != 0 (offset = 4096)
PNFS filesystem, (the first "test1" is a new file)
SEEK_DATA size != 0 (offset = 4096)
SEEK_DATA size != 0 (offset = 4096)
int main(int argc, char **argv)
{
char *filename = argv[1];
int offset = 1, i = 0, fd = -1;
if (argc < 2) {
printf("Usage: %s f2fsfilename\n", argv[0]);
return -1;
}
/*
if (!access(filename, F_OK) || errno != ENOENT) {
printf("Needs a new file for test, %m\n");
return -1;
}*/
fd = open(filename, O_RDWR | O_CREAT, 0777);
if (fd < 0) {
printf("Create test file %s failed, %m\n", filename);
return -1;
}
for (i = 0; i < 20; i++) {
offset = 1 << i;
ftruncate(fd, 0);
lseek(fd, offset, SEEK_SET);
write(fd, "test", 5);
/* Get the alloc size by seek data equal zero*/
if (lseek(fd, 0, SEEK_DATA)) {
printf("SEEK_DATA size != 0 (offset = %d)\n", offset);
break;
}
}
close(fd);
return 0;
}
Reported-and-Tested-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes the renaming bug on encrypted filenames, which was pointed by
(ext4: don't allow encrypted operations without keys)
Cc: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to show the max number of atomic operations which are
conducting concurrently.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch implements IO alignment by filling dummy blocks in DATA and NODE
write bios. If we can guarantee, for example, 32KB or 64KB for such the IOs,
we can eliminate underlying dummy page problem which FTL conducts in order to
close MLC or TLC partial written pages.
Note that,
- it requires "-o mode=lfs".
- IO size should be power of 2, not exceed BIO_MAX_PAGES, 256.
- read IO is still 4KB.
- do checkpoint at fsync, if dummy NODE page was written.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If userspace issue a fstrim with a range not involve prefree segments,
it will reuse these segments without discard. This patch fix it.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If the range we write cover the whole valid data in the last page,
we do not need to read it.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
[Jaegeuk Kim: nullify the remaining area (fix: xfstests/f2fs/001)]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fix a problem of using memory after free
in function __try_merge_extent_node.
Fixes: 0f825ee6e8 ("f2fs: add new interfaces for extent tree")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We checked that "inode" is not an error pointer earlier so there is
no need to check again here.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we run out of memory, in cache_nat_entry, it's better to avoid loop
for allocating memory to cache nat entry, so in low memory scenario, for
read path of node block, I expect this can avoid unneeded latency.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch remove unused values in function recover_fsync_data
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All block device data fields and functions returning a number of 512B
sectors are by convention named xxx_sectors while names in the form
xxx_size are generally used for a number of bytes. The blk_queue_zone_size
and bdev_zone_size functions were not following this convention so rename
them.
No functional change is introduced by this patch.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Collapsed the two patches, they were nonsensically split and broke
bisection.
Signed-off-by: Jens Axboe <axboe@fb.com>
There was an unnecessary amount of complexity around requesting the
filesystem-specific key prefix. It was unclear why; perhaps it was
envisioned that different instances of the same filesystem type could
use different key prefixes, or that key prefixes could be binary.
However, neither of those things were implemented or really make sense
at all. So simplify the code by making key_prefix a const char *.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As part of an effort to clean up fscrypt-related error codes, make
attempting to create a file in an encrypted directory that hasn't been
"unlocked" fail with ENOKEY. Previously, several error codes were used
for this case, including ENOENT, EACCES, and EPERM, and they were not
consistent between and within filesystems. ENOKEY is a better choice
because it expresses that the failure is due to lacking the encryption
key. It also matches the error code returned when trying to open an
encrypted regular file without the key.
I am not aware of any users who might be relying on the previous
inconsistent error codes, which were never documented anywhere.
This failure case will be exercised by an xfstest.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull partial readlink cleanups from Miklos Szeredi.
This is the uncontroversial part of the readlink cleanup patch-set that
simplifies the default readlink handling.
Miklos and Al are still discussing the rest of the series.
* git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
vfs: make generic_readlink() static
vfs: remove ".readlink = generic_readlink" assignments
vfs: default to generic_readlink()
vfs: replace calling i_op->readlink with vfs_readlink()
proc/self: use generic_readlink
ecryptfs: use vfs_get_link()
bad_inode: add missing i_op initializers
needed for both ext4 and xfs dax changes to use iomap for DAX. It
also includes the fscrypt branch which is needed for ubifs encryption
work as well as ext4 encryption and fscrypt cleanups.
Lots of cleanups and bug fixes, especially making sure ext4 is robust
against maliciously corrupted file systems --- especially maliciously
corrupted xattr blocks and a maliciously corrupted superblock. Also
fix ext4 support for 64k block sizes so it works well on ppcle. Fixed
mbcache so we don't miss some common xattr blocks that can be merged.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlhQQVEACgkQ8vlZVpUN
gaN9TQgAoCD+V4kJjMCFhiV8u6QR3hqD6bOZbggo5wJf4CHglWkmrbAmc3jANOgH
CKsXDRRjxuDjPXf1ukB1i4M7ArLYjkbbzKdsu7lismoJLS+w8uwUKSNdep+LYMjD
alxUcf5DCzLlUmdOdW4yE22L+CwRfqfs8IpBvKmJb7DrAKiwJVA340ys6daBGuu1
63xYx0QIyPzq0xjqLb6TVf88HUI4NiGVXmlm2wcrnYd5966hEZd/SztOZTVCVWOf
Z0Z0fGQ1WJzmaBB9+YV3aBi+BObOx4m2PUprIa531+iEW02E+ot5Xd4vVQFoV/r4
NX3XtoBrT1XlKagy2sJLMBoCavqrKw==
=j4KP
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"This merge request includes the dax-4.0-iomap-pmd branch which is
needed for both ext4 and xfs dax changes to use iomap for DAX. It also
includes the fscrypt branch which is needed for ubifs encryption work
as well as ext4 encryption and fscrypt cleanups.
Lots of cleanups and bug fixes, especially making sure ext4 is robust
against maliciously corrupted file systems --- especially maliciously
corrupted xattr blocks and a maliciously corrupted superblock. Also
fix ext4 support for 64k block sizes so it works well on ppcle. Fixed
mbcache so we don't miss some common xattr blocks that can be merged"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
dax: Fix sleep in atomic contex in grab_mapping_entry()
fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL
fscrypt: Delay bounce page pool allocation until needed
fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page()
fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page()
fscrypt: Never allocate fscrypt_ctx on in-place encryption
fscrypt: Use correct index in decrypt path.
fscrypt: move the policy flags and encryption mode definitions to uapi header
fscrypt: move non-public structures and constants to fscrypt_private.h
fscrypt: unexport fscrypt_initialize()
fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info()
fscrypto: move ioctl processing more fully into common code
fscrypto: remove unneeded Kconfig dependencies
MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches
ext4: do not perform data journaling when data is encrypted
ext4: return -ENOMEM instead of success
ext4: reject inodes with negative size
ext4: remove another test in ext4_alloc_file_blocks()
Documentation: fix description of ext4's block_validity mount option
ext4: fix checks for data=ordered and journal_async_commit options
...
This patch series contains several performance tuning patches regarding to the
IO submission flow, in addition to supporting new features such as a ZBC-base
drive and multiple devices.
It also includes some major bug fixes such as:
- checkpoint version control
- fdatasync-related roll-forward recovery routine
- memory boundary or null-pointer access in corner cases
- missing error cases
It has various minor clean-up patches as well.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJYTx44AAoJEEAUqH6CSFDSnAQP/jeYJq5Zd0bweEF5g00Ec1Qg
qNKQ57e9EHDRaDLBUmHHEaCEPRL0bw6SOUUWWqzGA07KcsIK+Yb/dGAyIcuV7WMl
PjntVbYm4yARDYBHGupdOCzFSkzr8gDalb+98jJnoGUonsftljhES9jedQ1NjAms
GFPHDNtirZM/r0bjKkYKjpqJ6FCxFxcGPfb/GtohDajIpohWfKZiemaXGTgtYR4d
iBVek16h+Hprz90ycZBY69uz0TdAwu/gb+htMVBrAdExHWvlFzgp35OIywiAB/YX
3QD/x4t2HqOBaNYiiOAY4ukVW/Yyqa/ZAzbm+m5B5CAcFYiWXMy+cMXUY9HJJ/K0
wdvi//Avtvgpp2PVZFn2pASx14vgMFylBzuNgKpP6MPdtWTEL33jT7VYs9Nuz45E
dgZ9IpiDt4DeTRuZ4mPO5iH7bVHPvAVV80bpXzirCCzDeNZ1EFFIQzXh/2UAmCxI
twPXGBIYul0aIl9JkWAyhCZSd3XDSqedpfPudknjhzM9Xb1H5X0QJco7f/UwsWXH
WxV6lHr1Q7UH96wJ7x/GAqj8ArOAASRV18+K51dqU+DWHnFPpBArJe39FVf8NGWs
Fz1ZmlWBQ0ZgzvLkGa80llhjalXIEy/JabMrpy6VrzQGxHdmW4cVxe4dJ3710WxX
VysJUcNMRKxMUTWOKsxp
=Boum
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This patch series contains several performance tuning patches
regarding to the IO submission flow, in addition to supporting new
features such as a ZBC-base drive and multiple devices.
It also includes some major bug fixes such as:
- checkpoint version control
- fdatasync-related roll-forward recovery routine
- memory boundary or null-pointer access in corner cases
- missing error cases
It has various minor clean-up patches as well"
* tag 'for-f2fs-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (66 commits)
f2fs: fix a missing size change in f2fs_setattr
f2fs: fix to access nullified flush_cmd_control pointer
f2fs: free meta pages if sanity check for ckpt is failed
f2fs: detect wrong layout
f2fs: call sync_fs when f2fs is idle
Revert "f2fs: use percpu_counter for # of dirty pages in inode"
f2fs: return AOP_WRITEPAGE_ACTIVATE for writepage
f2fs: do not activate auto_recovery for fallocated i_size
f2fs: fix to determine start_cp_addr by sbi->cur_cp_pack
f2fs: fix 32-bit build
f2fs: set ->owner for debugfs status file's file_operations
f2fs: fix incorrect free inode count in ->statfs
f2fs: drop duplicate header timer.h
f2fs: fix wrong AUTO_RECOVER condition
f2fs: do not recover i_size if it's valid
f2fs: fix fdatasync
f2fs: fix to account total free nid correctly
f2fs: fix an infinite loop when flush nodes in cp
f2fs: don't wait writeback for datas during checkpoint
f2fs: fix wrong written_valid_blocks counting
...
Pull block layer updates from Jens Axboe:
"This is the main block pull request this series. Contrary to previous
release, I've kept the core and driver changes in the same branch. We
always ended up having dependencies between the two for obvious
reasons, so makes more sense to keep them together. That said, I'll
probably try and keep more topical branches going forward, especially
for cycles that end up being as busy as this one.
The major parts of this pull request is:
- Improved support for O_DIRECT on block devices, with a small
private implementation instead of using the pig that is
fs/direct-io.c. From Christoph.
- Request completion tracking in a scalable fashion. This is utilized
by two components in this pull, the new hybrid polling and the
writeback queue throttling code.
- Improved support for polling with O_DIRECT, adding a hybrid mode
that combines pure polling with an initial sleep. From me.
- Support for automatic throttling of writeback queues on the block
side. This uses feedback from the device completion latencies to
scale the queue on the block side up or down. From me.
- Support from SMR drives in the block layer and for SD. From Hannes
and Shaun.
- Multi-connection support for nbd. From Josef.
- Cleanup of request and bio flags, so we have a clear split between
which are bio (or rq) private, and which ones are shared. From
Christoph.
- A set of patches from Bart, that improve how we handle queue
stopping and starting in blk-mq.
- Support for WRITE_ZEROES from Chaitanya.
- Lightnvm updates from Javier/Matias.
- Supoort for FC for the nvme-over-fabrics code. From James Smart.
- A bunch of fixes from a whole slew of people, too many to name
here"
* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
blk-stat: fix a few cases of missing batch flushing
blk-flush: run the queue when inserting blk-mq flush
elevator: make the rqhash helpers exported
blk-mq: abstract out blk_mq_dispatch_rq_list() helper
blk-mq: add blk_mq_start_stopped_hw_queue()
block: improve handling of the magic discard payload
blk-wbt: don't throttle discard or write zeroes
nbd: use dev_err_ratelimited in io path
nbd: reset the setup task for NBD_CLEAR_SOCK
nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
nvme-fabrics: Add target support for FC transport
nvme-fabrics: Add host support for FC transport
nvme-fabrics: Add FC transport LLDD api definitions
nvme-fabrics: Add FC transport FC-NVME definitions
nvme-fabrics: Add FC transport error codes to nvme.h
Add type 0x28 NVME type code to scsi fc headers
nvme-fabrics: patch target code in prep for FC transport support
nvme-fabrics: set sqe.command_id in core not transports
parser: add u64 number parser
nvme-rdma: align to generic ib_event logging helper
...
This patch fix a missing size change in f2fs_setattr
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Rename the FS_CFLG_INPLACE_ENCRYPTION flag to FS_CFLG_OWN_PAGES which,
when set, indicates that the fs uses pages under its own control as
opposed to writeback pages which require locking and a bounce buffer for
encryption.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Multiple bugs were recently fixed in the "set encryption policy" ioctl.
To make it clear that fscrypt_process_policy() and fscrypt_get_policy()
implement ioctls and therefore their implementations must take standard
security and correctness precautions, rename them to
fscrypt_ioctl_set_policy() and fscrypt_ioctl_get_policy(). Make the
latter take in a struct file * to make it consistent with the former.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If .readlink == NULL implies generic_readlink().
Generated by:
to_del="\.readlink.*=.*generic_readlink"
for i in `git grep -l $to_del`; do sed -i "/$to_del"/d $i; done
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
f2fs_sync_file() remount_ro
- f2fs_readonly
- destroy_flush_cmd_control
- f2fs_issue_flush
- no fcc pointer!
So, this patch doesn't free fcc in this case, but just stop its kernel thread
which sends flush commands.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previous mkfs.f2fs allows small partition inappropriately, so f2fs should detect
that as well.
Refer this in f2fs-tools.
mkfs.f2fs: detect small partition by overprovision ratio and # of segments
Reported-and-Tested-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts commit 1beba1b3a9.
The perpcu_counter doesn't provide atomicity in single core and consume more
DRAM. That incurs fs_mark test failure due to ENOMEM.
Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should use AOP_WRITEPAGE_ACTIVATE when we bypass writing pages.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a file needs to keep its i_size by fallocate, we need to turn off auto
recovery during roll-forward recovery.
This will resolve the below scenario.
1. xfs_io -f /mnt/f2fs/file -c "pwrite 0 4096" -c "fsync"
2. xfs_io -f /mnt/f2fs/file -c "falloc -k 4096 4096" -c "fsync"
3. md5sum /mnt/f2fs/file;
4. godown /mnt/f2fs/
5. umount /mnt/f2fs/
6. mount -t f2fs /dev/sdx /mnt/f2fs
7. md5sum /mnt/f2fs/file
Reported-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't guarantee cp_addr is fixed by cp_version.
This is to sync with f2fs-tools.
Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The addition of multiple-device support broke CONFIG_BLK_DEV_ZONED
on 32-bit machines because of a 64-bit division:
fs/f2fs/f2fs.o: In function `__issue_discard_async':
extent_cache.c:(.text.__issue_discard_async+0xd4): undefined reference to `__aeabi_uldivmod'
Fortunately, bdev_zone_size() is guaranteed to return a power-of-two
number, so we can replace the % operator with a cheaper bit mask.
Fixes: 792b84b74b54 ("f2fs: support multiple devices")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The struct file_operations instance serving the f2fs/status debugfs file
lacks an initialization of its ->owner.
This means that although that file might have been opened, the f2fs module
can still get removed. Any further operation on that opened file, releasing
included, will cause accesses to unmapped memory.
Indeed, Mike Marshall reported the following:
BUG: unable to handle kernel paging request at ffffffffa0307430
IP: [<ffffffff8132a224>] full_proxy_release+0x24/0x90
<...>
Call Trace:
[] __fput+0xdf/0x1d0
[] ____fput+0xe/0x10
[] task_work_run+0x8e/0xc0
[] do_exit+0x2ae/0xae0
[] ? __audit_syscall_entry+0xae/0x100
[] ? syscall_trace_enter+0x1ca/0x310
[] do_group_exit+0x44/0xc0
[] SyS_exit_group+0x14/0x20
[] do_syscall_64+0x61/0x150
[] entry_SYSCALL64_slow_path+0x25/0x25
<...>
---[ end trace f22ae883fa3ea6b8 ]---
Fixing recursive fault but reboot is needed!
Fix this by initializing the f2fs/status file_operations' ->owner with
THIS_MODULE.
This will allow debugfs to grab a reference to the f2fs module upon any
open on that file, thus preventing it from getting removed.
Fixes: 902829aa0b ("f2fs: move proc files to debugfs")
Reported-by: Mike Marshall <hubcap@omnibond.com>
Reported-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
While calculating inode count that we can create at most in the left space,
we should consider space which data/node blocks occupied, since we create
data/node mixly in main area. So fix the wrong calculation in ->statfs.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If i_size is already valid during roll_forward recovery, we should not update
it according to the block alignment.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For below two cases, we can't guarantee data consistence:
a)
1. xfs_io "pwrite 0 4195328" "fsync"
2. xfs_io "pwrite 4195328 1024" "fdatasync"
3. godown
4. umount & mount
--> isize we updated before fdatasync won't be recovered
b)
1. xfs_io "pwrite -S 0xcc 0 4202496" "fsync"
2. xfs_io "fpunch 4194304 4096" "fdatasync"
3. godown
4. umount & mount
--> dnode we punched before fdatasync won't be recovered
The reason is that normally fdatasync won't be aware of modification
of metadata in file, e.g. isize changing, dnode updating, so in ->fsync
we will skip flushing node pages for above cases, result in making
fdatasynced file being lost during recovery.
Currently we have introduced DIRTY_META global list in sbi for tracking
dirty inode selectively, so in fdatasync we can choose to flush nodes
depend on dirty state of current inode in the list.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Thread A Thread B Thread C
- f2fs_create
- f2fs_new_inode
- f2fs_lock_op
- alloc_nid
alloc last nid
- f2fs_unlock_op
- f2fs_create
- f2fs_new_inode
- f2fs_lock_op
- alloc_nid
as node count still not
be increased, we will
loop in alloc_nid
- f2fs_write_node_pages
- f2fs_balance_fs_bg
- f2fs_sync_fs
- write_checkpoint
- block_operations
- f2fs_lock_all
- f2fs_lock_op
While creating new inode, we do not allocate and account nid atomically,
so that when there is almost no free nids left, we may encounter deadloop
like above stack.
In order to avoid that, reuse nm_i::available_nids for accounting free nids
and make nid allocation and counting being atomical during node creation.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Thread A Thread B
- write_checkpoint
- block_operations
-blk_start_plug
-sync_node_pages - f2fs_do_sync_file
- fsync_node_pages
- f2fs_wait_on_page_writeback
Thread A wait for global F2FS_DIRTY_NODES decreased to zero,
it start a plug list, some requests have been added to this list.
Thread B lock one dirty node page, and wait this page write back.
But this page has been in plug list of thread A with PG_writeback flag.
Thread A keep on running and its plug list has no chance to finish,
so it seems a deadlock between cp and fsync path.
This patch add a wait on page write back before set node page dirty
to avoid this problem.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Pengyang Hou <houpengyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Normally, while committing checkpoint, we will wait on all pages to be
writebacked no matter the page is data or metadata, so in scenario where
there are lots of data IO being submitted with metadata, we may suffer
long latency for waiting writeback during checkpoint.
Indeed, we only care about persistence for pages with metadata, but not
pages with data, as file system consistent are only related to metadate,
so in order to avoid encountering long latency in above scenario, let's
recognize and reference metadata in submitted IOs, wait writeback only
for metadatas.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, written_valid_blocks was got by ckpt->valid_block_count. But if
the last checkpoint has some NEW_ADDR due to power-cut, we can get wrong value.
Fix it to get the number from actual written block count from sit entries.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If many threads hit has_not_enough_free_secs() in f2fs_balance_fs() at the same
time, all the threads would do FG_GC or BG_GC.
In this critical path, we totally don't need to do BG_GC at all.
Let's avoid that.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In direct_IO path of f2fs_file_write_iter(),
1. f2fs_preallocate_blocks(F2FS_GET_BLOCK_PRE_DIO)
-> allocate LBA X
2. f2fs_direct_IO()
-> return 0;
Then,
f2fs_write_data_page() will allocate another LBA X+1.
This makes EIO triggered by HM-SMR.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch implements multiple devices support for f2fs.
Given multiple devices by mkfs.f2fs, f2fs shows them entirely as one big
volume under one f2fs instance.
Internal block management is very simple, but we will modify block allocation
and background GC policy to boost IO speed by exploiting them accoording to
each device speed.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can allow dio reads for LFS mode, while doing buffered writes for dio writes.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now we don't need to be too much careful about storage alignment for dio, since
its speed becomes quite fast and we'd better avoid any misalignment first.
Revert: 38aa0889b2 (f2fs: align direct_io'ed data to section)
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If one block has been to written to a new place, just return
in move data process. This patch check it again with holding
page lock.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
i_times of inode will be set with current system time which can be
configured through 'date', so it's not safe to judge dnode block as
garbage data or unchanged inode depend on i_times.
Now, we have used enhanced 'cp_ver + cp' crc method to verify valid
dnode block, so I expect recoverying invalid dnode is almost not
possible.
This reverts commit 807b1e1c8e.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we assigned CURSEG_WARM_DATA for direct_io, but if we have two or
four logs, we do not use that type at all.
Let's fix it.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Shouldn't update in-memory i_atime with on-disk i_mtime of inode when
recovering inode.
Shuoran found this bug which is hidden for a long time, honour is belong
to him.
Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should record updating status of inode only for living inode, for those
unlinked inode it needs to clear its ino cache, otherwise after the ino
was been reused, it will cause unneeded node page writing during ->fsync.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Similarly to the regular discard, trace zone reset events.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When a zoned block device is mounted, discarding sections
contained in sequential zones must reset the zone write pointer.
For sections contained in conventional zones, the regular discard
is used if the drive supports it.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With the zoned block device feature enabled, section discard
need to do a zone reset for sections contained in sequential
zones, and a regular discard (if supported) for sections
stored in conventional zones. Avoid the need for a costly
report zones to obtain a section zone type when discarding it
by caching the types of the device zones in the super block
information. This cache is initialized at mount time for mounts
with the zoned block device feature enabled.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The LFS mode is mandatory for host-managed zoned block devices as
update in place optimizations are not possible for segments in
sequential zones.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Zone write pointer reset acts as discard for zoned block
devices. So if the zoned block device feature is enabled,
always declare that discard is enabled, even if the device
does not actually support the command.
For the same reason, prevent the use the "nodicard" mount
option.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For zoned block devices, discard is replaced by zone reset. So
do not warn if the device does not supports discard.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The F2FS_FEATURE_BLKZONED feature indicates that the drive was formatted
with zone alignment optimization. This is optional for host-aware
devices, but mandatory for host-managed zoned block devices.
So check that the feature is set in this latter case.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
SMR stands for "Shingled Magnetic Recording" which makes sense
only for hard disk drives (spinning rust). The ZBC/ZAC standards
enable management of SMR disks, but solid state drives may also
support those standards. So rename the HMSMR feature to BLKZONED
to avoid a HDD centric terminology. For the same reason, rename
f2fs_sb_mounted_hmsmr to f2fs_sb_mounted_blkzoned.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Report error of f2fs_fill_dentries to ->iterate_shared, otherwise when
error ocurrs, user may just list part of dirents in target directory
without any hints.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
gcc is unsure about the use of last_ofs_in_node, which might happen
without a prior initialization:
fs/f2fs//git/arm-soc/fs/f2fs/data.c: In function ‘f2fs_map_blocks’:
fs/f2fs/data.c:799:54: warning: ‘last_ofs_in_node’ may be used uninitialized in this function [-Wmaybe-uninitialized]
if (prealloc && dn.ofs_in_node != last_ofs_in_node + 1) {
As pointed out by Chao Yu, the code is actually correct as 'prealloc'
is only set if the last_ofs_in_node has been set, the two always
get updated together.
This initializes last_ofs_in_node to dn.ofs_in_node for each
new dnode at the start of the 'next_block' loop, which at that
point is a correct initialization as well. I assume that compilers
that correctly track the contents of the variables and do not
warn about the condition also figure out that they can eliminate
the extra assignment here.
Fixes: 46008c6d42 ("f2fs: support in batch multi blocks preallocation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes percpu_count usage due to performance regression in iozone.
Fixes: 523be8a6b3 ("f2fs: use percpu_counter for page counters")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This is to avoid no free segment bug during checkpoint caused by a number of
dirty inodes.
The case was reported by Chao like this.
1. mount with lazytime option
2. fill 4k file until disk is full
3. sync filesystem
4. read all files in the image
5. umount
In this case, we actually don't need to flush dirty inode to inode page during
checkpoint.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If inode becomes dirty, we need to check the # of dirty inodes whether or not
further checkpoint would be required.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there are a lot of dirty inodes, we need to flush all of them when doing
checkpoint. So, we need to count this for enough free space.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch makes sure it returns a positive value instead of a probable
casted negative value as shrink count.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let build_free_nids support sync/async methods, in allocation flow of nids,
we use synchronuous method, so that we can avoid looping in alloc_nid when
free memory is low; in unblock_operations and f2fs_balance_fs_bg we use
asynchronuous method in where low memory condition can interrupt us.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
During free nid allocation, in order to do preallocation, we will tag free
nid entry as allocated one and still leave it in free nid list, for other
allocators who want to grab free nids, it needs to traverse the free nid
list for lookup. It becomes overhead in scenario of allocating free nid
intensively by multithreads.
This patch splits free nid list to two list: {free,alloc}_nid_list, to
keep free nids and preallocated free nids separately, after that, traverse
latency will be gone, besides split nid_cnt for separate statistic.
Additionally, introduce __insert_nid_to_list and __remove_nid_from_list for
cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: modify f2fs_bug_on to avoid needless branches]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't need to keep incomplete created inode in cache, so if we fail to
add link into directory during new inode creation, it's better to set
nlink of inode to zero, then we can evict inode immediately. Otherwise
release of nid belong to inode will be delayed until inode cache is being
shrunk, it may cause a seemingly endless loop while allocating free nids
in time of testing generic/269 case of fstest suit.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: add update_inode_page to fix kernel panic]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs contained a number of endianness conversion bugs.
Also, one function should have been 'static'.
Found with sparse by running 'make C=2 CF=-D__CHECK_ENDIAN__ fs/f2fs/'
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In fsync_node_pages, if f2fs was taged with CP_ERROR_FLAG, make sure bio
cache was flushed before return.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In order to avoid racing problem, make largest extent cache being updated
under lock.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs can support fallocating blocks beyond file size without changing the
size, but ->fiemap of f2fs was restricted and can't detect these extents
fallocated past EOF, now relieve the restriction.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_map_blocks, let f2fs_balance_fs detects node page modification
with dn.node_changed to avoid miss some corner cases.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_balance_fs should be called in between node page updating, otherwise
node page count will exceeded far beyond watermark of triggering
foreground garbage collection, result in facing high risk of hitting LFS
allocation failure.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is no dirty pages in inode, we should give a chance to detach
the inode from global dirty list, otherwise it needs to call another
unnecessary .writepages for detaching.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_fill_super, if there is any IO error occurs during recovery,
cached discard entries will be leaked, in order to avoid this, make
write_checkpoint() handle memory release by itself, besides, move
clear_prefree_segments to write_checkpoint for readability.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
During nid allocation, it needs to exclude building and allocating flow
of free nids, this is because while building free nid cache, there are two
steps: a) load free nids from unused nat entries in NAT pages, b) update
free nid cache by checking nat journal. The two steps should be atomical,
otherwise an used nid can be allocated as free one after a) and before b).
This patch adds missing lock which covers build_free_nids in
unlock_operation and f2fs_balance_fs_bg to avoid that.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the last ilen case, i was already increased, resulting in accessing out-
of-boundary entry of do_replace and blkaddr.
Fix to check ilen first to exit the loop.
Fixes: 2aa8fbb9693020 ("f2fs: refactor __exchange_data_block for speed up")
Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Avoid re-use of page index as tweak for AES-XTS when multiple parts of
same page are encrypted. This will happen on multiple (partial) calls of
fscrypt_encrypt_page on same page.
page->index is only valid for writeback pages.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Not all filesystems work on full pages, thus we should allow them to
hand partial pages to fscrypt for en/decryption.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add wbc_to_write_flags(), which returns the write modifier flags to use,
based on a struct writeback_control. No functional changes in this
patch, but it prepares us for factoring other wbc fields for write type.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields. This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.
In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.
Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits. Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
f2fs_gc. It was newly introduced in 4.9-rc1.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJYBoL+AAoJEEAUqH6CSFDSQEoP/irufW4HDUszwKCxISLGksBN
/i85zfbL9pY11Ci78qW4N5pegd2ouKk9WtUdrtXwT6y8Y+CWs9Gx5FaYbdA8aj7T
nQBfRlfN/zKIJkRqqqXh/YRTMyQfUticur0dWZ00l7wpGAxA6Xhe+QDV1T7rxQxZ
W8Ne9jcAD+SLu8G5Ci8zOTDcK+q6SWeXFNFtM1MPqr/S86PXiTRmlWFNcubACBi9
iqLl5moD++oYBJSU6sqxaKXvf27GhJMZUGp+upYT9lEIGF98vyhw+yXT64HYDvrW
lMJspz80m5CRT3Pz7nxG+1d6ctONbXwDjyBPeEINUlHi4DN07UJZCJUnUWyMYM8D
JffKq4DGhmsh0CJoX5pKjzxDhuHtrWwN+d/gMuUZrgC7oecXs0U5O2WR49oMGCs8
MMh6T/MUh1fWBI7F1Y3+OvyshU5XA9LCcNguu0T/RSKsjvqxkYTlpS/Zo1U/mpqv
ar34mlvLXlLQNj8gTYDNybq/ufBfS2Fzq+juWbXCaVZ4OV0rsu66D9aUInQ66Iy6
1EqjlLaqAwTF+DJsyGeAxNEn6tMPiqMVhK9LTT9dz5Uk5vK4FdvuDxqeFuwwTnCi
xYHdsZn/yhs7Nn6akwOclJayegDKiV3OVEG4ZyUHrXVR+njFTJyW20VcuUvN9RCn
AxHW5MI9jaU/U+O4Opbb
=1MpQ
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-4.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs bugfix from Jaegeuk Kim:
"This fixes a bug which referenced the wrong pointer, sum_page, in
f2fs_gc. It was newly introduced in 4.9-rc1.
* tag 'for-f2fs-4.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: fix wrong sum_page pointer in f2fs_gc
The mapping_set_error() helper sets the correct AS_ flag for the mapping
so there is no reason to open code it. Use the helper directly.
[akpm@linux-foundation.org: be honest about conversion from -ENXIO to -EIO]
Link: http://lkml.kernel.org/r/20160912111608.2588-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning
Pull vfs xattr updates from Al Viro:
"xattr stuff from Andreas
This completes the switch to xattr_handler ->get()/->set() from
->getxattr/->setxattr/->removexattr"
* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: Remove {get,set,remove}xattr inode operations
xattr: Stop calling {get,set,remove}xattr inode operations
vfs: Check for the IOP_XATTR flag in listxattr
xattr: Add __vfs_{get,set,remove}xattr helpers
libfs: Use IOP_XATTR flag for empty directory handling
vfs: Use IOP_XATTR flag for bad-inode handling
vfs: Add IOP_XATTR inode operations flag
vfs: Move xattr_resolve_name to the front of fs/xattr.c
ecryptfs: Switch to generic xattr handlers
sockfs: Get rid of getxattr iop
sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
kernfs: Switch to generic xattr handlers
hfs: Switch to generic xattr handlers
jffs2: Remove jffs2_{get,set,remove}xattr macros
xattr: Remove unnecessary NULL attribute name check
Pull misc vfs updates from Al Viro:
"Assorted misc bits and pieces.
There are several single-topic branches left after this (rename2
series from Miklos, current_time series from Deepa Dinamani, xattr
series from Andreas, uaccess stuff from from me) and I'd prefer to
send those separately"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
proc: switch auxv to use of __mem_open()
hpfs: support FIEMAP
cifs: get rid of unused arguments of CIFSSMBWrite()
posix_acl: uapi header split
posix_acl: xattr representation cleanups
fs/aio.c: eliminate redundant loads in put_aio_ring_file
fs/internal.h: add const to ns_dentry_operations declaration
compat: remove compat_printk()
fs/buffer.c: make __getblk_slow() static
proc: unsigned file descriptors
fs/file: more unsigned file descriptors
fs: compat: remove redundant check of nr_segs
cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
cifs: don't use memcpy() to copy struct iov_iter
get rid of separate multipage fault-in primitives
fs: Avoid premature clearing of capabilities
fs: Give dentry to inode_change_ok() instead of inode
fuse: Propagate dentry down to inode_change_ok()
ceph: Propagate dentry down to inode_change_ok()
xfs: Propagate dentry down to inode_change_ok()
...
These inode operations are no longer used; remove them.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJX9pA6AAoJEPL5WVaVDYGj7fwH/0YcdQWBg0O5d7iXFnTcimh9
fiYkqKniBWQhgBAOFPMoNPRIW4tyeQmTtu8Rywx2Hr+v4lzJvuOaT18NDANdq/pp
u5eDrnJ4R+uqPJlgxVOzopLVJ6I2glgSSRdvAKYxwTYcv8F88ObzVfsJ4M415gPq
cbEKF+JT3l5hTGENR5sqmYvHYaNfOFkOqt4gulPtgk1eshy+BH/05M+qBSeA5a6k
srdon0pFRoUV68m+T4G8FqOZxdybeT5Yx6X0GJf0eQJoX7IaiQTPcDrXzlrbDBbN
rrzbpwsDeDKtgSOckbarCBroZKdToHFekfnOJ7IPWYq8IwYTSnZKFCWIRKO6z38=
=IvhS
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Lots of bug fixes and cleanups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
ext4: remove unused variable
ext4: use journal inode to determine journal overhead
ext4: create function to read journal inode
ext4: unmap metadata when zeroing blocks
ext4: remove plugging from ext4_file_write_iter()
ext4: allow unlocked direct IO when pages are cached
ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY
fscrypto: use standard macros to compute length of fname ciphertext
ext4: do not unnecessarily null-terminate encrypted symlink data
ext4: release bh in make_indexed_dir
ext4: Allow parallel DIO reads
ext4: allow DAX writeback for hole punch
jbd2: fix lockdep annotation in add_transaction_credits()
blockgroup_lock.h: simplify definition of NR_BG_LOCKS
blockgroup_lock.h: remove debris from bgl_lock_ptr() conversion
fscrypto: make filename crypto functions return 0 on success
fscrypto: rename completion callbacks to reflect usage
fscrypto: remove unnecessary includes
fscrypto: improved validation when loading inode encryption metadata
ext4: fix memory leak when symlink decryption fails
...
injection facility. With this, we could fix several corner cases. And, in order
to improve the performance, we set inline_dentry by default and enhance the
exisiting discard issue flow. In addition, we added f2fs_migrate_page for better
memory management.
= Enhancement =
- set inline_dentry by default
- improve discard issue flow
- add more fault injection cases in f2fs
- allow block preallocation for encrypted files
- introduce migrate_page callback function
- avoid truncating the next direct node block at every checkpoint
= Bug fixes =
- set page flag correctly between write_begin and write_end
- missing error handling cases detected by fault injection
- preallocate blocks regarding to 4KB alignement correctly
- dentry and filename handling of encryption
- lost xattrs of directories
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJX9sMhAAoJEEAUqH6CSFDSFhQQAIQ99GkcaPmSACHg7JNa9zG1
wb6eeKIDee+Jr4vu7yQ++T3Ih4lesl2ZLABVaP+IcXlsYWI2VUvlChczuwVSDQMg
ZiBIR2IwXVVY6Zpb0xuw8C/vmQAJjLZTBV33s+wgsYHaTDobYexVUjkCM+pekrzj
HBXrk7zx8NHUh41yr/kVQl6FY8KPC6bTtBH23UUp6Vuy1zMZDR/VjL440IyT5Ded
JRSBX0XSAC9He6n+kZ4S2kMc11kmqZYW7mE4SmiPDzAhGwUv4SmQ1871lK00EOUp
5EN1Lcy8M7kkl8en2zpZ002R/LDbzRTYjb1fjGJVR+s5Q3piGokxtwAMd0/a7k9v
wwZm64Bm4NMHBEK6uc/DPWFUmnUySrboTvOCDRunNogPGTjMJwnzAQmTcB/Hdpr5
oAJQwyAq7ZzkMk3xt0ifeNqy+78uiwfpPEnZDoWqU6zxa+vIyqpFDD+8wEPBO9qo
JLRocH0Yl7+ExJvi+2W9wMQq9DsxZWR+CwUc8pg68E+1oOEycJ3weAwg5XSVHoNr
59I2blZQU6P922sH2HVhp0n58xZfYrR7Z3NSsiSfKXeL4gN222dHHT1UfRUmY+A3
7EeuYm8EUecKV0fZimMcqCCrUXQpubT+qGZfI6NZhu3Qhno1Y8ApxqH8Ieypx7ol
YD5prZs2qqVKO5LjLV5o
=crpN
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've investigated how f2fs deals with errors given by
our fault injection facility. With this, we could fix several corner
cases. And, in order to improve the performance, we set inline_dentry
by default and enhance the exisiting discard issue flow. In addition,
we added f2fs_migrate_page for better memory management.
Enhancements:
- set inline_dentry by default
- improve discard issue flow
- add more fault injection cases in f2fs
- allow block preallocation for encrypted files
- introduce migrate_page callback function
- avoid truncating the next direct node block at every checkpoint
Bug fixes:
- set page flag correctly between write_begin and write_end
- missing error handling cases detected by fault injection
- preallocate blocks regarding to 4KB alignement correctly
- dentry and filename handling of encryption
- lost xattrs of directories"
* tag 'for-f2fs-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (69 commits)
f2fs: introduce update_ckpt_flags to clean up
f2fs: don't submit irrelevant page
f2fs: fix to commit bio cache after flushing node pages
f2fs: introduce get_checkpoint_version for cleanup
f2fs: remove dead variable
f2fs: remove redundant io plug
f2fs: support checkpoint error injection
f2fs: fix to recover old fault injection config in ->remount_fs
f2fs: do fault injection initialization in default_options
f2fs: remove redundant value definition
f2fs: support configuring fault injection per superblock
f2fs: adjust display format of segment bit
f2fs: remove dirty inode pages in error path
f2fs: do not unnecessarily null-terminate encrypted symlink data
f2fs: handle errors during recover_orphan_inodes
f2fs: avoid gc in cp_error case
f2fs: should put_page for summary page
f2fs: assign return value in f2fs_gc
f2fs: add customized migrate_page callback
f2fs: introduce cp_lock to protect updating of ckpt_flags
...
While we call ->writepages, there are two cases:
a. we didn't writeout any dirty pages, since they are writebacked by other
thread concurrently.
b. we writeout dirty pages, and have already submitted bio to block layer.
In these cases, we don't need to do additional bio flushing unnecessarily,
it may split bio in cache into smaller one.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In sync_node_pages, we won't check and commit last merged pages in private
bio cache of f2fs, as these pages were taged as writeback, someone who is
waiting for writebacking of the page will be blocked until the cache was
committed by someone else.
We need to commit node type bio cache to avoid potential deadlock or long
delay of waiting writeback.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There exists almost same codes when get the value of pre_version
and cur_version in function validate_checkpoint, this patch adds
get_checkpoint_version to clean up redundant codes.
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to support checkpoint error injection in f2fs for testing
fatal error tolerance, it will be useful that it can simulate abnormal
power off by f2fs itself instead of calling godown ioctl by running apps.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In ->remount_fs, we didn't recover original fault injection config if
we encounter error, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Do fault injection initialization in default_options to keep consistent
with other default option configurating.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch remove redundant value definition in build_sit_entries
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we only support global fault injection configuration, so that
when we configure type/rate of fault injection through sysfs, mount
option, it will influence all f2fs partition which is being used.
It is not make sence, since it will be not convenient if developer want
to test separated partitions with different fault injection rate/type
simultaneously, also it's not possible to enable fault injection in one
partition and disable fault injection in other one.
>From now on, we move global configuration of fault injection in module
into per-superblock, hence injection testing can be more flexible.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When getting EIO while handling orphan inodes, we can get some dirty node
pages. Then, f2fs_write_node_pages() called by iput(node_inode) will try
to flush node pages. But in this case, we should prevent to do that, since
we will try again from the start.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Null-terminating the fscrypt_symlink_data on read is unnecessary because
it is not string data --- it contains binary ciphertext.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, we can hit
f2fs_bug_on(sbi, !PageUptodate(sum_page));
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should call put_page for preloaded summary pages in do_garbage_collect.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a return value of write_checkpoint for f2fs_gc.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch improves the migration of dirty pages and allows migrating atomic
written pages that F2FS uses in Page Cache. Instead of the fallback releasing
page path, it provides better performance for memory compaction, CMA and other
users of memory page migrating. For dirty pages, there is no need to write back
first when migrating. For an atomic written page before committing, we can
migrate the page and update the related 'inmem_pages' list at the same time.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix some coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces spinlock to protect updating process of ckpt_flags
field in struct f2fs_checkpoint, it avoids incorrectly updating in race
condition.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: add __is_set_ckpt_flags likewise __set_ckpt_flags]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Making updating of sbi flag atomic by using {test,set,clear}_bit,
otherwise in concurrency scenario, the flag could be updated incorrectly.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we used cp_version only to detect recoverable dnodes.
In order to avoid same garbage cp_version, we needed to truncate the next
dnode during checkpoint, resulting in additional discard or data write.
If we can distinguish this by using crc in addition to cp_version, we can
remove this overhead.
There is backward compatibility concern where it changes node_footer layout.
So, this patch introduces a new checkpoint flag, CP_CRC_RECOVERY_FLAG, to
detect new layout. New layout will be activated only when this flag is set.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
current_fs_time() uses struct super_block* as an argument.
As per Linus's suggestion, this is changed to take struct
inode* as a parameter instead. This is because the function
is primarily meant for vfs inode timestamps.
Also the function was renamed as per Arnd's suggestion.
Change all calls to current_fs_time() to use the new
current_time() function instead. current_fs_time() will be
deleted.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.
CURRENT_TIME is also not y2038 safe.
This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.
Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch allow preallocates data blocks for buffered aio writes
in encrypted file.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix to avoid BUG_ON]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to support IO error injection for testing IO error
tolerance of f2fs.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We treat all error in read_all_xattrs as a no memory error, which covers
the real reason of failure in it. Fix it by return correct errno in order
to reflect the real cause.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There is no more user of f2fs_filetype_table outside of dir.c, make it
static.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
When file permissions are modified via chmod(2) and the user is not in
the owning group or capable of CAP_FSETID, the setgid bit is cleared in
inode_change_ok(). Setting a POSIX ACL via setxattr(2) sets the file
permissions as well as the new ACL, but doesn't clear the setgid bit in
a similar way; this allows to bypass the check in chmod(2). Fix that.
References: CVE-2016-7097
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Several filename crypto functions: fname_decrypt(),
fscrypt_fname_disk_to_usr(), and fscrypt_fname_usr_to_disk(), returned
the output length on success or -errno on failure. However, the output
length was redundant with the value written to 'oname->len'. It is also
potentially error-prone to make callers have to check for '< 0' instead
of '!= 0'.
Therefore, make these functions return 0 instead of a length, and make
the callers who cared about the return value being a length use
'oname->len' instead. For consistency also make other callers check for
a nonzero result rather than a negative result.
This change also fixes the inconsistency of fname_encrypt() actually
already returning 0 on success, not a length like the other filename
crypto functions and as documented in its function comment.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
The macro f2fs_check_acl is defined but never used since
the initial commit, this patch removes the code that has
been dead for several years.
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When src and dst is the same file, and the latter part of source region
overlaps with the former part of destination region, current implement
will overwrite data which hasn't been moved yet and truncate data in
overlapped region.
This patch return -EINVAL when such cases occur and return 0 when
source region and destination region is actually the same part of
the same file.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, f2fs_write_begin sets PageUptodate all the time. But, when user
tries to update the entire page (i.e., len == PAGE_SIZE), we need to consider
that the page is able to be copied partially afterwards. In such the case,
we will lose the remaing region in the page.
This patch fixes this by setting PageUptodate in f2fs_write_end as given copied
result. In the short copy case, it returns zero to let generic_perform_write
retry copying user data again.
As a result, f2fs_write_end() works:
PageUptodate len copied return retry
1. no 4096 4096 4096 false -> return 4096
2. no 4096 1024 0 true -> goto #1 case
3. yes 2048 2048 2048 false -> return 2048
4. yes 2048 1024 1024 false -> return 1024
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
__exchange_data_block should take block indexes as parameters
instead of offsets in bytes.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When truncating cached inline_data, we don't need to allocate a new page
all the time. Instead, it must check its page cache only.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since setting an encryption policy requires writing metadata to the
filesystem, it should be guarded by mnt_want_write/mnt_drop_write.
Otherwise, a user could cause a write to a frozen or readonly
filesystem. This was handled correctly by f2fs but not by ext4. Make
fscrypt_process_policy() handle it rather than relying on the filesystem
to get it right.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't need to make zeros beyond i_size, since we already wrote that through
NEW_ADDR case.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Some applications may create multimeida file with temporary name like
'*.jpg.tmp' or '*.mp4.tmp', then rename to '*.jpg' or '*.mp4'.
Now, f2fs can only detect multimedia filename with specified format:
"filename + '.' + extension", so it will make f2fs missing to detect
multimedia file with special temporary name, result in failing to set
cold flag on file.
This patch enhances detection flow for enabling lookup extension in the
middle of temporary filename.
Reported-by: Xue Liu <liuxueliu.liu@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add roll-forward recovery process for encrypted dentry, so the first fsync
issued to an encrypted file does not need writing checkpoint.
This improves the performance of the following test at thousands of small
files: open -> write -> fsync -> close
Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: modify kernel message to show encrypted names]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch enhances the xattr consistency of dirs from suddern power-cuts.
Possible scenario would be:
1. dir->setxattr used by per-file encryption
2. file->setxattr goes into inline_xattr
3. file->fsync
In that case, we should do checkpoint for #1.
Otherwise we'd lose dir's key information for the file given #2.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Like most filesystems, f2fs will issue discard command synchronously, so
when user trigger fstrim through ioctl, multiple discard commands will be
issued serially with sync mode, which makes poor performance.
In this patch we try to support async discard, so that all discard
commands can be issued and be waited for endio in batch to improve
performance.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch sets encryption name flag in the add inline entry path
if filename is encrypted.
Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When decrypting dirents in ->readdir, fscrypt_fname_disk_to_usr won't
change content of original encrypted dirent, we don't need to allocate
additional buffer for storing mirror of it, so get rid of it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When creating new inode, security_inode_init_security will be called for
initializing security info related to the inode, and filename is passed to
security module, it helps security module such as SElinux to know which
rule or label could be applied for the inode with specified name.
Previously, if new inode is created as an encrypted one, f2fs will transfer
encrypted filename to security module which may fail the check of security
policy belong to the inode. So in order to this issue, alter to transfer
original unencrypted filename instead.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In batch discard approach of fstrim will grab/release gc_mutex lock
repeatly, it makes contention of the lock becoming more intensive.
So after one batch discards were issued in checkpoint and the lock
was released, it's better to do schedule() to increase opportunity
of grabbing gc_mutex lock for other competitors.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Make inline_dentry as default mount option to improve space usage and
IO performance in scenario of numerous small directory.
It adds noinline_dentry mount option, instead.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the following scenario,
1) we don't have the key and doing a lookup for encrypted file,
2) and the encrypted filename is big name
we should use fname->hash as name hash value instead of what is
calculated by fname->disk_name. Because in such case,
fname->disk_name is empty.
Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In write_begin(), we skip checking dnode block for preallocating block
when whole block needs to be updated since we preallocated its block in
f2fs_preallocate_blocks, for partial updated block, we will still try
to lock its node and do preallocation in write_begin(), so in
f2fs_preallocate_blocks we should not preallocate its block.
But previously, the calculation of preallocating block number is
incorrect, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix a bug]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fixes the following sparse warning:
fs/f2fs/data.c:969:12: warning:
symbol 'f2fs_grab_bio' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
`flags' is used to save value from userspace, there is no need to
initialize it, and FS_FL_USER_VISIBLE is the mask for getflags.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In available_free_memory, there are two same judgement conditions which
is used for checking NAT excess, remove one of them.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
During fstrim, if one of multiple write_checkpoint failed, break off and
return error number to caller.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we preallocate blocks with f2fs_reserve_blocks in f2fs_map_blocks, we
should call f2fs_balance_fs for checking and reclaiming space, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When building each sit entry in cache, firstly, we will load it from
sit page, and then check all entries in sit journal, if there is one
updated entry in journal, cover cached entry with the journaled one.
Actually, most of check operation is unneeded since we only need
to update cached entries with journaled entries in batch, so
changing the flow as below for more efficient:
1. load all sit entries into cache from sit pages;
2. update sit entries with journal.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch changes to check valid block number of one GCed section
directly instead of checking the number in all segments of section
one by one in order to clean up codes of foreground GC.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't guarantee integrity of user data after checkpoint, since we only
guarantee meta data integrity for data consistency of filesystem.
Due to above reason, we only need to set fs as dirty when meta data is
updated, so that we can skip writing checkpoint in some case of non-meta
data is updated.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch enables to do fstrim without checkpoint, if there is no fs
change.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch add discard block count to sys entry of f2fs status
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Thread A Thread B
- inode_lock fileA
- inode_lock fileB
- inode_lock fileA
- inode_lock fileB
We may encounter above potential deadlock during moving file range in
concurrent scenario. This patch fixes the issue by using inode_trylock
instead.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Only if two input files are regular files, we allow copying data in
range of them, otherwise, deny it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts commit a2ee0a3003.
When testing with generic/032 of xfstest suit, failure message will be
reported as below:
generic/032 8s ... [failed, exit status 1] - output mismatch (see results/generic/032.out.bad)
--- tests/generic/032.out 2015-01-11 16:52:27.643681072 +0800
+++ results/generic/032.out.bad 2016-08-06 13:44:43.861330500 +0800
@@ -1,5 +1,5 @@
QA output created by 032
-100 iterations
-0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
-*
-0100000
+1: [768..775]: unwritten
+Unwritten extents found!
...
(Run 'diff -u tests/generic/032.out results/generic/032.out.bad' to see the entire diff)
Ran: generic/032
Failures: generic/032
Failed 1 of 1 tests
In write_end(), we should update i_size of inode before unlock page,
otherwise, we will lose newly updated data in following race condition.
Thread A Thread B
- write_end
- unlock page
- writepages
- lock_page
- writepage
if page is out-of-range of file size,
we will skip writting the page.
- update i_size
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
LKP reported -36.3% regression of fsmark.files_per_sec due to this patch.
I've confirmed that fxmark [1] has also slight regression for DWAL.
[1] https://github.com/sslab-gatech/fxmark
This reverts commit ec795418c4.
Pull qstr constification updates from Al Viro:
"Fairly self-contained bunch - surprising lot of places passes struct
qstr * as an argument when const struct qstr * would suffice; it
complicates analysis for no good reason.
I'd prefer to feed that separately from the assorted fixes (those are
in #for-linus and with somewhat trickier topology)"
* 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
qstr: constify instances in adfs
qstr: constify instances in lustre
qstr: constify instances in f2fs
qstr: constify instances in ext2
qstr: constify instances in vfat
qstr: constify instances in procfs
qstr: constify instances in fuse
qstr constify instances in fs/dcache.c
qstr: constify instances in nfs
qstr: constify instances in ocfs2
qstr: constify instances in autofs4
qstr: constify instances in hfs
qstr: constify instances in hfsplus
qstr: constify instances in logfs
qstr: constify dentry_init_security
Merge 4fc29c1aa3 included this extra line, but it's not needed (or
useful) since we'll bio_set_op_attrs() right after to properly set
the op and flags for the bio.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull vfs updates from Al Viro:
"Assorted cleanups and fixes.
Probably the most interesting part long-term is ->d_init() - that will
have a bunch of followups in (at least) ceph and lustre, but we'll
need to sort the barrier-related rules before it can get used for
really non-trivial stuff.
Another fun thing is the merge of ->d_iput() callers (dentry_iput()
and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
except the one in __d_lookup_lru())"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
fs/dcache.c: avoid soft-lockup in dput()
vfs: new d_init method
vfs: Update lookup_dcache() comment
bdev: get rid of ->bd_inodes
Remove last traces of ->sync_page
new helper: d_same_name()
dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
vfs: clean up documentation
vfs: document ->d_real()
vfs: merge .d_select_inode() into .d_real()
unify dentry_iput() and dentry_unlink_inode()
binfmt_misc: ->s_root is not going anywhere
drop redundant ->owner initializations
ufs: get rid of redundant checks
orangefs: constify inode_operations
missed comment updates from ->direct_IO() prototype change
file_inode(f)->i_mapping is f->f_mapping
trim fsnotify hooks a bit
9p: new helper - v9fs_parent_fid()
debugfs: ->d_parent is never NULL or negative
...
replacing redundant inode page updates with mark_inode_dirty calls. And we tried
to reduce lock contentions as well to improve filesystem scalability.
Other feature is setting F2FS automatically when detecting host-managed SMR.
= Enhancement =
- ioctl to move a range of data between files
- inject orphan inode errors
- avoid flush commands congestion
- support lazytime
= Bug fixes =
- return proper results for some dentry operations
- fix deadlock in add_link failure
- disable extent_cache for fcollapse/finsert
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXmDJFAAoJEEAUqH6CSFDSJeYP/0ru8+5/ui5VTCdNPQB9KxYD
DIUaDGpeoLvmn3ZdrMEdyNr6kWbgjCE9JjOGPQ7l1/apErOGVPyaBwflKcCDwloU
pAlEqVM1Q9j4qH4i9SWTlvPtsHBHB7G7YSe3vDB9fJGSTqumubIlnaBm+Wfjx31U
p53WcPn9LpOyzfmvZf2tOHmvZ7bWLkE/a07x9kPC6XHUFb9C17jLRFFGeuhZQHv1
Yo7HgokBnPExa8TnEILYyX/x+eecFS/1Cp/cN0STsebSu8pStTHTcAP7qEpKQB88
Cc51Lf+d5gFeydxKDFxwdH3VWOGIr9Ppako+lHW83gJcHP0zw8zdxULab+HJMa4n
MOByRRiafwu1sL0dl7TCfsYNIHdEnXhWbhcRhMVZbb5C2Q6+Htuac8ZrKSOWExNN
DUqRkzeTib9u+cHxUTFFPgOGdUjDLmg3XHU7mvb+2hViluVjIImC4tqD5XPpv7vt
WnaDJxLCGD/6DF2yhiVY9NysuxInLTNFFCF06LworZ4L24hlg5TvN0UeUNRO9954
ux6f+lSORCzV3TmrsHP5vwjSAW26FviPXV1q1HHJeTpWKMlhsZtHmOAJOtZKKmxP
WFnHT0aiWF+sQf4qfxVQL+lLqtgRKJAI9zqGRyfDJWJp5aXdRuVsZs9pWNQF7lCo
5gVnCYk3ULjXG3b23j2S
=tKTR
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"The major change in this version is mitigating cpu overheads on write
paths by replacing redundant inode page updates with mark_inode_dirty
calls. And we tried to reduce lock contentions as well to improve
filesystem scalability. Other feature is setting F2FS automatically
when detecting host-managed SMR.
Enhancements:
- ioctl to move a range of data between files
- inject orphan inode errors
- avoid flush commands congestion
- support lazytime
Bug fixes:
- return proper results for some dentry operations
- fix deadlock in add_link failure
- disable extent_cache for fcollapse/finsert"
* tag 'for-f2fs-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (68 commits)
f2fs: clean up coding style and redundancy
f2fs: get victim segment again after new cp
f2fs: handle error case with f2fs_bug_on
f2fs: avoid data race when deciding checkpoin in f2fs_sync_file
f2fs: support an ioctl to move a range of data blocks
f2fs: fix to report error number of f2fs_find_entry
f2fs: avoid memory allocation failure due to a long length
f2fs: reset default idle interval value
f2fs: use blk_plug in all the possible paths
f2fs: fix to avoid data update racing between GC and DIO
f2fs: add maximum prefree segments
f2fs: disable extent_cache for fcollapse/finsert inodes
f2fs: refactor __exchange_data_block for speed up
f2fs: fix ERR_PTR returned by bio
f2fs: avoid mark_inode_dirty
f2fs: move i_size_write in f2fs_write_end
f2fs: fix to avoid redundant discard during fstrim
f2fs: avoid mismatching block range for discard
f2fs: fix incorrect f_bfree calculation in ->statfs
f2fs: use percpu_rw_semaphore
...
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2
- most(?) of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
thp: fix comments of __pmd_trans_huge_lock()
cgroup: remove unnecessary 0 check from css_from_id()
cgroup: fix idr leak for the first cgroup root
mm: memcontrol: fix documentation for compound parameter
mm: memcontrol: remove BUG_ON in uncharge_list
mm: fix build warnings in <linux/compaction.h>
mm, thp: convert from optimistic swapin collapsing to conservative
mm, thp: fix comment inconsistency for swapin readahead functions
thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
shmem: split huge pages beyond i_size under memory pressure
thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
khugepaged: add support of collapse for tmpfs/shmem pages
shmem: make shmem_inode_info::lock irq-safe
khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
thp: extract khugepaged from mm/huge_memory.c
shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
shmem: add huge pages support
shmem: get_unmapped_area align huge page
shmem: prepare huge= mount option and sysfs knob
mm, rmap: account shmem thp pages
...
Vladimir has noticed that we might declare memcg oom even during
readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
OOMs. This gfp mask discrepancy is really unfortunate and easily
fixable. Drop page_cache_alloc_readahead() which only has one user and
outsource the gfp_mask logic into readahead_gfp_mask and propagate this
mask from __do_page_cache_readahead down to read_pages.
This alone would have only very limited impact as most filesystems are
implementing ->readpages and the common implementation mpage_readpages
does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to
use readahead_gfp_mask instead as this function is called only during
readahead as well. The same applies to read_cache_pages.
ext4 has its own ext4_mpage_readpages but the path which has pages !=
NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are
doing a very similar pattern to mpage_readpages so the same can be
applied to them as well.
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.com: restrict gfp mask in mpage_alloc]
Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Chris Mason <clm@fb.com>
Cc: Steve French <sfrench@samba.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previous selected segment may become free after write_checkpoint,
if we do garbage collect on this segment, and then new_curseg happen
to reuse it, it may cause f2fs_bug_on as below.
panic+0x154/0x29c
do_garbage_collect+0x15c/0xaf4
f2fs_gc+0x2dc/0x444
f2fs_balance_fs.part.22+0xcc/0x14c
f2fs_balance_fs+0x28/0x34
f2fs_map_blocks+0x5ec/0x790
f2fs_preallocate_blocks+0xe0/0x100
f2fs_file_write_iter+0x64/0x11c
new_sync_write+0xac/0x11c
vfs_write+0x144/0x1e4
SyS_write+0x60/0xc0
Here, maybe we check sit and ssa type during reset_curseg. So, we check
segment is stale or not, and select a new victim to avoid this.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces. For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense. Any check for READA is replaced with an
explicit check for REQ_RAHEAD. Also remove the READA alias for
REQ_RAHEAD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It's enough to show BUG or WARN by f2fs_bug_on for error case.
Then, we don't need to remain corrupted filesystem.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When fs utilization is almost full, f2fs_sync_file should do checkpoint if
there is not enough space for roll-forward later. (i.e. space_for_roll_forward)
So, currently we have no lock for sbi->alloc_valid_block_count, resulting in
race condition.
In rare case, we can get -ENOSPC when doing roll-forward which triggers
if (is_valid_blkaddr(sbi, dest, META_POR)) {
if (src == NULL_ADDR) {
err = reserve_new_block(&dn);
f2fs_bug_on(sbi, err);
...
}
...
}
in do_recover_data.
So, this patch avoids that situation in advance.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to report the right error number of f2fs_find_entry to
its caller.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The default value of idle interval is 2 mins, but for most time when
screen shutdown, there are still operations during the 2 mins interval,
and gc's sleep time is about 30 secs to 60 secs, so there is almost no
chance for GC thread to do garbage collecting.
Set default value of idle interval value from 2 mins to 5 secs for
fixing.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch reverts 19a5f5e2ef (f2fs: drop any block plugging),
and adds blk_plug in write paths additionally.
The main reason is that blk_start_plug can be used to wake up from low-power
mode before submitting further bios.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Datas in file can be operated by GC and DIO simultaneously, so we will
face race case as below:
For write case:
Thread A Thread B
- generic_file_direct_write
- invalidate_inode_pages2_range
- f2fs_direct_IO
- do_blockdev_direct_IO
- do_direct_IO
- get_more_blocks
- f2fs_gc
- do_garbage_collect
- gc_data_segment
- move_data_page
- do_write_data_page
migrate data block to new block address
- dio_bio_submit
update user data to old block address
For read case:
Thread A Thread B
- generic_file_direct_write
- invalidate_inode_pages2_range
- f2fs_direct_IO
- do_blockdev_direct_IO
- do_direct_IO
- get_more_blocks
- f2fs_balance_fs
- f2fs_gc
- do_garbage_collect
- gc_data_segment
- move_data_page
- do_write_data_page
migrate data block to new block address
- write_checkpoint
- do_checkpoint
- clear_prefree_segments
- f2fs_issue_discard
discard old block adress
- dio_bio_submit
update user buffer from obsolete block address
In order to fix this, for one file, we should let DIO and GC getting exclusion
against with each other.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In 1TB storage, we need to admit 22841 prefree segments, which can consume
too much segments.
This patch sets 8GB in max. prefree segments in that case.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This is to fix wrong error pointer handling flow reported by Dan.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With below test steps, f2fs will issue redundant discard when doing fstrim,
the reason is that we issue discards for both prefree segments and
consecutive freed region user wants to trim, part regions they covered are
overlapped, here, we change to do not to issue any discards for prefree
segments in trimmed range.
1. mount -t f2fs -o discard /dev/zram0 /mnt/f2fs
2. fstrim -o 0 -l 3221225472 -m 2097152 -v /mnt/f2fs/
3. dd if=/dev/zero of=/mnt/f2fs/a bs=2M count=1
4. dd if=/dev/zero of=/mnt/f2fs/b bs=1M count=1
5. sync
6. rm /mnt/f2fs/a /mnt/f2fs/b
7. fstrim -o 0 -l 3221225472 -m 2097152 -v /mnt/f2fs/
Before:
<...>-5428 [001] ...1 9511.052125: f2fs_issue_discard: dev = (251,0), blkstart = 0x2200, blklen = 0x200
<...>-5428 [001] ...1 9511.052787: f2fs_issue_discard: dev = (251,0), blkstart = 0x2200, blklen = 0x300
After:
<...>-6764 [000] ...1 9720.382504: f2fs_issue_discard: dev = (251,0), blkstart = 0x2200, blklen = 0x300
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch skip discard block range smaller than trim_minlen,
and can not be merged by neighbour
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As manual described, f_bfree indicates total free blocks in fs, in f2fs, it
includes two parts: visible free blocks and over-provision blocks. This
patch corrrects the calculation.
fsblkcnt_t f_bfree; /* free blocks in fs */
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds f2fs_set_page_dirty_nobuffer() copied from __set_page_dirty_buffer.
When appending 4KB blocks in f2fs on pmem with multiple cores, this improves the
overall performance.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When base_addr is NULL, there is no need to call kzfree,
it should return -ENOMEM directly. Additionally, it is
better to initialize variable 'error' with 0.
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we fail to move data page during foreground GC, we should give another
chance to writeback that page which was set dirty previously by writer.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In procedure of synchonized read, after sending out the read request, reader
will try to lock the page for waiting device to finish the read jobs and
unlock the page, but meanwhile, truncater will race with reader, so after
reader get lock of the page, it should check page's mapping to detect
whether someone has truncated the page in advance, then reader has the
chance to do the retry if truncation was done, otherwise read can be failed
due to previous condition check.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For encrypted inode, if user overwrites data of the inode, f2fs will read
encrypted data into page cache, and then do the decryption.
However reader can race with overwriter, and it will see encrypted data
which has not been decrypted by overwriter yet. Fix it by moving decrypting
work to background and keep page non-uptodated until data is decrypted.
Thread A Thread B
- f2fs_file_write_iter
- __generic_file_write_iter
- generic_perform_write
- f2fs_write_begin
- f2fs_submit_page_bio
- generic_file_read_iter
- do_generic_file_read
- lock_page_killable
- unlock_page
- copy_page_to_iter
hit the encrypted data in updated page
- lock_page
- fscrypt_decrypt_page
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The f2fs_map_blocks is very related to the performance, so let's avoid any
latency to read ahead node pages.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The readahead nat pages are more likely to be reclaimed quickly, so it'd better
to gather more free nids in advance.
And, let's keep some free nids as much as possible.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If dotdot directory is corrupted, its slot may be ocupied by another
file. In this case, dentry[1] is not the parent directory. Rename and
cross-rename will update the inode in dentry[1] incorrectly. This
patch finds dotdot dentry by name.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
[Jaegeuk Kim: remove wron bug_on]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
mkdir sync_dirty_inode
- init_inode_metadata
- lock_page(node)
- make_empty_dir
- filemap_fdatawrite()
- do_writepages
- lock_page(data)
- write_page(data)
- lock_page(node)
- f2fs_init_acl
- error
- truncate_inode_pages
- lock_page(data)
So, we don't need to truncate data pages in this error case, which will
be done by f2fs_evict_inode.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This mount option is to enable original log-structured filesystem forcefully.
So, there should be no random writes for main area.
Especially, this supports host-managed SMR device.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs, we don't need to keep block plugging for NODE and DATA writes, since
we already merged bios as much as possible.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There is a data race between allocate_data_block() and f2fs_sbumit_page_mbio(),
which incur unnecessary reversed bio submission.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Separate the op from the rq_flag_bits and have f2fs
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixed up fs/ext4/crypto.c
Signed-off-by: Jens Axboe <axboe@fb.com>
FI_DIRTY_INODE flag is not covered by inode page lock, so it can be unset
at any time like below.
Thread #1 Thread #2
- lock_page(ipage)
- update i_fields
- update i_size/i_blocks/and so on
- set FI_DIRTY_INODE
- reset FI_DIRTY_INODE
- set_page_dirty(ipage)
In this case, we can lose the latest i_field information.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, f2fs_write_data_pages() calls __f2fs_writepage() which calls
f2fs_write_data_page().
If f2fs_write_data_page() returns AOP_WRITEPAGE_ACTIVATE, __f2fs_writepage()
calls mapping_set_error(). But, this should not happen at every time, since
sometimes f2fs_write_data_page() tries to skip writing pages without error.
For example, volatile_write() gives EIO all the time, as Shuoran Liu pointed
out.
Reported-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now we can report an error to f2fs_lookup given by f2fs_find_entry.
Suggested-by: He YunLei <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit aaf9607516 ("f2fs: check node page
contents all the time") pointed out that "sometimes it was reported that
its contents was missing", so it checks the page's mapping and contents.
When "nid != nid_of_node(page)", ERR_PTR(-EIO) will be returned to the
caller. However, commit e1c51b9f1d ("f2fs:
clean up node page updating flow") moves "nid != nid_of_node(page)" test
to "f2fs_bug_on(sbi, nid != nid_of_node(page))", this will return a
wrong page to the caller when F2FS_CHECK_FS is off when "sometimes it
was reported that its contents was missing" happens.
This patch restores to check node page contents all the time, and
returns the errno to make the caller known something is wrong and avoid
to use the page. This patch also moves f2fs_bug_on to its proper location.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is no cold page, we don't need to do a loop to flush dirty
data pages.
On /dev/pmem0,
1. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048 conv=fsync
Before : 1.1 GB/s
After : 1.2 GB/s
2. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048
Before : 2.2 GB/s
After : 2.3 GB/s
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For data pages, let's try to flush as much as possible in background.
On /dev/pmem0,
1. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048 conv=fsync
Before : 800 MB/s
After : 1.1 GB/s
2. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048
Before : 1.3 GB/s
After : 2.2 GB/s
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we get ENOMEM or EIO in f2fs_find_entry, we should stop right away.
Otherwise, for example, we can get duplicate directory entry by ->chash and
->clevel.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes writepages lock.
We can improve multi-threading performance.
tiobench, 32 threads, 4KB write per fsync on SSD
Before: 25.88 MB/s
After: 28.03 MB/s
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If flush commands do not incur any congestion, we don't need to throw that to
dispatching queue which causes unnecessary latency.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch reduces to call them across the whole tree.
- sync_inode_page()
- update_inode_page()
- update_inode()
- f2fs_write_inode()
Instead, checkpoint will flush all the dirty inode metadata before syncing
node pages.
Note that, this is doable, since we call mark_inode_dirty_sync() for all
inode's field change which needs to update on-disk inode as well.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch calls mark_inode_dirty_sync() for the following on-disk inode
changes.
-> largest
-> ctime/mtime/atime
-> i_current_depth
-> i_xattr_nid
-> i_pino
-> i_advise
-> i_flags
-> i_mode
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_i_links_write() to call mark_inode_dirty_sync() when
changing inode->i_links.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_i_blocks_write() to call mark_inode_dirty_sync() when
changing inode->i_blocks.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
it's not needed for file_operations of inodes located on fs defined
in the hosting module and for file_operations that go into procfs.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- fs-specific prefix for fscrypto
- fault injection facility
- expose validity bitmaps for user to be aware of fragmentation
- fallocate/rm/preallocation speed up
- use percpu counters
Bug fixes
- some inline_dentry/inline_data bugs
- error handling for atomic/volatile/orphan inodes
- recover broken superblock
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXQPu4AAoJEEAUqH6CSFDSILgP/1dj6fmtytr8c+55EBqXUGpo
M7rS93JTxlmU5BduIo9psJsEquTQoVEmxB/Gjd+ZnI5R6Rp1c/REaP0ba374rEhZ
ecMQh5QqzM1gRNFXrQhWFEL/KtfRqt3T80zebQP7pxFUm/m9NGMLWT43RzQ8AAhr
Y3P0NLdvxA4HAnipKptkPJcGZQlWnL9W/MR+LgsXLXqLDwJHkVu61GcF0y2ibcJM
lEtIRmyH5tg7hP5c5LTw9pKQFHkIZt5cHFLjrJ1x8FSm2TXOcJPbjOrThvcb+NKK
e0O+6R0meH2eMpak+BTkZp2YbPPyXOb1N00j//lmbPjCoJPd4ZuiJ+oRoHUlTxtU
FhO67t0brlDbMFQVRFrtv8VA8M6by+DTAAP3Ffx62I/TJkphKANCSoyQRhlWtxxO
kRU69N7ipnRNxO4WCv40FjaQjSIElCKysP1POazRmAOQm7UFTGT9Nj37+eqUcEPJ
HZ7O61DEHNemb0SMlJ8WSClstt0yUU+2cjRfTPAr2Wd3V8gYbRs0QUg5M2GLgywR
EmiJfpkXse3f/nR8W6g1hganSOXA0AZX+EUibed6VkV3oYemdFbm8OymeEmLmWpM
y2F3D7dPLW7MCoTXJqtwFWdoDwI+zkH4rJaPGTq5TVBRWVU/njX8OvoB47pOvKV1
kccL7zv2PekE1hSDO5WF
=6MSp
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, as Ted pointed out, fscrypto allows one more key prefix
given by filesystem to resolve backward compatibility issues. Other
than that, we've fixed several error handling cases by introducing
a fault injection facility. We've also achieved performance
improvement in some workloads as well as a bunch of bug fixes.
Summary:
Enhancements:
- fs-specific prefix for fscrypto
- fault injection facility
- expose validity bitmaps for user to be aware of fragmentation
- fallocate/rm/preallocation speed up
- use percpu counters
Bug fixes:
- some inline_dentry/inline_data bugs
- error handling for atomic/volatile/orphan inodes
- recover broken superblock"
* tag 'for-f2fs-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (73 commits)
f2fs: fix to update dirty page count correctly
f2fs: flush pending bios right away when error occurs
f2fs: avoid ENOSPC fault in the recovery process
f2fs: make exit_f2fs_fs more clear
f2fs: use percpu_counter for total_valid_inode_count
f2fs: use percpu_counter for alloc_valid_block_count
f2fs: use percpu_counter for # of dirty pages in inode
f2fs: use percpu_counter for page counters
f2fs: use bio count instead of F2FS_WRITEBACK page count
f2fs: manipulate dirty file inodes when DATA_FLUSH is set
f2fs: add fault injection to sysfs
f2fs: no need inc dirty pages under inode lock
f2fs: fix incorrect error path handling in f2fs_move_rehashed_dirents
f2fs: fix i_current_depth during inline dentry conversion
f2fs: correct return value type of f2fs_fill_super
f2fs: fix deadlock when flush inline data
f2fs: avoid f2fs_bug_on during recovery
f2fs: show # of orphan inodes
f2fs: support in batch fzero in dnode page
f2fs: support in batch multi blocks preallocation
...
init_f2fs_fs does:
1) f2fs_build_trace_ios
2) init_inodecache
3) create_node_manager_caches
4) create_segment_manager_caches
5) create_checkpoint_caches
6) create_extent_cache
7) kset_create_and_add
8) kobject_init_and_add
9) register_shrinker
10) register_filesystem
11) f2fs_create_root_stats
12) proc_mkdir
exit_f2fs_fs should do cleanup in the reverse order
to make the code more clear.
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull vfs cleanups from Al Viro:
"More cleanups from Christoph"
* 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
nfsd: use RWF_SYNC
fs: add RWF_DSYNC aand RWF_SYNC
ceph: use generic_write_sync
fs: simplify the generic_write_sync prototype
fs: add IOCB_SYNC and IOCB_DSYNC
direct-io: remove the offset argument to dio_complete
direct-io: eliminate the offset argument to ->direct_IO
xfs: eliminate the pos variable in xfs_file_dio_aio_write
filemap: remove the pos argument to generic_file_direct_write
filemap: remove pos variables in generic_file_read_iter
Backmerge to resolve a conflict in ovl_lookup_real();
"ovl_lookup_real(): use lookup_one_len_unlocked()" instead,
but it was too late in the cycle to rebase.
This patch introduces a new struct f2fs_fault_info and a global f2fs_fault
to save fault injection status. Fault injection entries are created in
/sys/fs/f2fs/fault_injection/ during initializing f2fs module.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix two bugs in error path of f2fs_move_rehashed_dirents:
- release dir's inode page if fail to call kmalloc
- recover i_current_depth if fail to converting
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With below steps, we will see that dentry page becoming unaccessable later.
This is because we forget updating i_current_depth in inode during inline
dentry conversion, after that, once we failed at somewhere, it will leave
i_current_depth as 0 in non-inline directory. Then, during ->lookup, the
current_depth value makes all dentry pages in first level invisible. Fix
it.
1) mount f2fs with inline_dentry option
2) mkdir dir
3) touch 180 files named [0-179] in dir
4) touch 180 in dir (fail after inline dir conversion)
5) ll dir
ls: cannot access /mnt/f2fs/dir/0: No such file or directory
ls: cannot access /mnt/f2fs/dir/1: No such file or directory
ls: cannot access /mnt/f2fs/dir/2: No such file or directory
ls: cannot access /mnt/f2fs/dir/3: No such file or directory
ls: cannot access /mnt/f2fs/dir/4: No such file or directory
drwxr-xr-x 2 root root 4096 may 13 21:47 ./
drwxr-xr-x 3 root root 4096 may 13 21:46 ../
-????????? ? ? ? ? ? 0
-????????? ? ? ? ? ? 1
-????????? ? ? ? ? ? 10
-????????? ? ? ? ? ? 100
-????????? ? ? ? ? ? 101
-????????? ? ? ? ? ? 102
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't need to use f2fs_bug_on() to treat with any error case when allocating
a block during recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to speedup fzero_range by making space preallocation and
address removal of blocks in one dnode page as in batch operation.
In virtual machine, with zram driver:
dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=4096
time xfs_io -f /mnt/f2fs/file -c "fzero 0 4096M"
Before:
real 0m3.276s
user 0m0.008s
sys 0m3.260s
After:
real 0m1.568s
user 0m0.000s
sys 0m1.564s
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: consider ENOSPC case]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces reserve_new_blocks to make preallocation of multi
blocks as in batch operation, so it can avoid lots of redundant
operation, result in better performance.
In virtual machine, with rotational device:
time fallocate -l 32G /mnt/f2fs/file
Before:
real 0m4.584s
user 0m0.000s
sys 0m4.580s
After:
real 0m0.292s
user 0m0.000s
sys 0m0.272s
In x86, with SSD:
time fallocate -l 500G $MNT/testfile
Before : 24.758 s
After : 1.604 s
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix bugs and add performance numbers measured in x86.]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
atomic/volatile ioctl interfaces are exposed to user like other file
operation interface, it needs to make them getting exclusion against
to each other to avoid potential conflict among these operations
in concurrent scenario.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In interfaces of ioctl, mnt_{want,drop}_write_file should be used for:
- get exclusion against file system freezing which may used by lvm
snapshot.
- do telling filesystem that a write is about to be performed on it, and
make sure that the writes are permitted.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously f2fs_preallocate_blocks() tries to allocate unaligned blocks.
In f2fs_write_begin(), however, prepare_write_begin() does not skip its
allocation due to (len != 4KB).
So, it needs locking node page twice unexpectedly.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch enables reading node blocks in advance when truncating large
data blocks.
> time rm $MNT/testfile (500GB) after drop_cachees
Before : 9.422 s
After : 4.821 s
Reported-by: Stephen Bates <stephen.bates@microsemi.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch is to improve the expand_inode speed in fallocate by allocating
data blocks as many as possible in single locked node page.
In SSD,
# time fallocate -l 500G $MNT/testfile
Before : 1m 33.410 s
After : 24.758 s
Reported-by: Stephen Bates <stephen.bates@microsemi.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When testing f2fs with inline_dentry option, generic/342 reports:
VFS: Busy inodes after unmount of dm-0. Self-destruct in 5 seconds. Have a nice day...
After rmmod f2fs module, kenrel shows following dmesg:
=============================================================================
BUG f2fs_inode_cache (Tainted: G O ): Objects remaining in f2fs_inode_cache on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
Disabling lock debugging due to kernel taint
INFO: Slab 0xf51ca0e0 objects=22 used=1 fp=0xd1e6fc60 flags=0x40004080
CPU: 3 PID: 7455 Comm: rmmod Tainted: G B O 4.6.0-rc4+ #16
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
00000086 00000086 d062fe18 c13a83a0 f51ca0e0 d062fe38 d062fea4 c11c7276
c1981040 f51ca0e0 00000016 00000001 d1e6fc60 40004080 656a624f 20737463
616d6572 6e696e69 6e692067 66326620 6e695f73 5f65646f 68636163 6e6f2065
Call Trace:
[<c13a83a0>] dump_stack+0x5f/0x8f
[<c11c7276>] slab_err+0x76/0x80
[<c11cbfc0>] ? __kmem_cache_shutdown+0x100/0x2f0
[<c11cbfc0>] ? __kmem_cache_shutdown+0x100/0x2f0
[<c11cbfe5>] __kmem_cache_shutdown+0x125/0x2f0
[<c1198a38>] kmem_cache_destroy+0x158/0x1f0
[<c176b43d>] ? mutex_unlock+0xd/0x10
[<f8f15aa3>] exit_f2fs_fs+0x4b/0x5a8 [f2fs]
[<c10f596c>] SyS_delete_module+0x16c/0x1d0
[<c1001b10>] ? do_fast_syscall_32+0x30/0x1c0
[<c13c59bf>] ? __this_cpu_preempt_check+0xf/0x20
[<c10afa7d>] ? trace_hardirqs_on_caller+0xdd/0x210
[<c10ad50b>] ? trace_hardirqs_off+0xb/0x10
[<c1001b81>] do_fast_syscall_32+0xa1/0x1c0
[<c176d888>] sysenter_past_esp+0x45/0x74
INFO: Object 0xd1e6d9e0 @offset=6624
kmem_cache_destroy f2fs_inode_cache: Slab cache still has objects
CPU: 3 PID: 7455 Comm: rmmod Tainted: G B O 4.6.0-rc4+ #16
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
00000286 00000286 d062fef4 c13a83a0 f174b000 d062ff14 d062ff28 c1198ac7
c197fe18 f3c5b980 d062ff20 000d04f2 d062ff0c d062ff0c d062ff14 d062ff14
f8f20dc0 fffffff5 d062e000 d062ff30 f8f15aa3 d062ff7c c10f596c 73663266
Call Trace:
[<c13a83a0>] dump_stack+0x5f/0x8f
[<c1198ac7>] kmem_cache_destroy+0x1e7/0x1f0
[<f8f15aa3>] exit_f2fs_fs+0x4b/0x5a8 [f2fs]
[<c10f596c>] SyS_delete_module+0x16c/0x1d0
[<c1001b10>] ? do_fast_syscall_32+0x30/0x1c0
[<c13c59bf>] ? __this_cpu_preempt_check+0xf/0x20
[<c10afa7d>] ? trace_hardirqs_on_caller+0xdd/0x210
[<c10ad50b>] ? trace_hardirqs_off+0xb/0x10
[<c1001b81>] do_fast_syscall_32+0xa1/0x1c0
[<c176d888>] sysenter_past_esp+0x45/0x74
The reason is: in recovery flow, we use delayed iput mechanism for directory
which has recovered dentry block. It means the reference of inode will be
held until last dirty dentry page being writebacked.
But when we mount f2fs with inline_dentry option, during recovery, dirent
may only be recovered into dir inode page rather than dentry page, so there
are no chance for us to release inode reference in ->writepage when
writebacking last dentry page.
We can call paired iget/iput explicityly for inline_dentry case, but for
non-inline_dentry case, iput will call writeback_single_inode to write all
data pages synchronously, but during recovery, ->writepages of f2fs skips
writing all pages, result in losing dirent.
This patch fixes this issue by obsoleting old mechanism, and introduce a
new dir_list to hold all directory inodes which has recovered datas until
finishing recovery.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch allows fscrypto to handle a second key prefix given by filesystem.
The main reason is to provide backward compatibility, since previously f2fs
used "f2fs:" as a crypto prefix instead of "fscrypt:".
Later, ext4 should also provide key_prefix() to give "ext4:".
One concern decribed by Ted would be kinda double check overhead of prefixes.
In x86, for example, validate_user_key consumes 8 ms after boot-up, which turns
out derive_key_aes() consumed most of the time to load specific crypto module.
After such the cold miss, it shows almost zero latencies, which treats as a
negligible overhead.
Note that request_key() detects wrong prefix in prior to derive_key_aes() even.
Cc: Ted Tso <tytso@mit.edu>
Cc: stable@vger.kernel.org # v4.6
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The following panic occurs when truncating inode which has inline
xattr to max filesize.
[<ffffffffa013d3be>] get_dnode_of_data+0x4e/0x580 [f2fs]
[<ffffffffa013aca1>] ? read_node_page+0x51/0x90 [f2fs]
[<ffffffffa013ad99>] ? get_node_page.part.34+0xb9/0x170 [f2fs]
[<ffffffffa01235b1>] truncate_blocks+0x131/0x3f0 [f2fs]
[<ffffffffa01238e3>] f2fs_truncate+0x73/0x100 [f2fs]
[<ffffffffa01239d2>] f2fs_setattr+0x62/0x2a0 [f2fs]
[<ffffffff811a72c8>] notify_change+0x158/0x300
[<ffffffff8118a42b>] do_truncate+0x6b/0xa0
[<ffffffff8118e539>] ? __sb_start_write+0x49/0x100
[<ffffffff8118a798>] do_sys_ftruncate.constprop.12+0x118/0x170
[<ffffffff8118a82e>] SyS_ftruncate+0xe/0x10
[<ffffffff8169efcf>] tracesys+0xe1/0xe6
[<ffffffffa0139ae0>] get_node_path+0x210/0x220 [f2fs]
<ffff880206a89ce8>
--[ end trace 5fea664dfbcc6625 ]---
The reason is truncate_blocks tries to truncate all node and data blocks
start from specified block offset with value of (max filesize / block
size), but actually, our valid max block offset is (max filesize / block
size) - 1, so f2fs detects such invalid block offset with BUG_ON in
truncation path.
This patch lets f2fs skip truncating data which is exceeding max
filesize.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently, generic_block_bmap is used in f2fs_bmap, its semantics is when
the mapping is been found, return position of target physical block,
otherwise return zero.
But, previously, when there is no mapping info for specified logical block,
f2fs_bmap will map target physical block to a uninitialized variable, which
should be wrong. Fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Even if an inode failed to release its blocks, it should be kept in an orphan
inode list, so it will be released later.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Restructure struct seg_entry to eliminate holes in it, after that,
in 32-bits machine, it reduces size from 32 bytes to 24 bytes; in
64-bits machine, it reduces size from 56 bytes to 40 bytes.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Each of fields in struct f2fs_xattr_entry will be assigned later,
so previously we don't need to memset the struct.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In find_fsync_dnodes, get_tmp_page will read dnode page synchronously,
previously, ra_meta_page did the same work, which is redundant, remove
it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch modifies to retry truncating node blocks in -ENOMEM case.
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes a couple of bugs regarding to orphan inodes when handling
errors.
This tries to
- call alloc_nid_done with add_orphan_inode in handle_failed_inode
- let truncate blocks in f2fs_evict_inode
- not make a bad inode due to i_mode change
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 28bc106b23 ("f2fs: support revoking atomic written pages")
forgot to clear page private flag correctly, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Private data in page should be removed during ->releasepage or
->invalidatepage, otherwise garbage data would be remained in that page.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The kiocb already has the new position, so use that. The only interesting
case is AIO, where we currently don't bother updating ki_pos. We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.
While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.
XXX: Will need a few additional audits for O_DSYNC
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually
work, so eliminate the superflous argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 57b62d29ad ("f2fs: fix to report
error in f2fs_readdir") causes f2fs_readdir to return -ENOENT when
get_lock_data_page returns -ENOENT. However, the original logic is to
continue when get_lock_data_page returns -ENOENT, but it forgets to
reset err to 0.
This will cause getdents64 incorretly return -ENOENT when lastdirent is
NULL in getdents64. This will lead to a wrong return value for syscall
caller.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For foreground GC, we cache node blocks in victim section and set them
dirty, then we call sync_node_pages to flush these node pages, but
meanwhile, those node pages which does not locate in victim section
will be flushed together, so more bandwidth and continuous free space
would be occupied.
So for this condition, it's better to leave those unrelated node page
in cache for further write hit, and let CP or VM to flush them afterward.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The filename length in dirent of may become zero-sized after random junk
data injection, once encounter such dirent, find_target_dentry or
f2fs_add_inline_entries will run into an infinite loop. So let f2fs being
aware of that to avoid deadloop.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Under direct IO path with O_(D)SYNC, it needs to set proper APPEND or UPDATE
flags, so taht f2fs_sync_file can make its data safe.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In order to give atomic writes, we should consider power failure during
sync_node_pages in fsync.
So, this patch marks fsync flag only in the last dnode block.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The fsync_node_pages should return pass or failure so that user could know
fsync is completed or not.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch splits the existing sync_node_pages into (f)sync_node_pages.
The fsync_node_pages is used for f2fs_sync_file only.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The first page of volatile writes usually contains a sort of header information
which will be used for recovery.
(e.g., journal header of sqlite)
If this is written without other journal data, user needs to handle the stale
journal information.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When fsync is called, sync_node_pages finds a proper direct node pages to flush.
But, it locks unrelated direct node pages together unnecessarily.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If somebody wrote some data before atomic writes, we should flush them in order
to handle atomic data in a right period.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch returns -E2BIG if there is no space to add an xattr entry.
This should fix generic/026 in xfstests as well.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch resolves the redundant condition check reported by David.
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The atomic/volatile operation should be done in pair of start and commit
ioctl.
For example, if a killed process remains open-ended atomic operation, we should
drop its flag as well as its atomic data. Otherwise, if sqlite initiates another
operation which doesn't require atomic writes, it will lose every data, since
f2fs still treats with them as atomic writes; nobody will trigger its commit.
Reported-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When one reader closes its file while the other writer is doing atomic writes,
f2fs_release_file drops atomic data resulting in an empty commit.
This patch fixes this wrong commit problem by checking openess of the file.
Process0 Process1
open file
start atomic write
write data
read data
close file
f2fs_release_file()
clear atomic data
commit atomic write
Reported-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds BUG_ON instead of retrying loop.
In the case of node pages, we already got this inode page, but unlocked it.
By the fact that we don't truncate any node pages in operations, the page's
mapping should be unchangeable.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, after trylock_page is succeeded, it doesn't check its mapping.
In order to fix that, we can just give PGP_LOCK to pagecache_get_page.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With below serials, we will lose parts of dirents:
1) mount f2fs with inline_dentry option
2) echo 1 > /sys/fs/f2fs/sdX/dir_level
3) mkdir dir
4) touch 180 files named [1-180] in dir
5) touch 181 in dir
6) echo 3 > /proc/sys/vm/drop_caches
7) ll dir
ls: cannot access 2: No such file or directory
ls: cannot access 4: No such file or directory
ls: cannot access 5: No such file or directory
ls: cannot access 6: No such file or directory
ls: cannot access 8: No such file or directory
ls: cannot access 9: No such file or directory
...
total 360
drwxr-xr-x 2 root root 4096 Feb 19 15:12 ./
drwxr-xr-x 3 root root 4096 Feb 19 15:11 ../
-rw-r--r-- 1 root root 0 Feb 19 15:12 1
-rw-r--r-- 1 root root 0 Feb 19 15:12 10
-rw-r--r-- 1 root root 0 Feb 19 15:12 100
-????????? ? ? ? ? ? 101
-????????? ? ? ? ? ? 102
-????????? ? ? ? ? ? 103
...
The reason is: when doing the inline dir conversion, we didn't consider
that directory has hierarchical hash structure which can be configured
through sysfs interface 'dir_level'.
By default, dir_level of directory inode is 0, it means we have one bucket
in hash table located in first level, all dirents will be hashed in this
bucket, so it has no problem for us to do the duplication simply between
inline dentry page and converted normal dentry page.
However, if we configured dir_level with the value N (greater than 0), it
will expand the bucket number of first level hash table by 2^N - 1, it
hashs dirents into different buckets according their hash value, if we
still move all dirents to first bucket, it makes incorrent locating for
inline dirents, the result is, although we can iterate all dirents through
->readdir, we can't stat some of them in ->lookup which based on hash
table searching.
This patch fixes this issue by rehashing dirents into correct position
when converting inline directory.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a sbi flag, SBI_NEED_SB_WRITE, which indicates it needs to
recover superblock when (re)mounting as RW. This is set only when f2fs is
mounted as RO.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When one of superblocks is missing, f2fs recovers it with the valid one.
But, even if f2fs is mounted as RO, we'd better notify that too.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes the issue introduced by the ext4 crypto fix in a same manner.
For F2FS, however, we flush the pending IOs and wait for a while to acquire free
memory.
Fixes: c9af28fdd4 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM")
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch synced with the below two ext4 crypto fixes together.
In 4.6-rc1, f2fs newly introduced accessing f_path.dentry which crashes
overlayfs. To fix, now we need to use file_dentry() to access that field.
Fixes: c0a37d4878 ("ext4: use file_dentry()")
Fixes: 9dd78d8c9a ("ext4: use dget_parent() in ext4_file_open()")
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull f2fs fixes from Jaegeuk Kim.
* tag 'f2fs-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: retrieve IO write stat from the right place
f2fs crypto: fix corrupted symlink in encrypted case
f2fs: cover large section in sanity check of super
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When get_acl() is called for an inode whose ACL is not cached yet, the
get_acl inode operation is called to fetch the ACL from the filesystem.
The inode operation is responsible for updating the cached acl with
set_cached_acl(). This is done without locking at the VFS level, so
another task can call set_cached_acl() or forget_cached_acl() before the
get_acl inode operation gets to calling set_cached_acl(), and then
get_acl's call to set_cached_acl() results in caching an outdate ACL.
Prevent this from happening by setting the cached ACL pointer to a
task-specific sentinel value before calling the get_acl inode operation.
Move the responsibility for updating the cached ACL from the get_acl
inode operations to get_acl(). There, only set the cached ACL if the
sentinel value hasn't changed.
The sentinel values are chosen to have odd values. Likewise, the value
of ACL_NOT_CACHED is odd. In contrast, ACL object pointers always have
an even value (ACLs are aligned in memory). This allows to distinguish
uncached ACLs values from ACL objects.
In addition, switch from guarding inode->i_acl and inode->i_default_acl
upates by the inode->i_lock spinlock to using xchg() and cmpxchg().
Filesystems that do not want ACLs returned from their get_acl inode
operations to be cached must call forget_cached_acl() to prevent the VFS
from doing so.
(Patch written by Al Viro and Andreas Gruenbacher.)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In the following patch,
f2fs: split journal cache from curseg cache
journal cache is split from curseg cache. So IO write statistics should be
retrived from journal cache but not curseg->sum_blk. Otherwise, it will
get 0, and the stat is lost.
Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the encrypted symlink case, we should check its corrupted symname after
decrypting it.
Otherwise, we can report -ENOENT incorrectly, if encrypted symname starts with
'\0'.
Cc: stable 4.5+ <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes the bug which does not cover a large section case when checking
the sanity of superblock.
If f2fs detects misalignment, it will fix the superblock during the mount time,
so it doesn't need to trigger fsck.f2fs further.
Reported-by: Matthias Prager <linux@matthiasprager.de>
Reported-by: David Gnedt <david.gnedt@davizone.at>
Cc: stable 4.5+ <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If many threads calls fsync with data writes, we don't need to flush every
bios having node page writes.
The f2fs_wait_on_page_writeback will flush its bios when the page is really
needed.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The f2fs_setxattr() prototype for CONFIG_F2FS_FS_XATTR=n has
been wrong for a long time, since 8ae8f1627f ("f2fs: support
xattr security labels"), but there have never been any callers,
so it did not matter.
Now, the function gets called from f2fs_ioc_keyctl(), which
causes a build failure:
fs/f2fs/file.c: In function 'f2fs_ioc_keyctl':
include/linux/stddef.h:7:14: error: passing argument 6 of 'f2fs_setxattr' makes integer from pointer without a cast [-Werror=int-conversion]
#define NULL ((void *)0)
^
fs/f2fs/file.c:1599:27: note: in expansion of macro 'NULL'
value, F2FS_KEY_SIZE, NULL, type);
^
In file included from ../fs/f2fs/file.c:29:0:
fs/f2fs/xattr.h:129:19: note: expected 'int' but argument is of type 'void *'
static inline int f2fs_setxattr(struct inode *inode, int index,
^
fs/f2fs/file.c:1597:9: error: too many arguments to function 'f2fs_setxattr'
return f2fs_setxattr(inode, F2FS_XATTR_INDEX_KEY,
^
In file included from ../fs/f2fs/file.c:29:0:
fs/f2fs/xattr.h:129:19: note: declared here
static inline int f2fs_setxattr(struct inode *inode, int index,
Thsi changes the prototype of the empty stub function to match
that of the actual implementation. This will not make the key
management work when F2FS_FS_XATTR is disabled, but it gets it
to build at least.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
During ->lookup, I_NEW state of inode was been cleared in f2fs_iget,
so in error path, we don't need to clear it again.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Just clean up opened code with existing function, no logic change.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The crc function is done bit by bit.
Optimize this by use cryptoapi
crc32 function which is backed by h/w acceleration.
Signed-off-by: Keith Mok <ek9852@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
ra_node_page() is used to read ahead one node page. Comparing to regular
read, it's faster because it doesn't wait for IO completion.
But if it is called twice for reading the same block, and the IO request
from the first call hasn't been completed before the second call, the second
call will have to wait until the read is over.
Here use the code in __do_page_cache_readahead() to solve this problem.
It does nothing when someone else already puts the page in mapping. The
status of page should be assured by whoever puts it there.
This implement also prevents alteration of page reference count.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds the renamed functions moved from the f2fs crypto files.
1. definitions for per-file encryption used by ext4 and f2fs.
2. crypto.c for encrypt/decrypt functions
a. IO preparation:
- fscrypt_get_ctx / fscrypt_release_ctx
b. before IOs:
- fscrypt_encrypt_page
- fscrypt_decrypt_page
- fscrypt_zeroout_range
c. after IOs:
- fscrypt_decrypt_bio_pages
- fscrypt_pullback_bio_page
- fscrypt_restore_control_page
3. policy.c supporting context management.
a. For ioctls:
- fscrypt_process_policy
- fscrypt_get_policy
b. For context permission
- fscrypt_has_permitted_context
- fscrypt_inherit_context
4. keyinfo.c to handle permissions
- fscrypt_get_encryption_info
- fscrypt_free_encryption_info
5. fname.c to support filename encryption
a. general wrapper functions
- fscrypt_fname_disk_to_usr
- fscrypt_fname_usr_to_disk
- fscrypt_setup_filename
- fscrypt_free_filename
b. specific filename handling functions
- fscrypt_fname_alloc_buffer
- fscrypt_fname_free_buffer
6. Makefile and Kconfig
Cc: Al Viro <viro@ftp.linux.org.uk>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_lock_all() calls down_write_nest_lock() to acquire a rw_sem and check
a mutex, but down_write_nest_lock() is designed for two rw_sem accoring to the
comment in include/linux/rwsem.h. And, other than f2fs, it is just called in
mm/mmap.c with two rwsem.
So, it looks it is used wrongly by f2fs. And, it causes the below compile
warning on -rt kernel too.
In file included from fs/f2fs/xattr.c:25:0:
fs/f2fs/f2fs.h: In function 'f2fs_lock_all':
fs/f2fs/f2fs.h:962:34: warning: passing argument 2 of 'down_write_nest_lock' from incompatible pointer type [-Wincompatible-pointer-types]
f2fs_down_write(&sbi->cp_rwsem, &sbi->cp_mutex);
^
fs/f2fs/f2fs.h:27:55: note: in definition of macro 'f2fs_down_write'
#define f2fs_down_write(x, y) down_write_nest_lock(x, y)
^
In file included from include/linux/rwsem.h:22:0,
from fs/f2fs/xattr.c:21:
include/linux/rwsem_rt.h:138:20: note: expected 'struct rw_semaphore *' but argument is of type 'struct mutex *'
static inline void down_write_nest_lock(struct rw_semaphore *sem,
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If f2fs was corrupted with missing dot dentries in root dirctory,
it needs to recover them after fsck.f2fs set F2FS_INLINE_DOTS flag
in directory inode when fsck.f2fs detects missing dot dentries.
Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
Signed-off-by: Yong Sheng <shengyong1@huawei.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a new helper f2fs_flush_merged_bios to clean up redundant codes.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a new help f2fs_update_data_blkaddr to clean up redundant codes.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For now, flow of GCing an encrypted data page:
1) try to grab meta page in meta inode's mapping with index of old block
address of that data page
2) load data of ciphertext into meta page
3) allocate new block address
4) write the meta page into new block address
5) update block address pointer in direct node page.
Other reader/writer will use f2fs_wait_on_encrypted_page_writeback to
check and wait on GCed encrypted data cached in meta page writebacked
in order to avoid inconsistence among data page cache, meta page cache
and data on-disk when updating.
However, we will use new block address updated in step 5) as an index to
lookup meta page in inner bio buffer. That would be wrong, and we will
never find the GCing meta page, since we use the old block address as
index of that page in step 1).
This patch fixes the issue by adjust the order of step 1) and step 3),
and in step 1) grab page with index generated in step 3).
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
1. Inode mapping tree can index page in range of [0, ULONG_MAX], however,
in some places, f2fs only search or iterate page in ragne of [0, LONG_MAX],
result in miss hitting in page cache.
2. filemap_fdatawait_range accepts range parameters in unit of bytes, so
the max range it covers should be [0, LLONG_MAX], if we use [0, LONG_MAX]
as range for waiting on writeback, big number of pages will not be covered.
This patch corrects above two issues.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The D state of wait_on_all_pages_writeback should be waken by
function f2fs_write_end_io when all writeback pages have been
succesfully written to device. It's possible that wake_up comes
between get_pages and io_schedule. Maybe in this case it will
lost wake_up and still in D state even if all pages have been
write back to device, and finally, the whole system will be into
the hungtask state.
if (!get_pages(sbi, F2FS_WRITEBACK))
break;
<--------- wake_up
io_schedule();
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Biao He <hebiao6@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When flushing node pages, if current node page is an inline inode page, we
will try to merge inline data from data page into inline inode page, then
skip flushing current node page, it will decrease the number of nodes to
be flushed in batch in this round, which may lead to worse performance.
This patch gives a chance to flush just merged inline inode pages for
performance.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch changes to show more info in message log about the recovery
of the corrupted superblock during ->mount, e.g. the index of corrupted
superblock and the result of recovery.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With a partition which was formated as multi segments in one section,
we stated incorrectly for count of gc operation.
e.g., for a partition with segs_per_sec = 4
cat /sys/kernel/debug/f2fs/status
GC calls: 208 (BG: 7)
- data segments : 104 (52)
- node segments : 104 (24)
GC called count should be (104 (data segs) + 104 (node segs)) / 4 = 52,
rather than 208. Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch avoids to remain inefficient victim segment number selected by
a victim.
For example, if all the dirty segments has same valid blocks, we can get
the victim segments descending order due to keeping wrong last segment number.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_convert_inline_page introduce what read_inline_data
already does for copying out the inline data from inode_page.
We can use read_inline_data instead to simplify the code.
Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When doing test with fstests/generic/068 in inline_dentry enabled f2fs,
following oops dmesg will be reported:
------------[ cut here ]------------
WARNING: CPU: 5 PID: 11841 at fs/inode.c:273 drop_nlink+0x49/0x50()
Modules linked in: f2fs(O) ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state
CPU: 5 PID: 11841 Comm: fsstress Tainted: G O 4.5.0-rc1 #45
Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.61 05/16/2013
0000000000000111 ffff88009cdf7ae8 ffffffff813e5944 0000000000002e41
0000000000000000 0000000000000111 0000000000000000 ffff88009cdf7b28
ffffffff8106a587 ffff88009cdf7b58 ffff8804078fe180 ffff880374a64e00
Call Trace:
[<ffffffff813e5944>] dump_stack+0x48/0x64
[<ffffffff8106a587>] warn_slowpath_common+0x97/0xe0
[<ffffffff8106a5ea>] warn_slowpath_null+0x1a/0x20
[<ffffffff81231039>] drop_nlink+0x49/0x50
[<ffffffffa07b95b4>] f2fs_rename2+0xe04/0x10c0 [f2fs]
[<ffffffff81231ff1>] ? lock_two_nondirectories+0x81/0x90
[<ffffffff813f454d>] ? lockref_get+0x1d/0x30
[<ffffffff81220f70>] vfs_rename+0x2e0/0x640
[<ffffffff8121f9db>] ? lookup_dcache+0x3b/0xd0
[<ffffffff810b8e41>] ? update_fast_ctr+0x21/0x40
[<ffffffff8134ff12>] ? security_path_rename+0xa2/0xd0
[<ffffffff81224af6>] SYSC_renameat2+0x4b6/0x540
[<ffffffff810ba8ed>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff810022ba>] ? exit_to_usermode_loop+0x7a/0xd0
[<ffffffff817e0ade>] ? int_ret_from_sys_call+0x52/0x9f
[<ffffffff810bdc90>] ? trace_hardirqs_on_caller+0x100/0x1c0
[<ffffffff81224b8e>] SyS_renameat2+0xe/0x10
[<ffffffff8121f08e>] SyS_rename+0x1e/0x20
[<ffffffff817e0957>] entry_SYSCALL_64_fastpath+0x12/0x6f
---[ end trace 2b31e17995404e42 ]---
This is because: in the same inline directory, when we renaming one file
from source name to target name which is not existed, once space of inline
dentry is not enough, inline conversion will be triggered, after that all
data in inline dentry will be moved to normal dentry page.
After attaching the new entry in coverted dentry page, still we try to
remove old entry in original inline dentry, since old entry has been
moved, so it obviously doesn't make any effect, result in remaining old
entry in converted dentry page.
Now, we have two valid dentries pointed to the same inode which has nlink
value of 1, deleting them both, above warning appears.
This issue can be reproduced easily as below steps:
1. mount f2fs with inline_dentry option
2. mkdir dir
3. touch 180 files named [001-180] in dir
4. rename dir/180 dir/181
5. rm dir/180 dir/181
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Should check and show correct return value of update_dent_inode in
->rename.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
>From the function name of get_valid_checkpoint, it seems to return
the valid cp or NULL for caller to check. If no valid one is found,
f2fs_fill_super will print the err log. But if get_valid_checkpoint
get one valid(the return value indicate that it's valid, however actually
it is invalid after sanity checking), then print another similar err
log. That seems strange. Let's keep sanity checking inside the procedure
of geting valid cp. Another improvement we gained from this move is
that even the large volume is supported, we check the cp in advanced
to skip the following procedure if failing the sanity checking.
Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
read_raw_super_block was introduced to help find the
first valid superblock. Commit da554e48ca ("f2fs:
recovering broken superblock during mount") changed the
behaviour to read both of them and check whether need
the recovery flag or not. So the comment before this
function isn't consistent with what it actually does.
Also, the origin code use two tags to round the err
cases, which isn't so readable. So this patch amend
the comment and slightly reorganize it.
Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When lookuping nat entry in cache_nat_entry, if we fail to hit nat cache,
we try to load nat entries a) from journal of current segment cache or b)
from NAT pages for updating, during the process, write lock of
nat_tree_lock will be held to avoid inconsistent condition in between
nid cache and nat cache caused by racing among nat entry shrinker,
checkpointer, nat entry updater.
But this way may cause low efficient when updating nat cache, because it
serializes accessing in journal cache or reading NAT pages.
Here, we reorder lock and update flow as below to enhance accessing
concurrency:
- get_node_info
- down_read(nat_tree_lock)
- lookup nat cache --- hit -> unlock & return
- lookup journal cache --- hit -> unlock & goto update
- up_read(nat_tree_lock)
update:
- down_write(nat_tree_lock)
- cache_nat_entry
- lookup nat cache --- nohit -> update
- up_write(nat_tree_lock)
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In curseg cache, f2fs caches two different parts:
- datas of current summay block, i.e. summary entries, footer info.
- journal info, i.e. sparse nat/sit entries or io stat info.
With this approach, 1) it may cause higher lock contention when we access
or update both of the parts of cache since we use the same mutex lock
curseg_mutex to protect the cache. 2) current summary block with last
journal info will be writebacked into device as a normal summary block
when flushing, however, we treat journal info as valid one only in current
summary, so most normal summary blocks contain junk journal data, it wastes
remaining space of summary block.
So, in order to fix above issues, we split curseg cache into two parts:
a) current summary block, protected by original mutex lock curseg_mutex
b) journal cache, protected by newly introduced r/w semaphore journal_rwsem
When loading curseg cache during ->mount, we store summary info and
journal info into different caches; When doing checkpoint, we combine
datas of two cache into current summary block for persisting.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Try to use block plug in more place as below to let process cache bios
as much as possbile, in order to reduce lock overhead of queue in IO
scheduler.
1) sync_meta_pages
2) ra_meta_pages
3) f2fs_balance_fs_bg
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adopts f2fs with codes of ext4, it removes unneeded memory
allocation in creating/accessing path of symlink.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch syncs f2fs with commit abdd438b26 ("ext4 crypto: handle
unexpected lack of encryption keys") from ext4.
Fix up attempts by users to try to write to a file when they don't
have access to the encryption key.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch syncs f2fs with commit 6bc445e0ff ("ext4 crypto: make
sure the encryption info is initialized on opendir(2)") from ext4.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs support atomic write with following semantics:
1. open db file
2. ioctl start atomic write
3. (write db file) * n
4. ioctl commit atomic write
5. close db file
With this flow we can avoid file becoming corrupted when abnormal power
cut, because we hold data of transaction in referenced pages linked in
inmem_pages list of inode, but without setting them dirty, so these data
won't be persisted unless we commit them in step 4.
But we should still hold journal db file in memory by using volatile
write, because our semantics of 'atomic write support' is incomplete, in
step 4, we could fail to submit all dirty data of transaction, once
partial dirty data was committed in storage, then after a checkpoint &
abnormal power-cut, db file will be corrupted forever.
So this patch tries to improve atomic write flow by adding a revoking flow,
once inner error occurs in committing, this gives another chance to try to
revoke these partial submitted data of current transaction, it makes
committing operation more like aotmical one.
If we're not lucky, once revoking operation was failed, EAGAIN will be
reported to user for suggesting doing the recovery with held journal file,
or retrying current transaction again.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Split drop_inmem_pages from commit_inmem_pages for code readability,
and prepare for the following modification.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to eliminate garbage name lengths in dentries in order
to provide correct answers of readdir.
For example, if a valid dentry consists of:
bitmap : 1 1 1 1
len : 32 0 x 0,
readdir can start with second bit_pos having len = 0.
Or, it can start with third bit_pos having garbage.
In both of cases, we should avoid to try filling dentries.
So, this patch not only removes any garbage length, but also avoid entering
zero length case in readdir.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adopts:
ext4 crypto: check for too-short encrypted file names
An encrypted file name should never be shorter than an 16 bytes, the
AES block size. The 3.10 crypto layer will oops and crash the kernel
if ciphertext shorter than the block size is passed to it.
Fortunately, in modern kernels the crypto layer will not crash the
kernel in this scenario, but nevertheless, it represents a corrupted
directory, and we should detect it and mark the file system as
corrupted so that e2fsck can fix this.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adopts:
ext4 crypto: ext4_page_crypto() doesn't need a encryption context
Since ext4_page_crypto() doesn't need an encryption context (at least
not any more), this allows us to simplify a number function signature
and also allows us to avoid needing to allocate a context in
ext4_block_write_begin(). It also means we no longer need a separate
ext4_decrypt_one() function.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adopts:
ext4 crypto: replace some BUG_ON()'s with error checks
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch preallocates data blocks for buffered aio writes.
With this patch, we can avoid redundant locking and unlocking of node pages
given consecutive aio request.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
fix missing skip pages info in f2fs_writepages trace event.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs use single bio buffer per type data (META/NODE/DATA) for caching
writes locating in continuous block address as many as possible, after
submitting, these writes may be still cached in bio buffer, so we have
to flush cached writes in bio buffer by calling f2fs_submit_merged_bio.
Unfortunately, in the scenario of high concurrency, bio buffer could be
flushed by someone else before we submit it as below reasons:
a) there is no space in bio buffer.
b) add a request of different type (SYNC, ASYNC).
c) add a discontinuous block address.
For this condition, f2fs_submit_merged_bio will be devastating, because
it could break the following merging of writes in bio buffer, split one
big bio into two smaller one.
This patch introduces f2fs_submit_merged_bio_cond which can do a
conditional submitting with bio buffer, before submitting it will judge
whether:
- page in DATA type bio buffer is matching with specified page;
- page in DATA type bio buffer is belong to specified inode;
- page in NODE type bio buffer is belong to specified inode;
If there is no eligible page in bio buffer, we will skip submitting step,
result in gaining more chance to merge consecutive block IOs in bio cache.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Sometimes, if cp_error is set, there remains under-writeback pages, resulting in
kernel hang in put_super.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Likewise f2fs_write_cache_pages, let's do for node and meta pages too.
Especially, for node blocks, we should do this before marking its fsync
and dentry flags.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch makes f2fs_map_blocks supporting returning next potential
page offset which skips hole region in indirect tree of inode, and
use it to speed up fiemap in handling big hole case.
Test method:
xfs_io -f /mnt/f2fs/file -c "pwrite 1099511627776 4096"
time xfs_io -f /mnt/f2fs/file -c "fiemap -v"
Before:
time xfs_io -f /mnt/f2fs/file -c "fiemap -v"
/mnt/f2fs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2147483647]: hole 2147483648
1: [2147483648..2147483655]: 81920..81927 8 0x1
real 3m3.518s
user 0m0.000s
sys 3m3.456s
After:
time xfs_io -f /mnt/f2fs/file -c "fiemap -v"
/mnt/f2fs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2147483647]: hole 2147483648
1: [2147483648..2147483655]: 81920..81927 8 0x1
real 0m0.008s
user 0m0.000s
sys 0m0.008s
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When seeking data in ->llseek, if we encounter a big hole which covers
several dnode pages, we will try to seek data from index of page which
is the first page of next dnode page, at most we could skip searching
(ADDRS_PER_BLOCK - 1) pages.
However it's still not efficient, because if our indirect/double-indirect
pointer are NULL, there are no dnode page locate in the tree indirect/
double-indirect pointer point to, it's not necessary to search the whole
region.
This patch introduces get_next_page_offset to calculate next page offset
based on current searching level and max searching level returned from
get_dnode_of_data, with this, we could skip searching the entire area
indirect or double-indirect node block is not exist.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are redundant pointer conversion in following call stack:
- at position a, inode was been converted to f2fs_file_info.
- at position b, f2fs_file_info was been converted to inode again.
- truncate_blocks(inode,..)
- fi = F2FS_I(inode) ---a
- ADDRS_PER_PAGE(node_page, fi)
- addrs_per_inode(fi)
- inode = &fi->vfs_inode ---b
- f2fs_has_inline_xattr(inode)
- fi = F2FS_I(inode)
- is_inode_flag_set(fi,..)
In order to avoid unneeded conversion, alter ADDRS_PER_PAGE and
addrs_per_inode to acept parameter with type of inode pointer.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch uses existing function f2fs_map_block to simplify implementation
of __allocate_data_blocks.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_map_blocks, we use duplicated codes to handle first block mapping
and the following blocks mapping, it's unnecessary. This patch simplifies
f2fs_map_blocks to avoid using copied codes.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces lifetime IO write statistics exposed to the sysfs interface.
The write IO amount is obtained from block layer, accumulated in the file system and
stored in the hot node summary of checkpoint.
Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Pengyang Hou <houpengyang@huawei.com>
[Jaegeuk Kim: add sysfs documentation]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
On the worst case, we need to scan the whole radix tree and every rb-tree to
free the victimed extent_nodes when shrinking.
Pengyang initially introduced a victim_list to record the victimed extent_nodes,
and free these extent_nodes by just scanning a list.
Later, Chao Yu enhances the original patch to improve memory footprint by
removing victim list.
The policy of lru list shrinking becomes:
1) lock lru list's lock
2) trylock extent tree's lock
3) remove extent node from lru list
4) unlock lru list's lock
5) do shrink
6) repeat 1) to 5)
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are three steps to free an extent node:
1) list_del_init, 2)__detach_extent_node, 3) kmem_cache_free
In path f2fs_destroy_extent_tree, 1->2->3 to free a node,
But in path f2fs_update_extent_tree_range, it is 2->1->3.
This patch makes all the order to be: 1->2->3
It makes sense, since in the next patch, we import a victim list in the
path shrink_extent_tree, we could check if the extent_node is in the victim
list by checking the list_empty(). So it is necessary to put 1) first.
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
variable nsearched in get_victim_by_default() indicates the number of
dirty segments we already checked. There are 2 problems about the way
it updates:
1. When p.ofs_unit is greater than 1, the victim we find consists
of multiple segments, possibly more than 1 dirty segment.
But nsearched always increases by 1.
2. If segments have been found but not been chosen, nsearched won't
increase. So even we have checked all dirty segments, nsearched
may still less than p.max_search.
All these problems could cause unnecessary search after all dirty
segments have already been checked.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
no need to wait inline file page writeback for no one
use it, so this patch delete unnecessary wait.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In write_begin, if storage supports stable_page, we don't need to wait for
writeback to update its contents.
This patch introduces to use wait_for_stable_page instead of
wait_on_page_writeback.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we configure section consist of multiple segments, foreground GC will
do the garbage collection with following approach:
for each segment in victim section
blk_start_plug
for each valid block in segment
write out by OPU method
submit bio cache <---
blk_finish_plug <---
There are two issue:
1) for most of the time, 'submit bio cache' will break the merging in
current bio buffer from writes of next segments, making a smaller bio
submitting.
2) block plug only cover IO submitting in one segment, which reduce
opportunity of merging IOs in plug with multiple segments.
So refactor the code as below structure to strive for biggest
opportunity of merging IOs:
blk_start_plug
for each segment in victim section
for each valid block in segment
write out by OPU method
submit bio cache
blk_finish_plug
Test method:
1. mkfs.f2fs -s 8 /dev/sdX
2. touch 32 files
3. write 2M data into each file
4. punch 1.5M data from offset 0 for each file
5. trigger foreground gc through ioctl
Before patch, there are totoally 40 bios submitted.
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65536, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65776, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66016, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66256, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66496, size = 32768
----repeat for 8 times
After patch, there are totally 35 bios submitted.
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65536, size = 122880
----repeat 34 times
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 73696, size = 16384
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If end_io gets an error, we don't need to set the page as dirty, since we
already set f2fs_stop_checkpoint which will not flush any data.
This will resolve the following warning.
======================================================
[ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
4.4.0+ #9 Tainted: G O
------------------------------------------------------
xfs_io/26773 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
(&(&sbi->inode_lock[i])->rlock){+.+...}, at: [<ffffffffc025483f>] update_dirty_page+0x6f/0xd0 [f2fs]
and this task is already holding:
(&(&q->__queue_lock)->rlock){-.-.-.}, at: [<ffffffff81396ea2>] blk_queue_bio+0x422/0x490
which would create a new lock dependency:
(&(&q->__queue_lock)->rlock){-.-.-.} -> (&(&sbi->inode_lock[i])->rlock){+.+...}
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In write_begin, if there is an inline_data, f2fs loads it into 0'th data page.
Since it's the read path, we don't need to sync its inode page.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In write_end, we don't need to sync inode page at every time.
Instead, we can expect f2fs_write_inode will update later.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The sceanrio is:
1. create fully node blocks
2. flush node blocks
3. write inline_data for all the node blocks again
4. flush node blocks redundantly
So, this patch tries to flush inline_data when flushing node blocks.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The scenario is:
1. create lots of node blocks
2. sync
3. write lots of inline_data
-> got panic due to no free space
In that case, we should flush node blocks when writing inline_data in #3,
and trigger gc as well.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there are many writepages calls by multiple threads in background, we don't
need to serialize to merge all the bios, since it's background.
In such the case, it'd better to run writepages concurrently.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
get_new_segment starts from current segment position, tries to search a
free segment among its right neighbors locate in same section.
But previously our search area was set as [current segment, max segment],
which means we have to search to more bits in free_segmap bitmap for some
worse cases. So here we correct the search area to [current segment, last
segment in section] to avoid unnecessary searching.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch exports a new sysfs entry 'dirty_nat_ratio' to control threshold
of dirty nat entries, if current ratio exceeds configured threshold,
checkpoint will be triggered in f2fs_balance_fs_bg for flushing dirty nats.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When testing f2fs with xfstest, generic/251 is stuck for long time,
the case uses below serials to obtain fresh released space in device,
in order to prepare for following fstrim test.
1. rm -rf /mnt/dir
2. mkdir /mnt/dir/
3. cp -axT `pwd`/ /mnt/dir/
4. goto 1
During preparing step, all nat entries will be cached in nat cache,
most of them are dirty entries with invalid blkaddr, which means
nodes related to these entries have been truncated, and they could
be reused after the dirty entries been checkpointed.
However, there was no checkpoint been triggered, so nid allocators
(e.g. mkdir, creat) will run into long journey of iterating all NAT
pages, looking for free nids in alloc_nid->build_free_nids.
Here, in f2fs_balance_fs_bg we give another chance to do checkpoint
to flush nat entries for reusing them in free nid cache when dirty
entry count exceeds 10% of max count.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Operations in is_merged_page is related to inner bio cache, move it to
data.c.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:
- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.
The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull f2fs updates from Jaegeuk Kim:
"This series adds two ioctls to control cached data and fragmented
files. Most of the rest fixes missing error cases and bugs that we
have not covered so far. Summary:
Enhancements:
- support an ioctl to execute online file defragmentation
- support an ioctl to flush cached data
- speed up shrinking of extent_cache entries
- handle broken superblock
- refector dirty inode management infra
- revisit f2fs_map_blocks to handle more cases
- reduce global lock coverage
- add detecting user's idle time
Major bug fixes:
- fix data race condition on cached nat entries
- fix error cases of volatile and atomic writes"
* tag 'for-f2fs-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (87 commits)
f2fs: should unset atomic flag after successful commit
f2fs: fix wrong memory condition check
f2fs: monitor the number of background checkpoint
f2fs: detect idle time depending on user behavior
f2fs: introduce time and interval facility
f2fs: skip releasing nodes in chindless extent tree
f2fs: use atomic type for node count in extent tree
f2fs: recognize encrypted data in f2fs_fiemap
f2fs: clean up f2fs_balance_fs
f2fs: remove redundant calls
f2fs: avoid unnecessary f2fs_balance_fs calls
f2fs: check the page status filled from disk
f2fs: introduce __get_node_page to reuse common code
f2fs: check node id earily when readaheading node page
f2fs: read isize while holding i_mutex in fiemap
Revert "f2fs: check the node block address of newly allocated nid"
f2fs: cover more area with nat_tree_lock
f2fs: introduce max_file_blocks in sbi
f2fs crypto: check CONFIG_F2FS_FS_XATTR for encrypted symlink
f2fs: introduce zombie list for fast shrinking extent trees
...
Pull misc vfs updates from Al Viro:
"All kinds of stuff. That probably should've been 5 or 6 separate
branches, but by the time I'd realized how large and mixed that bag
had become it had been too close to -final to play with rebasing.
Some fs/namei.c cleanups there, memdup_user_nul() introduction and
switching open-coded instances, burying long-dead code, whack-a-mole
of various kinds, several new helpers for ->llseek(), assorted
cleanups and fixes from various people, etc.
One piece probably deserves special mention - Neil's
lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
called without ->i_mutex and tries to avoid ever taking it. That, of
course, means that it's not useful for any directory modifications,
but things like getting inode attributes in nfds readdirplus are fine
with that. I really should've asked for moratorium on lookup-related
changes this cycle, but since I hadn't done that early enough... I
*am* asking for that for the coming cycle, though - I'm going to try
and get conversion of i_mutex to rwsem with ->lookup() done under lock
taken shared.
There will be a patch closer to the end of the window, along the lines
of the one Linus had posted last May - mechanical conversion of
->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
inode_is_locked()/inode_lock_nested(). To quote Linus back then:
-----
| This is an automated patch using
|
| sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
| sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
| sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
| sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
| sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
|
| with a very few manual fixups
-----
I'm going to send that once the ->i_mutex-affecting stuff in -next
gets mostly merged (or when Linus says he's about to stop taking
merges)"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
nfsd: don't hold i_mutex over userspace upcalls
fs:affs:Replace time_t with time64_t
fs/9p: use fscache mutex rather than spinlock
proc: add a reschedule point in proc_readfd_common()
logfs: constify logfs_block_ops structures
fcntl: allow to set O_DIRECT flag on pipe
fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
fs: xattr: Use kvfree()
[s390] page_to_phys() always returns a multiple of PAGE_SIZE
nbd: use ->compat_ioctl()
fs: use block_device name vsprintf helper
lib/vsprintf: add %*pg format specifier
fs: use gendisk->disk_name where possible
poll: plug an unused argument to do_poll
amdkfd: don't open-code memdup_user()
cdrom: don't open-code memdup_user()
rsxx: don't open-code memdup_user()
mtip32xx: don't open-code memdup_user()
[um] mconsole: don't open-code memdup_user_nul()
[um] hostaudio: don't open-code memdup_user()
...
This patch fixes wrong decision for avaliable_free_memory.
The return valus is already set as false, so we should consider true condition
below only.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds last time that user requested filesystem operations.
This information is used to detect whether system is idle or not later.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull vfs xattr updates from Al Viro:
"Andreas' xattr cleanup series.
It's a followup to his xattr work that went in last cycle; -0.5KLoC"
* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
xattr handlers: Simplify list operation
ocfs2: Replace list xattr handler operations
nfs: Move call to security_inode_listsecurity into nfs_listxattr
xfs: Change how listxattr generates synthetic attributes
tmpfs: listxattr should include POSIX ACL xattrs
tmpfs: Use xattr handler infrastructure
btrfs: Use xattr handler infrastructure
vfs: Distinguish between full xattr names and proper prefixes
posix acls: Remove duplicate xattr name definitions
gfs2: Remove gfs2_xattr_acl_chmod
vfs: Remove vfs_xattr_cmp
If there are no nodes in extent tree, let's skip releasing step to avoid
any overhead of grabbing/releasing extent tree lock.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
1. rename field in struct extent_tree from count to node_cnt for
readability.
2. alter to use atomic type for node_cnt.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to teach f2fs_fiemap to recognize encrypted data.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are duplicated code in between get_node_page and get_node_page_ra,
introduce __get_node_page to includes common parts of these two, and
export get_node_page and get_node_page_ra by reusing __get_node_page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add node id check in ra_node_page and get_node_page_ra like get_node_page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
make sure the isize we read doesn't change during the process.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There was a subtle bug on nat cache management which incurs wrong nid allocation
or wrong block addresses when try_to_free_nats is triggered heavily.
This patch enlarges the previous coverage of nat_tree_lock to avoid data race.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce max_file_blocks in sbi to store max block index of file in f2fs,
it could be used to avoid unneeded calculation of max block index in
runtime.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: fix overflow of sbi->max_file_blocks]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add missed CONFIG_F2FS_FS_XATTR for encrypted symlink inode in order
to avoid unneeded registry of ->{get,set,remove}xattr.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes refcount, and instead, adds zombie_list to shrink directly
without radix tree traverse.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes missing IPU condition when fdatasync is called.
With this patch, fdatasync is able to avoid additional node writes for recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There is no report on this bug_on case, but if malicious attacker changed this
field intentionally, we can just reset it as a MAX value.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are two rules to handle aborting volatile or atomic writes.
1. drop atomic writes
- we don't need to keep any stale db data.
2. write journal data
- we should keep the journal data with fsync for db recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If filesystem is readonly, leave user message info instead of recovering
inline dot inode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, we can get mismatched largest extent information.
One example is:
1. mount f2fs w/ extent_cache
2. make a small extent
3. umount
4. mount f2fs w/o extent_cache
5. update the largest extent
6. umount
7. mount f2fs w/ extent_cache
8. get the old extent made by #2
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use f2fs_sync_fs to clean up codes in f2fs_ioc_write_checkpoint.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: remove unused err variable]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a max block check for get_data_block_bmap.
Trinity test program will send a block number as parameter into
ioctl_fibmap, which will be used in get_node_path(), when the block
number large than f2fs max blocks, it will trigger kernel bug.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
[Jaegeuk Kim: fix missing condition, pointed by Chao Yu]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
fix bugs:
1. len could be updated incorrectly when start+len is beyond isize.
2. If there is a hole consisting of more than two blocks, it could
fail to add FIEMAP_EXTENT_LAST flag for the last extent.
3. If there is an extent beyond isize, when we search extents in a range
that ends at isize, it will also return the extent beyond isize,
which is outside the range.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Sometimes we keep dumb when IO error occur in lower layer device, so user
will not receive any error return value for some operation, but actually,
the operation did not succeed.
This sould be avoided, so this patch reports such kind of error to user.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
__recover_do_dentries will try to grab free space in storage, so fix to
add missing f2fs_balance_fs here.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If f2fs_write_begin is to update data, we can bypass calling f2fs_lock_op() in
order to avoid the checkpoint latency in the write syscall.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If get_node_page() gets zero nid, we can return early without getting a wrong
page. For example, get_dnode_of_data() can try to do that.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds prepare_write_begin to clean f2fs_write_begin.
The major role of this function is to convert any inline_data and allocate
or find block address.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If inline_data option is disable, when truncating an inline inode with
size which is not exceed maxinum inline size, we should not convert
inline inode to regular one to avoid the overhead of synchronizing
conversion.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
do_checkpoint and write_checkpoint can fail due to reasons like triggering
in a readonly fs or encountering IO error of storage device.
So it's better to report such error info to user, let user be aware of
failure of doing checkpoint.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If user tries to update or read data, we don't need to call f2fs_balance_fs
which triggers f2fs_gc, which increases unnecessary long latency.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Only cover sbi->cp_rwsem on one dnode page's allocation and modification
instead of multiple's in f2fs_map_blocks, it can reduce the covered region
of cp_rwsem, then we can avoid potential long time delay for concurrent
checkpointer.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces recording node block allocation in dnode_of_data.
This information helps to figure out whether any node block is allocated during
specific file operations.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The f2fs_balance_fs doesn't need to cover f2fs_new_inode or f2fs_find_entry
works.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can check inode's inline_data flag when calling to convert it.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is no candidates for shrinking slab entries, we don't need to traverse
any trees at all.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: fix missing initialization reported by Yunlei He]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It would be better to use atomic variable for total_extent_tree.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
check map->m_len right after it changes to avoid excess call
to update dnode_of_data.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, when finishing a checkpoint, we have persisted all fs meta
info including meta inode, node inode, dentry page of directory inode, so,
after a sudden power cut, f2fs can recover from last checkpoint with full
directory structure.
But during checkpoint, we didn't flush dirty pages of regular and symlink
inode, so such dirty datas still in memory will be lost in that moment of
power off.
In order to reduce the chance of lost data, this patch enables
f2fs_balance_fs_bg with the ability of data flushing. It will try to flush
user data before starting a checkpoint. So user's data written after last
checkpoint which may not be fsynced could be saved.
When we mount with data_flush option, after every period of cp_interval
(could be configured in sysfs: /sys/fs/f2fs/device/cp_interval) seconds
user data could be flushed into device once f2fs_balance_fs_bg was called
in kworker thread or gc thread.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add to stat dirty regular and symlink inode for showing in debugfs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a new option 'data_flush' to enable data flush functionality.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Maintain regular/symlink inode which has dirty pages in global dirty list
and record their total dirty pages count like the way of handling directory
inode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce __f2fs_commit_super to include duplicated codes in
f2fs_commit_super for cleanup.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We have already got one copy of valid super block in memory, do not grab
buffer header of super block all the time.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs use fields of f2fs_super_block struct directly in a grabbed buffer.
Once the buffer happen to be destroyed (e.g. through dd), it may bring
in unpredictable effect on f2fs.
This patch fixes to allocate additional buffer to store datas of super
block rather than using grabbed block buffer directly.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_map_blocks will set m_flags and m_len to 0, so we don't need to
reset m_flags ourselves, but have to reset m_len to correct value
before use it again.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce __remove_dirty_inode to clean up codes in remove_dirty_dir_inode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a new dirt list node member in inode info for linking the inode to
global dirty list in superblock, instead of old implementation which
allocate slab cache memory as an entry to inode.
It avoids memory pressure due to slab cache allocation, and also makes
codes more clean.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
remove_dirty_dir_inode will be renamed to remove_dirty_inode as a generic
function in following patch for removing directory/regular/symlink inode
in global dirty list.
Here rename ino management related functions for readability, also in
order to avoid name conflict.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
map.m_len should be reduced after skip a block
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
@lend of filemap_write_and_wait_range is supposed to be a "offset
in bytes where the range ends (inclusive)". Subtract 1 to avoid
writing an extra page.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If read_node_page return LOCKED_PAGE, in its caller it's better a) skip
unneeded 'Update' flag and mapping info verfication; b) check nid value
stored in footer structure of node page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Change the list operation to only return whether or not an attribute
should be listed. Copying the attribute names into the buffer is moved
to the callers.
Since the result only depends on the dentry and not on the attribute
name, we do not pass the attribute name to list operations.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Previously, f2fs_commit_super hacks the bh->blocknr to write the broken
alternate superblock.
Instead of it, we should use the correct logic to retrieve its buffer head
with locking it appropriately.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch enhances the existing bit operation when f2fs allocates SSR
blocks.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
new method: ->get_link(); replacement of ->follow_link(). The differences
are:
* inode and dentry are passed separately
* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.
It's a flagday change - the old method is gone, all in-tree instances
converted. Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode. That'll change
in the next commits.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.
new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases. page_follow_link_light()
instrumented to yell about anything missed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add an additional "name" field to struct xattr_handler. When the name
is set, the handler matches attributes with exactly that name. When the
prefix is set instead, the handler matches attributes with the given
prefix and with a non-empty suffix.
This patch should avoid bugs like the one fixed in commit c361016a in
the future.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In commit 3c45414527 ("f2fs: do not trim preallocated blocks when
truncating after i_size"), in order to follow the regulation: "truncate(x)
where x > i_size will not trim all blocks past i_size." like other file
systems, in ->setattr we invoked truncate_setsize instead of f2fs_truncate
to avoid unneeded block trimming in such case, but forgot to call
f2fs_convert_inline_inode keep consistency of inline data conversion rule.
This patch fixes to convert inline data if necessary.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use sbi->blocks_per_seg directly to avoid unnecessary calculation when using
1 << sbi->log_blocks_per_seg.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For direct IO, f2fs only allocate new address for the block which is not
exist in the disk before, its mapping info should not exist in extent
cache previously, so here we do not need to call f2fs_drop_largest_extent
to drop related cache.
Due to no more callers for f2fs_drop_largest_extent now, kill it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In recover_data, value of argument 'type' will be CURSEG_WARM_NODE all
the time, remove it for cleanup.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Clean up codes in lookup_journal_in_cursum() with __has_cursum_space().
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If device does not support discard, some obsolete dnodes can be recovered
by roll-forward. This patch enhances the recovery flow.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
While handling extent trees, we can enter into a reclaiming path anytime.
If it tries to release some extent nodes in the same extent tree,
write_lock(&et->lock) would be hanged.
In order to avoid the deadlock, we can just skip it.
Note that, if it is an unreferenced tree, we should get write_lock(&et->lock)
successfully and release all of therein nodes.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
get_lock_data_page in f2fs_readdir can fail due to a lot of reasons (i.e.
no memory or IO error...), it's better to report this kind of error to
user rather than ignoring it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should clear uptodate flag for all pages atomic written when we drop
them, otherwise before these cached pages were reclaimed or invalidated
eventually, we will see invalid data when hitting them again.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
1. Skip __reverse_ulong if the bitmap is empty.
2. Reduce branches and codes.
According to my test, the performance of this new version is 5% higher on
an empty bitmap of 64bytes, and remains about the same in the worst scenario.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If last dirty dentry page was writebacked in reclaim path, we should
remove its directory inode from global dirty list to avoid unnecessary
flush for this inode when doing checkpoint.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In 64-bit kernel f2fs can supports 32-bit ioctl system call by identifying
encoded code which is converted from 32-bit one to 64-bit one in
->compat_ioctl.
When we introduced new interfaces in ->ioctl, we forgot to enable them in
->compat_ioctl, so enable them for fixing.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: fix wrongly added spaces together]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a new ioctl F2FS_IOC_DEFRAGMENT to support file
defragment in a specified range of regular file.
This ioctl can be used in very limited workload: if user expects high
sequential read performance in randomly written file, this interface
can be used for defragmentation, after that file can be written as
continuous as possible in the device.
Meanwhile, it has side-effect, it will make holes in segments where
blocks located originally, so it's better to trigger GC to eliminate
fragment in segments.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should always commit atomic written pages in LFS mode, otherwise data
will become corrupted if we encounter suddent power cut after partial
pages committed in IPU mode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_create_root_stats can fail due to no memory, report it to user.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now that the xattr handler is passed to the xattr handler operations, we
have access to the attribute name prefix, so simplify
f2fs_xattr_generic_list.
Also, f2fs_xattr_advise_list is only ever called for
f2fs_xattr_advise_handler; there is no need to double check for that.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <chao2.yu@samsung.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The xattr_handler operations are currently all passed a file system
specific flags value which the operations can use to disambiguate between
different handlers; some file systems use that to distinguish the xattr
namespace, for example. In some oprations, it would be useful to also have
access to the handler prefix. To allow that, pass a pointer to the handler
to operations instead of the flags value alone.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new_valid_dev() always returns 1, so the !new_valid_dev() check is not
needed. Remove it.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull security subsystem update from James Morris:
"This is mostly maintenance updates across the subsystem, with a
notable update for TPM 2.0, and addition of Jarkko Sakkinen as a
maintainer of that"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (40 commits)
apparmor: clarify CRYPTO dependency
selinux: Use a kmem_cache for allocation struct file_security_struct
selinux: ioctl_has_perm should be static
selinux: use sprintf return value
selinux: use kstrdup() in security_get_bools()
selinux: use kmemdup in security_sid_to_context_core()
selinux: remove pointless cast in selinux_inode_setsecurity()
selinux: introduce security_context_str_to_sid
selinux: do not check open perm on ftruncate call
selinux: change CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE default
KEYS: Merge the type-specific data with the payload data
KEYS: Provide a script to extract a module signature
KEYS: Provide a script to extract the sys cert list from a vmlinux file
keys: Be more consistent in selection of union members used
certs: add .gitignore to stop git nagging about x509_certificate_list
KEYS: use kvfree() in add_key
Smack: limited capability for changing process label
TPM: remove unnecessary little endian conversion
vTPM: support little endian guests
char: Drop owner assignment from i2c_driver
...
In f2fs_shrink_extent_tree we should stop shrink flow if we have already
shrunk enough nodes in extent cache.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, in ->symlink of f2fs, we kept the fixed invoking order between
f2fs_add_link and page_symlink since we should init node info firstly
in f2fs_add_link, then such node info can be used in page_symlink.
But we didn't fix to release meta info which was done before page_symlink
in our error path, so this will leave us corrupt symlink entry in its
parent's dentry page. Fix this issue by adding f2fs_unlink in the error
path for removing such linking.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Atomic write page can be GCed, after committing this kind of page, we should
clear the GCed flag for it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we got failure during commit_atomic_write, abort_volatile_write will be
called, but will not drop the inmemory pages due to no FI_ATOMIC_FILE.
Actually, there is no reason to check the flag in abort_volatile_write.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Users expect bmap will give allocated block addresses.
Let's play likewise ext4.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Once f2fs_gc is done, wait_ms is changed once more.
So, its tracepoint would be located after it.
Reported-by: He YunLei <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
different competitors
Since we use different page cache (normally inode's page cache for R/W
and meta inode's page cache for GC) to cache the same physical block
which is belong to an encrypted inode. Writeback of these two page
cache should be exclusive, but now we didn't handle writeback state
well, so there may be potential racing problem:
a)
kworker: f2fs_gc:
- f2fs_write_data_pages
- f2fs_write_data_page
- do_write_data_page
- write_data_page
- f2fs_submit_page_mbio
(page#1 in inode's page cache was queued
in f2fs bio cache, and be ready to write
to new blkaddr)
- gc_data_segment
- move_encrypted_block
- pagecache_get_page
(page#2 in meta inode's page cache
was cached with the invalid datas
of physical block located in new
blkaddr)
- f2fs_submit_page_mbio
(page#1 was submitted, later, page#2
with invalid data will be submitted)
b)
f2fs_gc:
- gc_data_segment
- move_encrypted_block
- f2fs_submit_page_mbio
(page#1 in meta inode's page cache was
queued in f2fs bio cache, and be ready
to write to new blkaddr)
user thread:
- f2fs_write_begin
- f2fs_submit_page_bio
(we submit the request to block layer
to update page#2 in inode's page cache
with physical block located in new
blkaddr, so here we may read gabbage
data from new blkaddr since GC hasn't
writebacked the page#1 yet)
This patch fixes above potential racing problem for encrypted inode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After finishing building free nid cache, we will try to readahead
asynchronously 4 more pages for the next reloading, the count of
readahead nid pages is fixed.
In some case, like SMR drive, read less sectors with fixed count
each time we trigger RA may be low efficient, since we will face
high seeking overhead, so we'd better let user to configure this
parameter from sysfs in specific workload.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When there is no free nid in nid cache, all new node allocaters stop their
job to wait for reloading of free nids, however reloading is synchronous as
we will read 4 NAT pages for building nid cache, it cause the long latency.
This patch tries to readahead more NAT pages with READA request flag after
reloading of free nids. It helps to improve performance when users allocate
node id intensively.
Env: Sandisk 32G sd card
time for i in `seq 1 60000`; { echo -n > /mnt/f2fs/$i; echo XXXXXX > /mnt/f2fs/$i;}
Before:
real 0m2.814s
user 0m1.220s
sys 0m1.536s
After:
real 0m2.711s
user 0m1.136s
sys 0m1.568s
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, we use ra_meta_pages to reads continuous physical blocks as much as
possible to improve performance of following reads. However, ra_meta_pages
uses a synchronous readahead approach by submitting bio with READ, as READ
is with high priority, it can not be used in the case of preloading blocks,
and it's not sure when these RAed pages will be used.
This patch supports asynchronous readahead in ra_meta_pages by tagging bio
with READA flag in order to allow preloading.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In recovery or checkpoint flow, we grab pages temperarily in meta inode's
mapping for caching temperary data, actually, datas in these pages were
not meta data of f2fs, but still we tag them with REQ_META flag. However,
lower device like eMMC may do some optimization for data of such type.
So in order to avoid wrong optimization, we'd better remove such flag
for temperary non-meta pages.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a tracepoint for f2fs_read_data_pages to trace when pages
are readahead by VFS.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For normal inodes, their pages are allocated with __GFP_FS, which can cause
filesystem calls when reclaiming memory.
This can incur a dead lock condition accordingly.
So, this patch addresses this problem by introducing
f2fs_grab_cache_page(.., bool for_write), which calls
grab_cache_page_write_begin() with AOP_FLAG_NOFS.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The f2fs_collapse_range and f2fs_insert_range changes the block addresses
directly. But that can cause uncovered SSA updates.
In that case, we need to give up to change the block addresses and do buffered
writes to keep filesystem consistency.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The periodic checkpoint can resolve the previous issue.
So, now we can use this again to improve the reported performance regression:
https://lkml.org/lkml/2015/10/8/20
This reverts commit 15bec0ff5a9ba6d203178fa8772259df6207942a.
This patch introduces F2FS_GOING_DOWN_METAFLUSH which flushes meta pages like
SSA blocks and then blocks all the writes.
This can be used by power-failure tests.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a periodic checkpoint feature.
Note that, this is not enforcing to conduct checkpoints very strictly in terms
of trigger timing, instead just hope to help user experiences.
The default value is 60 seconds.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a tracepoint to monitor background gc behaviors.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduce a new ioctl for those users who want to trigger
checkpoint from userspace through ioctl.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch drops in batches gc triggered through ioctl, since user
can easily control the gc by designing the loop around the ->ioctl.
We support synchronous gc by forcing using FG_GC in f2fs_gc, so with
it, user can make sure that in this round all blocks gced were
persistent in the device until ioctl returned.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When searching victim during gc, if there are no dirty segments in
filesystem, we will still take the time to search the whole dirty segment
map, it's not needed, it's better to skip in this condition.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When doing gc, we search a victim in dirty map, starting from position of
last victim, we will reset the current searching position until we touch
the end of dirty map, and then search the whole diryt map. So sometimes we
will search the range [victim, last] twice, it's redundant, this patch
avoids this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Our hit stat of extent cache will increase all the time until remount,
and we use atomic_t type for the stat variable, so it may easily incur
overflow when we query extent cache frequently in a long time running
fs.
So to avoid that, this patch uses atomic64_t for hit stat variables.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we do not call get_victim first, we cannot get a new victim for retrial
path.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to maintain the right section count freed in garbage
collecting when triggering a foreground gc.
Besides, when a foreground gc is running on current selected section, once
we fail to gc one segment, it's better to abandon gcing the left segments
in current section, because anyway we will select next victim for
foreground gc, so gc on the left segments in previous section will become
overhead and also cause the long latency for caller.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to update ctime and atime correctly when truncating
larger in ->setattr.
The bug is reported by xfstest generic/313 as below:
generic/313 2s ... - output mismatch (see ./results/generic/313.out.bad)
--- tests/generic/313.out 2015-08-04 15:28:53.430798882 +0800
+++ results/generic/313.out.bad 2015-09-28 17:04:27.294278016 +0800
@@ -1,2 +1,4 @@
QA output created by 313
Silence is golden
+ctime not updated after truncate up
+mtime not updated after truncate up
...
(Run 'diff -u tests/generic/313.out tests/generic/313.out.bad' to see the entire diff)
Ran: generic/313
Failures: generic/313
Failed 1 of 1 tests
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we skip dentry block writes when wbc is SYNC_NONE with no memory
pressure and the number of dirty pages is pretty small.
But, we didn't skip for normal data writes, which gives us not much big impact
on overall performance.
Moreover, by skipping some data writes, kworker falls into infinite loop to try
to write blocks, when many dir inodes have only one dentry block.
So, this patch removes skipping data writes.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Protecting recovery flow by using cp_rwsem is not needed, since we have
prevent triggering any checkpoint by locking cp_mutex previously.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In update_sit_info, we use div_u64 to handle 'u64 divide u64' case, but
div_u64 can only handle 32-bits divisor, so our divisor with u64 type
passed to div_u64 will overflow, result in the wrong calculation when
show debug info of f2fs as below:
BDF: 464, avg. vblocks: 23509
(BDF should never exceed 100)
So change to use div64_u64 to handle this case correctly.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a new helper __try_update_largest_extent for cleanup.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This fixes error handling for calls to various functions in the
function recover_inline_data to check if these particular functions
either return a error code or the boolean value false to signal their
caller they have failed internally and if this arises return false
to signal failure immediately to the caller of recover_inline_data
as we cannot continue after failures to calling either the function
truncate_inline_inode or truncate_blocks.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Swith extent_cache option dynamically when remount may casue consistency
issue between extent cache and dnode page. Fix in this patch to avoid
that condition.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We introduce F2FS_GET_BLOCK_READ in commit e2b4e2bc88 ("f2fs: fix
incorrect mapping for bmap"), but forget to use this flag in the right
place, fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
truncate_data_blocks_range can do in batches truncation which makes all
changes in dnode page content, dnode page status, extent cache, block
count updating together.
But previously, truncate_hole() always truncates one block in dnode page
at a time by invoking truncate_data_blocks_range(,1), which make thing
slow.
This patch changes truncate_hole() to do in batches truncation for all
target blocks in one direct node inside truncate_data_blocks_range, which
can make our punch hole operation in ->fallocate more efficent.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix 2 potential problems:
1. when largest extent needs to be invalidated, it will be reset in
__drop_largest_extent, which makes __is_extent_same after always
return false, and largest extent unchanged. Now we update it properly.
2. when extent is split and the latter part remains in tree, next_en
should be the latter part instead of next extent of original extent.
It will cause merge failure if there is in-place update, although
there is not, I think this fix will still makes codes less ambiguous.
This patch also simplifies codes of invalidating extents, and optimizes the
procedues that split extent into two.
There are a few modifications after last patch:
1. prev_en now is updated properly.
2. more codes and branches are simplified.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
now we update extent by range, fofs may not be on the largest
extent if the new extent overlaps with it. so add a new function
to drop largest extent properly.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch avoids to produce new checkpoint blocks before the previous meta
pages were written completely.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We got dentry pages from high_mem, and its address space directly goes into the
decryption path via f2fs_fname_disk_to_usr.
But, sg_init_one assumes the address is not from high_mem, so we can get this
panic since it doesn't call kmap_high but kunmap_high is triggered at the end.
kernel BUG at ../../../../../../kernel/mm/highmem.c:290!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
...
(kunmap_high+0xb0/0xb8) from [<c0114534>] (__kunmap_atomic+0xa0/0xa4)
(__kunmap_atomic+0xa0/0xa4) from [<c035f028>] (blkcipher_walk_done+0x128/0x1ec)
(blkcipher_walk_done+0x128/0x1ec) from [<c0366c24>] (crypto_cbc_decrypt+0xc0/0x170)
(crypto_cbc_decrypt+0xc0/0x170) from [<c0367148>] (crypto_cts_decrypt+0xc0/0x114)
(crypto_cts_decrypt+0xc0/0x114) from [<c035ea98>] (async_decrypt+0x40/0x48)
(async_decrypt+0x40/0x48) from [<c032ca34>] (f2fs_fname_disk_to_usr+0x124/0x304)
(f2fs_fname_disk_to_usr+0x124/0x304) from [<c03056fc>] (f2fs_fill_dentries+0xac/0x188)
(f2fs_fill_dentries+0xac/0x188) from [<c03059c8>] (f2fs_readdir+0x1f0/0x300)
(f2fs_readdir+0x1f0/0x300) from [<c0218054>] (vfs_readdir+0x90/0xb4)
(vfs_readdir+0x90/0xb4) from [<c0218418>] (SyS_getdents64+0x64/0xcc)
(SyS_getdents64+0x64/0xcc) from [<c0105ba0>] (ret_fast_syscall+0x0/0x30)
Cc: <stable@vger.kernel.org>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this patch, we try to reorganize f2fs_map_blocks to make block mapping
flow more clear by using following structure:
/* check status of mapping */
if (unmapped) {
/* blkaddr == NULL_ADDR || blkaddr == NEW_ADDR */
if (create) {
/* write path, handle dio write case here */
alloc_and_map;
} else {
/*
* handle read cases from all call paths:
* 1. generic read;
* 2. dio read;
* 3. fiemap;
* 4. bmap
*/
}
}
/* map buffer_header */
Besides, this patch handles the missing case correctly for dio write:
When we fail in __allocate_data_blocks, then in f2fs_map_blocks, we will
not allocate blocks correctly for preallocated blocks, but returning with
an unmapped buffer head, which will result in failure of dio write.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We have potential overflow issue when calculating size of object, when
we left shift index with PAGE_CACHE_SHIFT bits, if type of index has only
32-bits space in 32-bit architecture, left shifting will incur overflow,
i.e:
pgoff_t index = 0xFFFFFFFF;
loff_t size = index << PAGE_CACHE_SHIFT;
size: 0xFFFFF000
So we should cast index with 64-bits type to avoid this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When shrinking extent cache, we have two steps in the flow:
1) shrink objects which are unreferenced by inodes;
2) shrink objects from LRU list of extent cache.
In step 1, if we haven't shrunk enough number of objects, we will try
step 2, but before that we didn't update the searching position which
may point to last inode index in global extent tree, result in failing
to shrink objects by traversing the all inodes' extent tree.
In this patch, we reset searching position to beginning of global extent
tree for fixing.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch changes to verify file type early in f2fs_fallocate for
cleanup, meanwhile this also fixes to add missing verification for
expand_inode_data.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As comment says, we don't need to call f2fs_lock_op in write_inode to prevent
from producing dirty node pages all the time.
That happens only when there is not enough free sections and we can avoid that
by calling balance_fs in prior to that.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This number is referenced by checkpoint under node_write lock.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This fixes the incorrect return statement at the end of the function
f2fs_ioc_release_volatile_write's body for returning zero as this is
incorrect due to the function call before this return statement to
the function punch_hole being able to fail and we should return this
function's return fail directly in order to signal to callers of the
function f2fs_ioc_release_volatile if a failure arises with this call
to punch_hole fails.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Rename trace_f2fs_update_extent_tree to trace_f2fs_update_extent_tree_range,
then expand and enable it to trace in batches extent info updates.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull f2fs updates from Jaegeuk Kim:
"The major work includes fixing and enhancing the existing extent_cache
feature, which has been well settling down so far and now it becomes a
default mount option accordingly.
Also, this version newly registers a f2fs memory shrinker to reclaim
several objects consumed by a couple of data structures in order to
avoid memory pressures.
Another new feature is to add ioctl(F2FS_GARBAGE_COLLECT) which
triggers a cleaning job explicitly by users.
Most of the other patches are to fix bugs occurred in the corner cases
across the whole code area"
* tag 'for-f2fs-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (85 commits)
f2fs: upset segment_info repair
f2fs: avoid accessing NULL pointer in f2fs_drop_largest_extent
f2fs: update extent tree in batches
f2fs: fix to release inode correctly
f2fs: handle f2fs_truncate error correctly
f2fs: avoid unneeded initializing when converting inline dentry
f2fs: atomically set inode->i_flags
f2fs: fix wrong pointer access during try_to_free_nids
f2fs: use __GFP_NOFAIL to avoid infinite loop
f2fs: lookup neighbor extent nodes for merging later
f2fs: split __insert_extent_tree_ret for readability
f2fs: kill dead code in __insert_extent_tree
f2fs: adjust showing of extent cache stat
f2fs: add largest/cached stat in extent cache
f2fs: fix incorrect mapping for bmap
f2fs: add annotation for space utilization of regular/inline dentry
f2fs: fix to update cached_en of extent tree properly
f2fs: fix typo
f2fs: check the node block address of newly allocated nid
f2fs: go out for insert_inode_locked failure
...
Pull core block updates from Jens Axboe:
"This first core part of the block IO changes contains:
- Cleanup of the bio IO error signaling from Christoph. We used to
rely on the uptodate bit and passing around of an error, now we
store the error in the bio itself.
- Improvement of the above from myself, by shrinking the bio size
down again to fit in two cachelines on x86-64.
- Revert of the max_hw_sectors cap removal from a revision again,
from Jeff Moyer. This caused performance regressions in various
tests. Reinstate the limit, bump it to a more reasonable size
instead.
- Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
Most devices have huge trim limits, which can cause nasty latencies
when deleting files. Enable the admin to configure the size down.
We will look into having a more sane default instead of UINT_MAX
sectors.
- Improvement of the SGP gaps logic from Keith Busch.
- Enable the block core to handle arbitrarily sized bios, which
enables a nice simplification of bio_add_page() (which is an IO hot
path). From Kent.
- Improvements to the partition io stats accounting, making it
faster. From Ming Lei.
- Also from Ming Lei, a basic fixup for overflow of the sysfs pending
file in blk-mq, as well as a fix for a blk-mq timeout race
condition.
- Ming Lin has been carrying Kents above mentioned patches forward
for a while, and testing them. Ming also did a few fixes around
that.
- Sasha Levin found and fixed a use-after-free problem introduced by
the bio->bi_error changes from Christoph.
- Small blk cgroup cleanup from Viresh Kumar"
* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
blk: Fix bio_io_vec index when checking bvec gaps
block: Replace SG_GAPS with new queue limits mask
block: bump BLK_DEF_MAX_SECTORS to 2560
Revert "block: remove artifical max_hw_sectors cap"
blk-mq: fix race between timeout and freeing request
blk-mq: fix buffer overflow when reading sysfs file of 'pending'
Documentation: update notes in biovecs about arbitrarily sized bios
block: remove bio_get_nr_vecs()
fs: use helper bio_add_page() instead of open coding on bi_io_vec
block: kill merge_bvec_fn() completely
md/raid5: get rid of bio_fits_rdev()
md/raid5: split bio for chunk_aligned_read
block: remove split code in blkdev_issue_{discard,write_same}
btrfs: remove bio splitting and merge_bvec_fn() calls
bcache: remove driver private bio splitting code
block: simplify bio_add_page()
block: make generic_make_request handle arbitrarily sized bios
blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
block: don't access bio->bi_error after bio_put()
block: shrink struct bio down to 2 cache lines again
...
This patch introduce a new helper f2fs_update_extent_tree_range which can
do extent mapping update at a specified range.
The main idea is:
1) punch all mapping info in extent node(s) which are at a specified range;
2) try to merge new extent mapping with adjacent node, or failing that,
insert the mapping into extent tree as a new node.
In order to see the benefit, I add a function for stating time stamping
count as below:
uint64_t rdtsc(void)
{
uint32_t lo, hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return (uint64_t)hi << 32 | lo;
}
My test environment is: ubuntu, intel i7-3770, 16G memory, 256g micron ssd.
truncation path: update extent cache from truncate_data_blocks_range
non-truncataion path: update extent cache from other paths
total: all update paths
a) Removing 128MB file which has one extent node mapping whole range of
file:
1. dd if=/dev/zero of=/mnt/f2fs/128M bs=1M count=128
2. sync
3. rm /mnt/f2fs/128M
Before:
total count average
truncation: 7651022 32768 233.49
Patched:
total count average
truncation: 3321 33 100.64
b) fsstress:
fsstress -d /mnt/f2fs -l 5 -n 100 -p 20
Test times: 5 times.
Before:
total count average
truncation: 5812480.6 20911.6 277.95
non-truncation: 7783845.6 13440.8 579.12
total: 13596326.2 34352.4 395.79
Patched:
total count average
truncation: 1281283.0 3041.6 421.25
non-truncation: 7355844.4 13662.8 538.38
total: 8637127.4 16704.4 517.06
1) For the updates in truncation path:
- we can see updating in batches leads total tsc and update count reducing
explicitly;
- besides, for a single batched updating, punching multiple extent nodes
in a loop, result in executing more operations, so our average tsc
increase intensively.
2) For the updates in non-truncation path:
- there is a little improvement, that is because for the scenario that we
just need to update in the head or tail of extent node, new interface
optimize to update info in extent node directly, rather than removing
original extent node for updating and then inserting that updated one
into cache as new node.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In following call stack, if unfortunately we lose all chances to truncate
inode page in remove_inode_page, eventually we will add the nid allocated
previously into free nid cache, this nid is with NID_NEW status and with
NEW_ADDR in its blkaddr pointer:
- f2fs_create
- f2fs_add_link
- __f2fs_add_link
- init_inode_metadata
- new_inode_page
- new_node_page
- set_node_addr(, NEW_ADDR)
- f2fs_init_acl failed
- remove_inode_page failed
- handle_failed_inode
- remove_inode_page failed
- iput
- f2fs_evict_inode
- remove_inode_page failed
- alloc_nid_failed cache a nid with valid blkaddr: NEW_ADDR
This may not only cause resource leak of previous inode, but also may cause
incorrect use of the previous blkaddr which is located in NO.nid node entry
when this nid is reused by others.
This patch tries to add this inode to orphan list if we fail to truncate
inode, so that we can obtain a second chance to release it in orphan
recovery flow.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to return error number of f2fs_truncate, so that we
can handle the error correctly in callers.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When converting inline dentry, we will zero out target dentry page before
duplicating data of inline dentry into target page, it become overhead
since inline dentry size is not small.
So this patch tries to remove unneeded initializing in the space of target
dentry page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
According to commit 5f16f3225b ("ext4: atomically set inode->i_flags in
ext4_set_inode_flags()").
Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
__GFP_NOFAIL can avoid retrying the whole path of kmem_cache_alloc and
bio_alloc.
And, it also fixes the use cases of GFP_ATOMIC correctly.
Suggested-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In __lookup_extent_tree_ret we will not try to find neighbor nodes if
we find the target node, in this condition, we will lost the chance to
merge the new mapping with exist extent node later.
So our extent cache of inode will be fragmented after overwrite exist
file, we can see the number of extent node increases intensively in
following test case:
dd if=/dev/zero of=/mnt/f2fs/4m bs=4K count=1024
Extent Cache:
- Hit Count: L1-1:0 L1-2:0 L2:0
- Hit Ratio: 0% (0 / 3072)
- Inner Struct Count: tree: 1, node: 1
dd if=/dev/zero of=/mnt/f2fs/4m bs=4K count=1024 conv=notrunc
Extent Cache:
- Hit Count: L1-1:2048 L1-2:0 L2:0
- Hit Ratio: 33% (2048 / 6144)
- Inner Struct Count: tree: 1, node: 961
This patch fixes to lookup neighbors of target node for further
merging.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch splits __insert_extent_tree_ret into __try_merge_extent_node &
__insert_extent_tree for code readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After commit 0f825ee6e8 ("f2fs: add new interfaces for extent tree"),
f2fs_init_extent_tree becomes the only caller of __insert_extent_tree, and
in f2fs_init_extent_tree, we will only insert extent node in an empty tree,
so __try_{back,front}_merge in __insert_extent_tree will never be called.
This patch removes these dead codes, besides, rename __insert_extent_tree
to __init_extent_tree for readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch alters to replace total hit stat with rbtree hit stat,
and then adjust showing of extent cache stat:
Hit Count:
L1-1: for largest node hit count;
L1-2: for last cached node hit count;
L2: for extent node hit after lookuping in rbtree.
Hit Ratio:
ratio (hit count / total lookup count)
Inner Struct Count:
tree count, node count.
Before:
Extent Hit Ratio: 0 / 2
Extent Tree Count: 3
Extent Node Count: 2
Patched:
Exten Cacache:
- Hit Count: L1-1:4871 L1-2:2074 L2:208
- Hit Ratio: 1% (7153 / 550751)
- Inner Struct Count: tree: 26560, node: 11824
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to stat the hit count of largest/cached node for showing
in debugfs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The test step is like below:
1. touch file
2. truncate -s $((1024*1024)) file
3. fallocate -o 0 -l $((1024*1024)) file
4. fibmap.f2fs file
Our result of fibmap.f2fs showed below is not correct:
file_pos start_blk end_blk blks
0 -937166132 -937166132 1
4096 -937166132 -937166132 1
8192 -937166132 -937166132 1
12288 -937166132 -937166132 1
16384 -937166132 -937166132 1
20480 -937166132 -937166132 1
...
1040384 -937166132 -937166132 1
1044480 -937166132 -937166132 1
This is because f2fs_map_blocks will return with no error when meeting
a hole or preallocated block, the caller __get_data_block will map the
uninitialized variable value to bh->b_blocknr.
Unfortunately generic_block_bmap will neither check the return value of
get_data() nor check mapping info of buffer_head, result in returning
the random block address.
After fixing the issue, our result shows correctly:
file_pos start_blk end_blk blks
0 0 0 256
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_lookup_extent_tree, et->cached_en was read and updated with only
read lock held,
it could cause __lookup_extent_tree within return entirely wrong
extent_node, if other
thread update et->cached_en just before __lookup_extent_tree return.
However, there are two things about this patch that need to be noticed:
1. It does no good to arrange the order of concurrent read/write, the result
would still
be random in such case.
2. It's built on this assumption: the mix up of reads and writes on a single
pointer would
not make the pointer partially wrong at any time. Please let me know if I'm
wrong, thx.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a routine which checks the block address of newly allocated nid.
If an nid has already allocated by other thread due to subtle data races, it
will result in filesystem corruption.
So, it needs to check whether its block address was already allocated or not
in prior to nid allocation as the last chance.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should not call unlock_new_inode when insert_inode_locked failed.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If FG_GC failed to reclaim one section, let's retry with another section
from the start, since we can get anoterh good candidate.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, update_inode_page is not called under f2fs_lock_op.
Instead we should call with f2fs_write_inode.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we can reuse nids as many as possible, we can mitigate producing obsolete
node pages in the page cache.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If node blocks were already moved, we don't need to move them again.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As the below comment of bio_alloc_bioset, f2fs can allocate multiple bios at the
same time. So, we can't guarantee that bio is allocated all the time.
"
* When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
* able to allocate a bio. This is due to the mempool guarantees. To make this
* work, callers must never allocate more than 1 bio at a time from this pool.
* Callers that need to allocate more than 1 bio must always submit the
* previously allocated bio for IO before attempting to allocate a new one.
* Failure to do so can cause deadlocks under memory pressure.
"
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch increases the number of maximum hard links for one file.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should avoid needless checkpoints when there is no dirty and prefree segment.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces __count_free_nids/try_to_free_nids and registers
them in slab shrinker for shrinking under memory pressure.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_delete_entry, if last dirent is remove from the dentry page,
we will try to punch that page since it has no valid date in it.
But truncate_hole which is used for punching could fail because of
no memory or IO error, if that happened, we'd better skip clearing
this valid dentry page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should not write node pages when deleting orphan inodes.
In order to do that, we can eaisly set POR_DOING flag earlier before entering
orphan inode routine.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In recover_orphan_inode, whenever f2fs_iget fail, we will make kernel panic,
but it's not reasonable, because f2fs_iget can fail due to a lot of reasons
including out of memory.
So we change error handling method as below:
a) when finding no entry for the orphan inode, bug_on for catching bugs;
b) for other reasons, report it to caller.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is not enough free segment, we should not assign a new segment
explicitly. Otherwise, we can run out of free segment.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Previously, we use radix tree to index all registered page entries for
atomic file, but now we only use radix tree to see whether current page
is indexed or not, since the other user of radix tree is gone in commit
042b7816aa ("f2fs: remove unnecessary call to invalidate inmemory pages").
So in this patch, we try to use one more efficient way:
Introducing a macro ATOMIC_WRITTEN_PAGE, and setting it as page private
value to indicate page indexing status. By using this way, we can save
memory and lookup time.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We run ltp testcase with f2fs and obtain a TFAIL in diotest4, the result in
detail is as fallow:
dio04
<<<test_start>>>
tag=dio04 stime=1432278894
cmdline="diotest4"
contacts=""
analysis=exit
<<<test_output>>>
diotest4 1 TPASS : Negative Offset
diotest4 2 TPASS : removed
diotest4 3 TFAIL : diotest4.c:129: write allows odd count.returns 1: Success
diotest4 4 TFAIL : diotest4.c:183: Odd count of read and write
diotest4 5 TPASS : Read beyond the file size
......
the result of ext4 with same environment:
dio04
<<<test_start>>>
tag=dio04 stime=1432259643
cmdline="diotest4"
contacts=""
analysis=exit
<<<test_output>>>
diotest4 1 TPASS : Negative Offset
diotest4 2 TPASS : removed
diotest4 3 TPASS : Odd count of read and write
diotest4 4 TPASS : Read beyond the file size
......
The reason is that when triggering DIO in f2fs, we will return zero value
in ->direct_IO if writer's buffer offset, file offset and transfer size is
not alignment to block size of filesystem, resulting in falling back into
buffered write instead of returning -EINVAL.
This patch fixes that problem by returning correct error number for above
case, and removing the judgement condition in check_direct_IO to make sure
the verification will be enabled for direct reader too.
Besides, Jaegeuk Kim pointed out that there is expectional cases we should
always make direct-io falling back into buffered write, such as dio in
encrypted file.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
[Chao Yu make small change and add detail description in commit message]
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
fill_zero can fail due to a lot of reason, but previously we do not handle
its return value, so its callers such as punch_hole/f2fs_zero_range may
report success, but actually can fail because of error occurs inside
fill_zero.
This patch fixes to report correct return value of fill_zero.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When testing with generic/101 in xfstests, error message outputed as below:
--- tests/generic/101.out
+++ results//generic/101.out.bad
@@ -10,10 +10,14 @@
File foo content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
-0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0372000
...
(Run 'diff -u tests/generic/101.out results/generic/101.out.bad' to see the entire diff)
The test flow is like below:
1. pwrite foo -S 0xaa 0 64K
2. pwrite foo -S 0xbb 64K 61K
3. sync
4. truncate foo 64K
5. truncate foo 125K
6. fsync foo
7. flakey drop writes
8. umount
After this test, we expect the data of recovered file will have the first
64k of data filling with value 0xaa and the next 61k of data filling with
value 0x00 because we have fsynced it before dropping writes in dm.
In f2fs, during recovering, we will only recover the valid block address
in direct node page if it is marked as a fsynced dnode, but block address
which means invalid/reserved (with value NULL_ADDR/NEW_ADDR) will not be
recovered. So, the file recovered shows its incorrect data 0xbb in range of
[61k, 125k].
In this patch, we fix to recover invalid/reserved block during recover flow.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In some cases, we only need the block address when we call
f2fs_reserve_block,
other fields of struct dnode_of_data aren't necessary.
We can try extent cache first for such cases in order to speed up the
process.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
To avoid meeting garbage data in next free node block at the end of warm
node chain when doing recovery, we will try to zero out that invalid block.
If the device is not support discard, our way for zeroing out block is:
grabbing a temporary zeroed page in meta inode, then, issue write request
with this page.
But, we forget to release that temporary page, so our memory usage will
increase without gaining any hit ratio benefit, so it's better to free it
for saving memory.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In following call path, we will pass a locked and referenced ipage
pointer to get_new_data_page:
- init_inode_metadata
- make_empty_dir
- get_new_data_page
There are two exit paths in get_new_data_page when error occurs:
1) grab_cache_page fails, ipage will not be released;
2) f2fs_reserve_block fails, ipage will be released in callee.
So, it's not consistent for error handling in get_new_data_page.
For f2fs_reserve_block, it's not very easy to change the rule
of error handling, since it's already complicated.
Here we deside to choose an easy way to fix this issue:
If any error occur in get_new_data_page, we will ensure releasing
ipage in this function.
The same issue is in f2fs_convert_inline_dir, fix that too.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Replace BUG_ON with f2fs_bug_on to deal with
block and segment validity check failed.
Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In get_meta_page, we guarantee no failure for the returned page,
but sometimes, IO error from device will incur returning an
non-updated page.
Then, we still use this page as updated one, exception could happen
when using this kind of page.
So in this condition, we'd better freeze fs by making fs readonly and
and stop doing checkpoint.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
some backing devices need pages to be stable during writeback. It doesn't
matter if
the page is completely overwritten or already uptodate, it needs to wait
before write.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to handle error cases in commit_inmem_pages.
If an error occurs, it stops to write the pages and return the error right
away.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When there is no enough free nids in free nid cache, we will try to
readahead FREE_NID_PAGES:4 nat pages into page cache of meta_inode,
then, reading nat entries in nat page for adding free nids to free nid
cache.
But when traversing all nat pages we readaheaded in a circulation,
our exit condition is not set right, one more nat page will be scanned
without readaheading, resulting worse read performance.
This patch fixes to read the correct number nat pages to avoid bad
performance.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we clear inline data/dentry flag in handle_failed_inode, we will fail
to decline the stat count of inline data/dentry in f2fs_evict_inode due
to no flag in inode. So remove the wrong clearing.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_ioc_start_{atomic,volatile}_write, if we failed in converting
inline data, we will report error to user, but still remain atomic/volatile
flag in inode, it will impact further writes for this file. Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes the incorrect range (0, LONG_MAX) which is used
in ranged fsync. If we use LONG_MAX as the parameter for indicating
the end of file we want to synchronize, in 32-bits architecture
machine, these datas after 4GB offset may not be persisted in
storage after ->fsync returned.
Here, we alter LONG_MAX to LLONG_MAX to fix this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When flushing comes from background, if there is no dirty page in the
mapping of inode, we'd better to skip seeking dirty page from mapping
for writebacking.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The if statement "goto continue_unlock" is exactly the same when
each if condition is true that is depended on the value of both
"step" and "is_cold_data(page)" are 0 or 1. That means when the
value of "step" equals to "is_cold_data(page)", the if condition
is true and the if statement "goto continue_unlock" appears only
once, so it can be optimized to reduce the duplicated code.
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In handle_failed_inode, there is a potential deadlock which can happen
in below call path:
- f2fs_create
- f2fs_lock_op down_read(cp_rwsem)
- f2fs_add_link
- __f2fs_add_link
- init_inode_metadata
- f2fs_init_security failed
- truncate_blocks failed
- handle_failed_inode
- f2fs_truncate
- truncate_blocks(..,true)
- write_checkpoint
- block_operations
- f2fs_lock_all down_write(cp_rwsem)
- f2fs_lock_op down_read(cp_rwsem)
So in this path, we pass parameter to f2fs_truncate to make sure
cp_rwsem in truncate_blocks will not be locked again.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_do_collapse, region cp_rwsem covered is large, since it will be
held until all blocks are left shifted, so if we try to collapse small
area at the beginning of large file, checkpoint who want to grab writer's
lock of cp_rwsem will be delayed for long time.
In order to avoid this condition, altering to lock/unlock cp_rwsem each
shift operation.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a lookup and a insertion interface for extent tree.
The new lookup return the insert position and the prev/next
extents closest to the offset we lookup when find no match.
The new insertion uses above parameters to improve performance.
There are three possible insertions after the lookup in
f2fs_update_extent_tree, two of them insert parts of removed extent
back to tree, since no merge happens during this process, new insertion
skips the merge check in this scanario; the another insertion inserts a
new extent to tree, new insertion uses prev/next extent and insert
position to insert this extent directly, and save the time of searching
down the tree.
As long as tree remains unchanged between lookup and insertion, this
would work fine. And the new lookup would be useful when add
multi-blocks extent support for insertion interface.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch changes for a caller to handle the page after its bio gets an error.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Variables for recording extent cache ratio info were updated without
protection, this patch tries to alter them to atomic_t type for more
accurate stat.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to stat the number of inline xattr inode for
showing in debugfs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
That encrypted page is used temporarily, so we don't need to mark it accessed.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there are gced dirty pages and normal dirty pages in the mapping
of one inode, we might writeback them alternately with discontinuous
block address, resulting in low performance.
This patch introduces f2fs_write_cache_pages with codes copied from
write_cache_pages in mm/page-writeback.c.
In this function, we refactor flow with two steps:
1) writeback all cold type pages.
2) writeback all non-cold type pages.
By using this method, f2fs will writeback dirty pages with the same
temperature in bunch mode, it makes writeouted block being with
more continuous address, so they can be merged as much as possible
in f2fs bio cache, and also it will reduce the chance of submiting
small IO from block layer.
Test environment: 8g nokia sd card (very old sd card, but it shows
better effect when testing with this patch, and with a 32g kingston
sd card, I didn't see much more improvement).
Test step:
1. touch testfile;
2. truncate -s 512K testfile;
3. write all pages with odd index;
4. trigger gc by ioctl;
5. write all pages with even index;
6. time fsync testfile.
before:
real 0m0.402s
user 0m0.000s
sys 0m0.000s
after:
real 0m0.143s
user 0m0.004s
sys 0m0.004s
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to return correct error number of ->setxattr, which
is reported by xfstest tests/generic/026 as below:
generic/026 - output mismatch
--- tests/generic/026.out
+++ results/generic/026.out.bad
@@ -4,6 +4,6 @@
1 below acl max
acl max
1 above acl max
-chacl: cannot set access acl on "largeaclfile": Argument list too long
+chacl: cannot set access acl on "largeaclfile": Numerical result out of range
use 16 aces
use 17 aces
...
Ran: generic/026
Failures: generic/026
Failed 1 of 1 tests
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, since 'commit 4531929e39 ("f2fs: move grabing orphan
pages out of protection region")' was committed, in write_orphan_inodes(),
we will grab all meta page in a batch before we use them under spinlock,
so that we can avoid large time delay of grabbing meta pages under
spinlock.
Now, 'commit d6c67a4fee ("f2fs: revmove spin_lock for
write_orphan_inodes")' remove the spinlock in write_orphan_inodes,
so there is no issue we describe above, we'd better recover to move
the grab operation to original place for readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With cost-benifit method, background gc will consider old section with
fewer valid blocks as candidate victim, these old blocks in section will
be treated as cold data, and laterly will be moved into cold segment.
But if the gcing page is attached by user through buffered or mmaped
write, we should reset the page as non-cold one, because this page may
have more opportunity for further updating.
So fix to add clearing code for the missed 'mmap' case.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When background gc is off, the only way to trigger gc is executing
a force gc in some operations who wants to grab space in disk.
The executing condition is limited: to execute force gc, we should
wait for the time when there is almost no more free section for LFS
allocation. This seems not reasonable for our user who wants to
control triggering gc by himself.
This patch introduces F2FS_IOC_GARBAGE_COLLECT interface for
triggering garbage collection by using ioctl. It provides our users
one more option to trigger gc.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch moves extent cache related code from data.c into extent_cache.c
since extent cache is independent feature, and its codes are not relate to
others in data.c, it's better for us to maintain them in separated place.
There is no functionality change, but several small coding style fixes
including:
* rename __drop_largest_extent to f2fs_drop_largest_extent for exporting;
* rename misspelled word 'untill' to 'until';
* remove unneeded 'return' in the end of f2fs_destroy_extent_tree().
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since only parts of extents longer than F2FS_MIN_EXTENT_LEN will
be kept in extent cache after split, extents already shorter than
F2FS_MIN_EXTENT_LEN don't need to try split at all.
Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to update page flag (e.g. Uptodate/cold flag) in
->write_begin.
Otherwise, page will be non-uptodate when we try to write entire
page, and cold data flag in page will not be clean when gced page
is being rewritten.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If an extent_tree entry has a zero reference count, we can drop it from the
cache in higher priority rather than currently referencing entries.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In ->writepages, we use writepages mutex lock to serialize all block
address allocation and page submitting pairs from different inodes.
This method makes our delayed dirty pages of one inode being written
continously as many as possible.
But there is one problem that we did not submit current cached bio in
protection region of writepages mutex lock, so there is a small chance
that we submit the one of other thread's as below, resulting in
splitting more bios.
thread 1 thread 2
->writepages
lock(writepages)
->write_cache_pages
unlock(writepages)
lock(writepages)
->write_cache_pages
->f2fs_submit_merged_bio
->writepage
unlock(writepages)
fs_mark-6535 [002] .... 2242.270230: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5766152, size = 524288
fs_mark-6536 [000] .... 2242.270361: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5767176, size = 4096
fs_mark-6536 [000] .... 2242.270370: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, NODE, sector = 8138112, size = 4096
fs_mark-6535 [002] .... 2242.270776: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5767184, size = 516096
This may really increase time of block layer works, and may cause
larger IO lantency.
This patch moves the submitting operation into region of writepages
mutex lock to avoid bio splits when concurrently writebacking is
intensive.
my test environment: virtual machine,
intel cpu i5 2500, 8GB size memory, 4GB size ramdisk
time fs_mark -t 16 -L 1 -s 524288 -S 1 -d /mnt/f2fs/
before:
real 0m4.244s
user 0m0.088s
sys 0m12.336s
after:
real 0m3.822s
user 0m0.072s
sys 0m10.760s
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When testing with fs_mark, some blocks were written out as cold
data which were mixed with warm data, resulting in splitting more
bios.
This is because fs_mark will create file with random filename as
below:
559551ee~~~~~~~~15Z29OCC05JCKQP60JQ42MKV
559551ee~~~~~~~~NZAZ6X8OA8LHIIP6XD0L58RM
559551ef~~~~~~~~B15YDSWAK789HPSDZKYTW6WM
559551f1~~~~~~~~2DAE5DPS79785BUNTFWBEMP3
559551f1~~~~~~~~1MYDY0BKSQCJPI32Q8C514RM
559551f1~~~~~~~~YQOTMAOMN5CVRFOUNI026MP4
559551f3~~~~~~~~1WF42LPRTQJNPPGR3EINKMPE
559551f3~~~~~~~~8Y2NRK7CEPPAA02LY936PJPG
They are regarded as cold file since their filename are ended with
multimedia files' extension, but this should be wrong as we only
match the extension of filename, not the whole one.
In this patch, we try to fix the format of multimedia filename to:
"filename + '.' + extension", then we set cold file only its
filename matches the format.
So after this change, it will reduce the probability we set the
wrong cold file, also it helps a little for fs_mark's performance
on f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This makes the function check_dnode have a return type of bool
due to this particular function only ever returning either one
or zero as its return value and changes the name of the function
to is_alive in order to better explain this function's intended
work of checking if a dnode is still in use by the filesystem.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
[Jaegeuk Kim: change the return value check for the renamed function]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Because of the extent shrinker or other -ENOMEM scenarios, it cannot guarantee
that the largest extent would be cached in the tree all the time.
Instead of relying on extent_tree, we can simply check the cached one in extent
tree accordingly.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't need to handle the duplicate extent information.
The integrated rule is:
- update on-disk extent with largest one tracked by in-memory extent_cache
- destroy extent_tree for the truncation case
- drop per-inode extent_cache by shrinker
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a shrinker targeting to reduce memory footprint consumed
by a number of in-memory f2fs data structures.
In addition, it newly adds:
- sbi->umount_mutex to avoid data races on shrinker and put_super
- sbi->shruinker_run_no to not revisit objects
Note that the basic implementation was copied from fs/ubifs/shrinker.c
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch relocates cached_en not only to be covered by spin_lock, but also
to set once after checking out completely.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, f2fs_update_extent_cache() updates in-memory extent_cache all the
time, and then finally preserves its up-to-date extent into on-disk one during
f2fs_evict_inode.
But, in the following scenario:
1. mount
2. open & write an extent X
3. f2fs_evict_inode; on-disk extent is X
4. open & update the extent X with Y
5. sync; trigger checkpoint
6. power-cut
after power-on, f2fs should serve extent Y, but we have an on-disk extent X.
This causes a failure on xfstests/311.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes wrong calculation on block address field when an extent is
split.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For newly added fallocate types, it should convert inline_data before handling
block swapping.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Before iput is called, the inode number used by a bad inode can be reassigned
to other new inode, resulting in any abnormal behaviors on the new inode.
This should not happen for the new inode.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The write_checkpoint can update stat information, so we should destroy the stat
structure after it.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Dirty page can be exist in mapping of newly created symlink, but previously
we did not maintain the counting of dirty page for symlink like we maintained
for regular/directory, so the counting we lookuped should be wrong.
This patch adds missed dirty page counting for symlink to fix this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The key_put() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull cgroup writeback support from Jens Axboe:
"This is the big pull request for adding cgroup writeback support.
This code has been in development for a long time, and it has been
simmering in for-next for a good chunk of this cycle too. This is one
of those problems that has been talked about for at least half a
decade, finally there's a solution and code to go with it.
Also see last weeks writeup on LWN:
http://lwn.net/Articles/648292/"
* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
writeback, blkio: add documentation for cgroup writeback support
vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
writeback: do foreign inode detection iff cgroup writeback is enabled
v9fs: fix error handling in v9fs_session_init()
bdi: fix wrong error return value in cgwb_create()
buffer: remove unusued 'ret' variable
writeback: disassociate inodes from dying bdi_writebacks
writeback: implement foreign cgroup inode bdi_writeback switching
writeback: add lockdep annotation to inode_to_wb()
writeback: use unlocked_inode_to_wb transaction in inode_congested()
writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
writeback: implement [locked_]inode_to_wb_and_lock_list()
writeback: implement foreign cgroup inode detection
writeback: make writeback_control track the inode being written back
writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
writeback: implement memcg writeback domain based throttling
writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
writeback: implement memcg wb_domain
writeback: update wb_over_bg_thresh() to use wb_domain aware operations
...
Pull f2fs updates from Jaegeuk Kim:
"New features:
- per-file encryption (e.g., ext4)
- FALLOC_FL_ZERO_RANGE
- FALLOC_FL_COLLAPSE_RANGE
- RENAME_WHITEOUT
Major enhancement/fixes:
- recovery broken superblocks
- enhance f2fs_trim_fs with a discard_map
- fix a race condition on dentry block allocation
- fix a deadlock during summary operation
- fix a missing fiemap result
.. and many minor bug fixes and clean-ups were done"
* tag 'for-f2fs-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (83 commits)
f2fs: do not trim preallocated blocks when truncating after i_size
f2fs crypto: add alloc_bounce_page
f2fs crypto: fix to handle errors likewise ext4
f2fs: drop the volatile_write flag only
f2fs: skip committing valid superblock
f2fs: setting discard option in parse_options()
f2fs: fix to return exact trimmed size
f2fs: support FALLOC_FL_INSERT_RANGE
f2fs: hide common code in f2fs_replace_block
f2fs: disable the discard option when device doesn't support
f2fs crypto: remove alloc_page for bounce_page
f2fs: fix a deadlock for summary page lock vs. sentry_lock
f2fs crypto: clean up error handling in f2fs_fname_setup_filename
f2fs crypto: avoid f2fs_inherit_context for symlink
f2fs crypto: do not set encryption policy for non-directory by ioctl
f2fs crypto: allow setting encryption policy once
f2fs crypto: check context consistent for rename2
f2fs: avoid duplicated code by reusing f2fs_read_end_io
f2fs crypto: use per-inode tfm structure
f2fs: recovering broken superblock during mount
...
When we perform generic/092 in xfstests, output is like below:
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
0: [0..10239]: data
0: [0..10239]: data
-1: [10240..20479]: unwritten
+1: [10240..14335]: unwritten
This is because with this testcase, we redefine the regulation for
truncate in perallocated space past i_size as below:
"There was some confused about what the fs was supposed to do when you
truncate at i_size with preallocated space past i_size. We decided on the
following things.
1) truncate(i_size) will trim all blocks past i_size.
2) truncate(x) where x > i_size will not trim all blocks past i_size.
"
This method is used in xfs, and then ext4/btrfs will follow the rule.
This patch fixes to follow the new rule for f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In recovery procedure for superblock, we try to write data of valid
superblock into invalid one for recovery, work should be finished here,
but then still we will write the valid one with its original data.
This operation is not needed. Let's skip doing this unnecessary work.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For the first mount of f2fs image with realtime discard option, we will
disable discard option if device is not supported, but for remount
operation, our discard option can still be set, this should be avoided.
This patch moves configuring of discard option to parse_options() to fix
this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, we add all the candidates for trim commands and then finally issue
discard commands.
So, we should count the trimmed size in back-end.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
FALLOC_FL_INSERT_RANGE flag for ->fallocate was introduced in commit
dd46c78778 ("fs: Add support FALLOC_FL_INSERT_RANGE for fallocate").
The effect of FALLOC_FL_INSERT_RANGE command is the opposite of
FALLOC_FL_COLLAPSE_RANGE, if this command was performed, all data from
offset to EOF in our file will be shifted to right as given length, and
then range [offset, offset + length] becomes a hole.
This command is useful for our user who wants to add some data in the
middle of the file, for example: video/music editor will insert a keyframe
in specified position of media file, with this command we can easily create
a hole for inserting without removing original data.
This patch introduces f2fs_insert_range() to support FALLOC_FL_INSERT_RANGE.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch clean up codes through:
1.rename f2fs_replace_block to __f2fs_replace_block().
2.introduce new f2fs_replace_block() to include __f2fs_replace_block()
and some common related codes around __f2fs_replace_block().
Then, newly introduced function f2fs_replace_block can be used by
following patch.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.
This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it. c files
which need access to more backing-dev details now include
backing-dev.h directly. This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.
v2: fs/fat build failure fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.
This patch moves bandwidth related fields from backing_dev_info into
bdi_writeback.
* The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
balanced_dirty_ratelimit, completions and dirty_exceeded.
* writeback_chunk_size() and over_bground_thresh() now take @wb
instead of @bdi.
* bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
bdi_position_ration(bdi, ...) -> wb_position_ratio(wb, ...)
bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
[__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)
* Init/exits of the relocated fields are moved to bdi_wb_init/exit()
respectively. Note that explicit zeroing is dropped in the process
as wb's are cleared in entirety anyway.
* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
introducing no behavior changes.
v2: Typo in description fixed as suggested by Jan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Current f2fs check the whether the blk device can support discard.
However, the code will cause the discard option cannot be enabled.
Because the clear_opt(sbi, DISCARD) will be invoked forever.
This patch can fix this issue.
Jaegeuk Kim:
The original patch was intended to disable the discard option when device
does not support trim command.
Rather than remaining the buggy patch, let's replace with this patch as
an integrated one.
Signed-off-by: Chenxi Mao <chenxi.mao2013@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't need to call alloc_page() prior to mempool_alloc(), since the
mempool_alloc() calls alloc_page() internally.
And, if __GFP_WAIT is set, it never fails on page allocation, so let's
give GFP_NOWAIT and handle ENOMEM by writepage().
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to call f2fs_inherit_context twice for newly created symlink.
The original one is called by f2fs_add_link(), which invokes f2fs_setxattr.
If the second one is called again, f2fs_setxattr is triggered again with same
encryption index.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Encryption policy should only be set to an empty directory through ioctl,
This patch add a judgement condition to verify type of the target inode
to avoid incorrectly configuring for non-directory.
Additionally, remove unneeded inline data conversion since regular or symlink
file should not be processed here.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch add XATTR_CREATE flag in setxattr when setting encryption
context for inode. Without this flag the context could be set more than
once, this should never happen. So, fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For exchange rename, we should check context consistent of encryption
between new_dir and old_inode or old_dir and new_inode. Otherwise
inheritance of parent's encryption context will be broken.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: sync with ext4 approach]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to clean up code because part code of f2fs_read_end_io
and mpage_end_io are the same, so it's better to merge and reuse them.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch applies the following ext4 patch:
ext4 crypto: use per-inode tfm structure
As suggested by Herbert Xu, we shouldn't allocate a new tfm each time
we read or write a page. Instead we can use a single tfm hanging off
the inode's crypt_info structure for all of our encryption needs for
that inode, since the tfm can be used by multiple crypto requests in
parallel.
Also use cmpxchg() to avoid races that could result in crypt_info
structure getting doubly allocated or doubly freed.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch recovers a broken superblock with the other valid one.
Signed-off-by: hujianyang <hujianyang@huawei.com>
[Jaegeuk Kim: reinitialize local variables in f2fs_fill_super for retrial]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As the description of rename in manual, RENAME_WHITEOUT is a special operation
that only makes sense for overlay/union type filesystem.
When performing rename with RENAME_WHITEOUT, dst will be replace with src, and
meanwhile, a 'whiteout' will be create with name of src.
A "whiteout" is designed to be a char device with 0,0 device number, it has
specially meaning for stackable filesystem. In these filesystems, there are
multiple layers exist, and only top of these can be modified. So a whiteout
in top layer is used to hide a corresponding file in lower layer, as well
removal of whiteout will make the file appear.
Now in overlayfs, when we rename a file which is exist in lower layer, it
will be copied up to upper if it is not on upper layer yet, and then rename
it on upper layer, source file will be whiteouted to hide corresponding file
in lower layer at the same time.
So in upper layer filesystem, implementation of RENAME_WHITEOUT provide a
atomic operation for stackable filesystem to support rename operation.
There are multiple ways to implement RENAME_WHITEOUT in log of this commit:
7dcf5c3e45 ("xfs: add RENAME_WHITEOUT support") which pointed out by
Dave Chinner.
For now, we just try to follow the way that xfs/ext4 use.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a help function update_meta_page() to update meta page with specified
buffer.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now page cache of meta inode is used by garbage collection for encrypted page,
it may contain random data, so we should zero it before issuing discard.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch splits f2fs_crypto_init/exit with two parts: base initialization and
memory allocation.
Firstly, f2fs module declares the base encryption memory pointers.
Then, allocating internal memories is done at the first encrypted inode access.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When encryption feature is enable, if we rmmod f2fs module,
we will encounter a stack backtrace reported in syslog:
"BUG: Bad page state in process rmmod pfn:aaf8a
page:f0f4f148 count:0 mapcount:129 mapping:ee2f4104 index:0x80
flags: 0xee2830a4(referenced|lru|slab|private_2|writeback|swapbacked|mlocked)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
flags: 0x2030a0(lru|slab|private_2|writeback|mlocked)
Modules linked in: f2fs(O-) fuse bnep rfcomm bluetooth dm_crypt binfmt_misc snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm
snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device joydev ppdev mac_hid lp hid_generic i2c_piix4
parport_pc psmouse snd serio_raw parport soundcore ext4 jbd2 mbcache usbhid hid e1000 [last unloaded: f2fs]
CPU: 1 PID: 3049 Comm: rmmod Tainted: G B O 4.1.0-rc3+ #10
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
00000000 00000000 c0021eb4 c15b7518 f0f4f148 c0021ed8 c112e0b7 c1779174
c9b75674 000aaf8a 01b13ce1 c17791a4 f0f4f148 ee2830a4 c0021ef8 c112e3c3
00000000 f0f4f148 c0021f34 f0f4f148 ee2830a4 ef9f0000 c0021f20 c112fdf8
Call Trace:
[<c15b7518>] dump_stack+0x41/0x52
[<c112e0b7>] bad_page.part.72+0xa7/0x100
[<c112e3c3>] free_pages_prepare+0x213/0x220
[<c112fdf8>] free_hot_cold_page+0x28/0x120
[<c1073380>] ? try_to_wake_up+0x2b0/0x2b0
[<c112ff15>] __free_pages+0x25/0x30
[<c112c4fd>] mempool_free_pages+0xd/0x10
[<c112c5f1>] mempool_free+0x31/0x90
[<f0f441cf>] f2fs_exit_crypto+0x6f/0xf0 [f2fs]
[<f0f456c4>] exit_f2fs_fs+0x23/0x95f [f2fs]
[<c10c30e0>] SyS_delete_module+0x130/0x180
[<c11556d6>] ? vm_munmap+0x46/0x60
[<c15bd888>] sysenter_do_call+0x12/0x12"
The reason is that:
since commit 0827e645fd35
("f2fs crypto: shrink size of the f2fs_crypto_ctx structure") is merged,
some fields in f2fs_crypto_ctx structure are merged into a union as they
will never be used simultaneously in write path, read path or on free list.
In f2fs_exit_crypto, we traverse each crypto ctx from free list, in this
moment, our free_list field in union is valid, but still we will try to
release memory space which is pointed by other invalid field in union
structure for each ctx.
Then the error occurs, let's fix it with this patch.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes memory leak issue in error path of f2fs_fname_setup_filename().
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch integrates the below patch into f2fs.
"ext4 crypto: shrink size of the ext4_crypto_ctx structure
Some fields are only used when the crypto_ctx is being used on the
read path, some are only used on the write path, and some are only
used when the structure is on free list. Optimize memory use by using
a union."
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch integrates the below patch into f2fs.
"ext4 crypto: get rid of ci_mode from struct ext4_crypt_info
The ci_mode field was superfluous, and getting rid of it gets rid of
an unused hole in the structure."
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch integrates the below patch into f2fs.
"ext4 crypto: use slab caches
Use slab caches the ext4_crypto_ctx and ext4_crypt_info structures for
slighly better memory efficiency and debuggability."
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Hu reported, F2FS has a space leak problem, when conducting:
1) format a 4GB f2fs partition
2) dd a 3G file,
3) unlink it.
So, when doing f2fs_drop_inode(), we need to truncate data blocks
before skipping it.
We can also drop unused caches assigned to each inode.
Reported-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The return was not indented far enough so it looked like it was supposed
to go with the other if statement.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
A bug fix to the debug output extended the type of some local
variables to 64-bit, which now causes the kernel to fail building
because of missing 64-bit division functions:
ERROR: "__aeabi_uldivmod" [fs/f2fs/f2fs.ko] undefined!
In the kernel, we have to use div_u64 or do_div to do this,
in order to annotate that this is an expensive operation.
As the function is only called for debug out, we know this
is not performance critical, so it is safe to use div_u64.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: d1f85bd38db19 ("f2fs: avoid value overflow in showing current status")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
introduce compat_ioctl to regular files, but doesn't add this
functionality to f2fs_dir_operations.
While running a 32-bit busybox, I met an error like this:
(A is a directory)
chattr: reading flags on A: Inappropriate ioctl for device
This patch copies compat_ioctl from f2fs_file_operations and
fix this problem.
Signed-off-by: hujianyang <hujianyang@huawei.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
I use f2fs filesystem with /data partition on my Android phone
by the default mount options. When I remount /data in order to
adding discard option to run some benchmarks, I find the default
options such as background_gc, user_xattr and acl turned off.
So I introduce a function named default_options in super.c. It do
some default setting, and both mount and remount operations will
call this function to complete default setting.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
No matter what the key is valid or not, readdir shows the dir entries correctly.
So, lookup should not failed.
But, we expect further accesses should be denied from open, rename, link, and so
on.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
1. mount $mnt
2. cp data $mnt/
3. umount $mnt
4. log out
5. log in
6. cat $mnt/data
-> panic, due to no i_crypt_info.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch implements encryption support for symlink.
Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a bit flag to indicate whether or not i_name in the inode
is encrypted.
If this name is encrypted, we can't do recover_dentry during roll-forward.
So, f2fs_sync_file() needs to do checkpoint, if this will be needed in future.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch implements filename encryption support for f2fs_lookup.
Note that, f2fs_find_entry should be outside of f2fs_(un)lock_op().
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds encryption support in read and write paths.
Note that, in f2fs, we need to consider cleaning operation.
In cleaning procedure, we must avoid encrypting and decrypting written blocks.
So, this patch implements move_encrypted_block().
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch activates the following APIs for encryption support.
The rules quoted by ext4 are:
- An unencrypted directory may contain encrypted or unencrypted files
or directories.
- All files or directories in a directory must be protected using the
same key as their containing directory.
- Encrypted inode for regular file should not have inline_data.
- Encrypted symlink and directory may have inline_data and inline_dentry.
This patch activates the following APIs.
1. f2fs_link : validate context
2. f2fs_lookup : ''
3. f2fs_rename : ''
4. f2fs_create/f2fs_mkdir : inherit its dir's context
5. f2fs_direct_IO : do buffered io for regular files
6. f2fs_open : check encryption info
7. f2fs_file_mmap : ''
8. f2fs_setattr : ''
9. f2fs_file_write_iter : '' (Called by sys_io_submit)
10. f2fs_fallocate : do not support fcollapse
11. f2fs_evict_inode : free_encryption_info
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds filename encryption infra.
Most of codes are copied from ext4 part, but changed to adjust f2fs
directory structure.
Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch copies from encrypt_key.c in ext4, and modifies for f2fs.
Use GFP_NOFS, since _f2fs_get_encryption_info is called under f2fs_lock_op.
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Most of parts were copied from ext4, except:
- add f2fs_restore_and_release_control_page which returns control page and
restore control page
- remove ext4_encrypted_zeroout()
- remove sbi->s_file_encryption_mode & sbi->s_dir_encryption_mode
- add f2fs_end_io_crypto_work for mpage_end_io
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds encryption policy and password salt support through ioctl
implementation.
It adds three ioctls:
F2FS_IOC_SET_ENCRYPTION_POLICY,
F2FS_IOC_GET_ENCRYPTION_POLICY,
F2FS_IOC_GET_ENCRYPTION_PWSALT, which use xattr operations.
Note that, these definition and codes are taken from ext4 crypto support.
For f2fs, xattr operations and on-disk flags for superblock and inode were
changed.
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds f2fs encryption config.
This patch integrates:
"ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is enabled
On arm64 this is apparently needed for CTS mode to function correctly.
Otherwise attempts to use CTS return ENOENT."
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes overflow when do cat /sys/kernel/debug/f2fs/status.
If a section is relatively large, dist value can be overflowed.
Reported-by: Yossi Goldfill <ygoldfill@radianmemory.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, FALLOC_FL_ZERO_RANGE flag in ->fallocate is supported in ext4/xfs.
In commit, the semantics of this flag is descripted as following:"
1) Make sure that both offset and len are block size aligned.
2) Update the i_size of inode by len bytes.
3) Compute the file's logical block number against offset. If the computed
block number is not the starting block of the extent, split the extent
such that the block number is the starting block of the extent.
4) Shift all the extents which are lying between
[offset, last allocated extent] towards right by len bytes. This step
will make a hole of len bytes at offset."
This patch implements fallocate's FALLOC_FL_ZERO_RANGE for f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, FALLOC_FL_COLLAPSE_RANGE flag in ->fallocate is supported in ext4/xfs.
In commit, the semantics of this flag is descripted as following:"
1) It collapses the range lying between offset and length by removing any
data blocks which are present in this range and than updates all the
logical offsets of extents beyond "offset + len" to nullify the hole
created by removing blocks. In short, it does not leave a hole.
2) It should be used exclusively. No other fallocate flag in combination.
3) Offset and length supplied to fallocate should be fs block size aligned
in case of xfs and ext4.
4) Collaspe range does not work beyond i_size."
This patch implements fallocate's FALLOC_FL_COLLAPSE_RANGE for f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce a generic function replace_block base on recover_data_page,
and export it. So with it we can operate file's meta data which is in
CP/SSA area when we invoke fallocate with FALLOC_FL_COLLAPSE_RANGE
flag.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In set_node_addr, we try to lookup cached nat entry of inode and then
set flag in it.
But previously in this function, we have already grabbed nat entry with
current node id, if the node id is the same as the one of inode, we
do not need to lookup it in cache again.
So this patch adds condition judgment for reducing unneeded lookup.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove f2fs_make_empty() declaration, since the main body of this function
is move into do_make_empty_dir() and the function is obsolete now.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch determines to issue discard commands by comparing given minlen and
the length of produced final candidates.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a bitmap for discard issues from f2fs_trim_fs.
There-in rule is to issue discard commands only for invalidated blocks
after mount.
Once mount is done, f2fs_trim_fs trims out whole invalid area.
After ehn, it will not issue and discrads redundantly.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes spin_lock, since this is covered by f2fs_lock_op already.
And, we should avoid to use page operations inside spin_lock.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch splits find_data_page as follows.
1. f2fs_gc
- use get_read_data_page() with read only
2. find_in_level
- use find_data_page without locked page
3. truncate_partial_page
- In the case cache_only mode, just drop cached page.
- Ohterwise, use get_lock_data_page() and guarantee to truncate
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are two threads:
f2fs_delete_entry() get_new_data_page()
f2fs_reserve_block()
dn.blkaddr = XXX
lock_page(dentry_block)
truncate_hole()
dn.blkaddr = NULL
unlock_page(dentry_block)
lock_page(dentry_block)
fill the block from XXX address
add new dentries
unlock_page(dentry_block)
Later, f2fs_write_data_page() will truncate the dentry_block, since
its block address is NULL.
The reason for this was due to the wrong lock order.
In this case, we should do f2fs_reserve_block() after locking its dentry block.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds f2fs_sb_info and page pointers in f2fs_io_info structure.
With this change, we can reduce a lot of parameters for IO functions.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch implements f2fs_mpage_readpages for further optimization on
encryption support.
The basic code was taken from fs/mpage.c, and changed to be simple by adjusting
that block_size is equal to page_size in f2fs.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
a) instead of storing the symlink body (via nd_set_link()) and returning
an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
that opaque pointer (into void * passed by address by caller) and returns
the symlink body. Returning ERR_PTR() on error, NULL on jump (procfs magic
symlinks) and pointer to symlink body for normal symlinks. Stored pointer
is ignored in all cases except the last one.
Storing NULL for opaque pointer (or not storing it at all) means no call
of ->put_link().
b) the body used to be passed to ->put_link() implicitly (via nameidata).
Now only the opaque pointer is. In the cases when we used the symlink body
to free stuff, ->follow_link() now should store it as opaque pointer in addition
to returning it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
has_fsynced_inode() has no other caller out of node.c, make it static.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the punch_hole(), if offset bigger than inode size, it returns SUCCESS.
Then f2fs_fallocate() will update time and dirty mark.
In that case, inode has not been modified actually.
So I have added offset check routine that prevent to call the punch_hole().
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Export is_valid_blkaddr() and use it to replace some codes for readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Our f2fs_acl_create is copied from posix_acl_create in ./fs/posix_acl.c and
modified to avoid deadlock bug when inline_dentry feature is enabled.
Dan Carpenter rewrites posix_acl_create in commit 2799563b281f
("fs/posix_acl.c: make posix_acl_create() safer and cleaner") to make this
function more safer, so that we can avoid potential bug in its caller,
especially for ocfs2.
Let's back port the patch to f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The page_follow_link_light returns NULL and its error pointer was remained
in nd->path.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reports performance regression by Yuanhan Liu.
The basic idea was to reduce one-point mutex, but it turns out this causes
another contention like context swithes.
https://lkml.org/lkml/2015/4/21/11
Until finishing the analysis on this issue, I'd like to revert this for a while.
This reverts commit 78373b7319.
Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only
Pull f2fs updates from Jaegeuk Kim:
"New features:
- in-memory extent_cache
- fs_shutdown to test power-off-recovery
- use inline_data to store symlink path
- show f2fs as a non-misc filesystem
Major fixes:
- avoid CPU stalls on sync_dirty_dir_inodes
- fix some power-off-recovery procedure
- fix handling of broken symlink correctly
- fix missing dot and dotdot made by sudden power cuts
- handle wrong data index during roll-forward recovery
- preallocate data blocks for direct_io
... and a bunch of minor bug fixes and cleanups"
* tag 'for-f2fs-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (71 commits)
f2fs: pass checkpoint reason on roll-forward recovery
f2fs: avoid abnormal behavior on broken symlink
f2fs: flush symlink path to avoid broken symlink after POR
f2fs: change 0 to false for bool type
f2fs: do not recover wrong data index
f2fs: do not increase link count during recovery
f2fs: assign parent's i_mode for empty dir
f2fs: add F2FS_INLINE_DOTS to recover missing dot dentries
f2fs: fix mismatching lock and unlock pages for roll-forward recovery
f2fs: fix sparse warnings
f2fs: limit b_size of mapped bh in f2fs_map_bh
f2fs: persist system.advise into on-disk inode
f2fs: avoid NULL pointer dereference in f2fs_xattr_advise_get
f2fs: preallocate fallocated blocks for direct IO
f2fs: enable inline data by default
f2fs: preserve extent info for extent cache
f2fs: initialize extent tree with on-disk extent info of inode
f2fs: introduce __{find,grab}_extent_tree
f2fs: split set_data_blkaddr from f2fs_update_extent_cache
f2fs: enable fast symlink by utilizing inline data
...
This patch adds CP_RECOVERY to remain recovery information for checkpoint.
And, it makes sure writing checkpoint in this case.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When f2fs_symlink was triggered and checkpoint was done before syncing its
link path, f2fs can get broken symlink like "xxx -> \0\0\0".
This incurs abnormal path_walk by VFS.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to avoid broken symlink case after POR in best effort.
This results in performance regression.
But, if f2fs has inline_data and the target path is under 3KB-sized long,
the page would be stored in its inode_block, so that there would be no
performance regression.
Note that, if user wants to keep this file atomically, it needs to trigger
dir->fsync.
And, there is still a hole to produce broken symlink.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
in the f2fs_fill_super function, variable "retry" is bool type
i think that it should be set as false.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
that's the bulk of filesystem drivers dealing with inodes of their own
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most filesystems call through to these at some point, so we'll start
here.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
During the roll-forward recovery, if we found a new data index written fsync
lastly, we need to recover new block address.
But, if that address was corrupted, we should not recover that.
Otherwise, f2fs gets kernel panic from:
In check_index_in_prev_nodes(),
sentry = get_seg_entry(sbi, segno);
--------------------------> out-of-range segno.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there are multiple fsynced dnodes having a dent flag, roll-forward routine
sets FI_INC_LINK for their inode, and recovery_dentry increases its link count
accordingly.
That results in normal file having a link count as 2, so we can't unlink those
files.
This was added to handle several inode blocks having same inode number with
different directory paths.
But, current f2fs doesn't replay all of path changes and only recover its dentry
for the last fsynced inode block.
So, there is no reason to do this.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If f2fs was corrupted with missing dot dentries, it needs to recover them after
fsck.f2fs detection.
The underlying precedure is:
1. The fsck.f2fs remains F2FS_INLINE_DOTS flag in directory inode, if it detects
missing dot dentries.
2. When f2fs looks up the corrupted directory, it triggers f2fs_add_link with
proper inode numbers and their dot and dotdot names.
3. Once f2fs recovers the directory without errors, it removes F2FS_INLINE_DOTS
finally.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes the below warning.
sparse warnings: (new ones prefixed by >>)
>> fs/f2fs/inode.c:56:23: sparse: restricted __le32 degrades to integer
>> fs/f2fs/inode.c:56:52: sparse: restricted __le32 degrades to integer
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Map bh over max size which caller defined is not needed, limit it in
f2fs_map_bh.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to dirty inode for persisting i_advise of f2fs inode info into
on-disk inode if user sets system.advise through setxattr. Otherwise the new
value will be lost.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Normally, due to DIO_SKIP_HOLES flag is set by default, blockdev_direct_IO in
f2fs_direct_IO tries to skip DIO in holes when writing inside i_size, this
makes us falling back to buffered IO which shows lower performance.
So in commit 59b802e5a4 ("f2fs: allocate data blocks in advance for
f2fs_direct_IO"), we improve perfromance by allocating data blocks in advance
if we meet holes no matter in i_size or not, since with it we can avoid falling
back to buffered IO.
But we forget to consider for unwritten fallocated block in this commit.
This patch tries to fix it for fallocate case, this helps to improve
performance.
Test result:
Storage info: sandisk ultra 64G micro sd card.
touch /mnt/f2fs/file
truncate -s 67108864 /mnt/f2fs/file
fallocate -o 0 -l 67108864 /mnt/f2fs/file
time dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=64 conv=notrunc oflag=direct
Time before applying the patch:
67108864 bytes (67 MB) copied, 36.16 s, 1.9 MB/s
real 0m36.162s
user 0m0.000s
sys 0m0.180s
Time after applying the patch:
67108864 bytes (67 MB) copied, 27.7776 s, 2.4 MB/s
real 0m27.780s
user 0m0.000s
sys 0m0.036s
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Enable inline_data feature by default since it brings us better
performance and space utilization and now has already stable.
Add another option noinline_data to disable it during mount.
Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Suggested-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to preserve last extent info in extent tree cache into on-disk
inode, so this can help us to reuse the last extent info next time for
performance.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With normal extent info cache, we records largest extent mapping between logical
block and physical block into extent info, and we persist extent info in on-disk
inode.
When we enable extent tree cache, if extent info of on-disk inode is exist, and
the extent is not a small fragmented mapping extent. We'd better to load the
extent info into extent tree cache when inode is loaded. By this way we can have
more chance to hit extent tree cache rather than taking more time to read dnode
page for block address.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces __{find,grab}_extent_tree for reusing by following
patches.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Split __set_data_blkaddr from f2fs_update_extent_cache for readability.
Additionally rename __set_data_blkaddr to set_data_blkaddr for exporting.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fast symlink can utilize inline data flow to avoid using any
i_addr region, since we need to handle many cases such as
truncation, roll-forward recovery, and fsck/dump tools.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch is to avoid some punch_hole overhead when releasing volatile data.
If volatile data was not written yet, we just can make the first page as zero.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes wrong f2fs_bug_on in truncate_inline_inode.
When there is no space, it can happen a corner case where i_isze is over
MAX_INLINE_SIZE while its inode is still inline_data.
The scenario is
1. write small data into file #A.
2. fill the whole partition to 100%.
3. truncate 4096 on file #A.
4. write data at 8192 offset.
--> f2fs_write_begin
-> -ENOSPC = f2fs_convert_inline_page
-> f2fs_write_failed
-> truncate_blocks
-> truncate_inline_inode
BUG_ON, since i_size is 4096.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, f2fs_write_data_pages has a mutex, sbi->writepages, to serialize
data writes to maximize write bandwidth, while sacrificing multi-threads
performance.
Practically, however, multi-threads environment is much more important for
users. So this patch tries to remove the mutex.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch modifies to call set_buffer_new, if new blocks are allocated.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to set SBI_NEED_FSCK flag into sbi only when we fail to recover
in fill_super, so we could skip fscking image when we fail to fill super for
other reason.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the following call stack, f2fs changes the bitmap for dirty segments and # of
dirty sentries without grabbing sit_i->sentry_lock.
This can result in mismatch on bitmap and # of dirty sentries, since if there
are some direct_io operations.
In allocate_data_block,
- __allocate_new_segments
- mutex_lock(&curseg->curseg_mutex);
- s_ops->allocate_segment
- new_curseg/change_curseg
- reset_curseg
- __set_sit_entry_type
- __mark_sit_entry_dirty
- set_bit(dirty_sentries_bitmap)
- dirty_sentries++;
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In __allocate_data_blocks, we should check current blkaddr which is located at
ofs_in_node of dnode page instead of checking first blkaddr all the time.
Otherwise we can only allocate one blkaddr in each dnode page. Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously if inode is with inline data, we will try to invalid partial inline
data in page #0 when we truncate size of inode in truncate_partial_data_page().
And then we set page #0 to dirty, after this we can synchronize inode page with
page #0 at ->writepage().
But sometimes we will fail to operate page #0 in truncate_partial_data_page()
due to below reason:
a) if offset is zero, we will skip setting page #0 to dirty.
b) if page #0 is not uptodate, we will fail to update it as it has no mapping
data.
So with following operations, we will meet recent data which should be
truncated.
1.write inline data to file
2.sync first data page to inode page
3.truncate file size to 0
4.truncate file size to max_inline_size
5.echo 1 > /proc/sys/vm/drop_caches
6.read file --> meet original inline data which is remained in inode page.
This patch renames truncate_inline_data() to truncate_inline_inode() for code
readability, then use truncate_inline_inode() to truncate inline data in inode
page in truncate_blocks() and truncate page #0 in truncate_partial_data_page()
for fixing.
v2:
o truncate partially #0 page in truncate_partial_data_page to avoid keeping
old data in #0 page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Our f2fs_acl_create is copied and modified from posix_acl_create to avoid
deadlock bug when inline_dentry feature is enabled.
Now, we got reference leaks in posix_acl_create, and this has been fixed in
commit fed0b588be ("posix_acl: fix reference leaks in posix_acl_create")
by Omar Sandoval.
https://lkml.org/lkml/2015/2/9/5
Let's fix this issue in f2fs_acl_create too.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Changman Lee <cm224.lee@ssamsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When lookuping for creating, we will try to record the level of current dentry
hash table if current dentry has enough contiguous slots for storing name of new
file which will be created later, this can save our lookup time when add a link
into parent dir.
But currently in find_target_dentry, our current length of contiguous free slots
is not calculated correctly. This make us leaving some holes in dentry block
occasionally, it wastes our space of dentry block.
Let's refactor the lookup flow for max slots as following to fix this issue:
a) increase max_len if current slot is free;
b) update max_slots with max_len if max_len is larger than max_slots;
c) reset max_len to zero if current slot is not free.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
nm_i->nat_tree_lock is used to sync both the operations of nat entry
cache tree and nat set cache tree, however, it isn't held when flush
nat entries during checkpoint which lead to potential race, this patch
fix it by holding the lock when gang lookup nat set cache and delete
item from nat set cache.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Through each macro, we can read the meaning easily.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove the unnecessary condition judgment, because
'max_slots' has been initialized to '0' at the beginging
of the function, as following:
if (max_slots)
*max_slots = 0;
Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The function 'find_in_inline_dir()' contain 'res_page'
as an argument. So, we should initiaize 'res_page' before
this function.
Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In __set_free we will check whether all segment are free in one section
when free one segment, in order to set section to free status. But the
searching region of segmap is from start segno to last segno of main
area, it's not necessary. So let's just only check all segment bitmap
of target section.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
extent tree/node slab cache is created during f2fs insmod,
how, it isn't destroyed during f2fs rmmod, this patch fix
it by destroy extent tree/node slab cache once rmmod f2fs.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The f2fs has been shipped on many smartphone devices during a couple of years.
So, it is worth to relocate Kconfig into main page from misc filesystems for
developers to choose it more easily.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If inode has inline_data, it should report -ENOENT when accessing out-of-bound
region.
This is used by f2fs_fiemap which treats -ENOENT with no error.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When fsync is done through checkpoint, previous f2fs missed to clear append
and update flag. This patch fixes to clear them.
This was originally catched by Changman Lee before.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch is for looking into gc performance of f2fs in detail.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: fix build errors]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now in f2fs, we share functions and structures for batch mode and real-time mode
discard. For real-time mode discard, in shared function add_discard_addrs, we
will use uninitialized trim_minlen in struct cp_control to compare with length
of contiguous free blocks to decide whether skipping discard fragmented freespace
or not, this makes us ignore small discard sometimes. Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by : Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In a preempt-off enviroment a alot of FS activity (write/delete) I run
into a CPU stall:
| NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u2:2:59]
| Modules linked in:
| CPU: 0 PID: 59 Comm: kworker/u2:2 Tainted: G W 3.19.0-00010-g10c11c51ffed #153
| Workqueue: writeback bdi_writeback_workfn (flush-179:0)
| task: df230000 ti: df23e000 task.ti: df23e000
| PC is at __submit_merged_bio+0x6c/0x110
| LR is at f2fs_submit_merged_bio+0x74/0x80
…
| [<c00085c4>] (gic_handle_irq) from [<c0012e84>] (__irq_svc+0x44/0x5c)
| Exception stack(0xdf23fb48 to 0xdf23fb90)
| fb40: deef3484 ffff0001 ffff0001 00000027 deef3484 00000000
| fb60: deef3440 00000000 de426000 deef34ec deefc440 df23fbb4 df23fbb8 df23fb90
| fb80: c02191f0 c0218fa0 60000013 ffffffff
| [<c0012e84>] (__irq_svc) from [<c0218fa0>] (__submit_merged_bio+0x6c/0x110)
| [<c0218fa0>] (__submit_merged_bio) from [<c02191f0>] (f2fs_submit_merged_bio+0x74/0x80)
| [<c02191f0>] (f2fs_submit_merged_bio) from [<c021624c>] (sync_dirty_dir_inodes+0x70/0x78)
| [<c021624c>] (sync_dirty_dir_inodes) from [<c0216358>] (write_checkpoint+0x104/0xc10)
| [<c0216358>] (write_checkpoint) from [<c021231c>] (f2fs_sync_fs+0x80/0xbc)
| [<c021231c>] (f2fs_sync_fs) from [<c0221eb8>] (f2fs_balance_fs_bg+0x4c/0x68)
| [<c0221eb8>] (f2fs_balance_fs_bg) from [<c021e9b8>] (f2fs_write_node_pages+0x40/0x110)
| [<c021e9b8>] (f2fs_write_node_pages) from [<c00de620>] (do_writepages+0x34/0x48)
| [<c00de620>] (do_writepages) from [<c0145714>] (__writeback_single_inode+0x50/0x228)
| [<c0145714>] (__writeback_single_inode) from [<c0146184>] (writeback_sb_inodes+0x1a8/0x378)
| [<c0146184>] (writeback_sb_inodes) from [<c01463e4>] (__writeback_inodes_wb+0x90/0xc8)
| [<c01463e4>] (__writeback_inodes_wb) from [<c01465f8>] (wb_writeback+0x1dc/0x28c)
| [<c01465f8>] (wb_writeback) from [<c0146dd8>] (bdi_writeback_workfn+0x2ac/0x460)
| [<c0146dd8>] (bdi_writeback_workfn) from [<c003c3fc>] (process_one_work+0x11c/0x3a4)
| [<c003c3fc>] (process_one_work) from [<c003c844>] (worker_thread+0x17c/0x490)
| [<c003c844>] (worker_thread) from [<c0041398>] (kthread+0xec/0x100)
| [<c0041398>] (kthread) from [<c000ed10>] (ret_from_fork+0x14/0x24)
As it turns out, the code loops in sync_dirty_dir_inodes() and waits for
others to make progress but since it never leaves the CPU there is no
progress made. At the time of this stall, there is also a rm process
blocked:
| rm R running 0 1989 1774 0x00000000
| [<c047c55c>] (__schedule) from [<c00486dc>] (__cond_resched+0x30/0x4c)
| [<c00486dc>] (__cond_resched) from [<c047c8c8>] (_cond_resched+0x4c/0x54)
| [<c047c8c8>] (_cond_resched) from [<c00e1aec>] (truncate_inode_pages_range+0x1f0/0x5e8)
| [<c00e1aec>] (truncate_inode_pages_range) from [<c00e1fd8>] (truncate_inode_pages+0x28/0x30)
| [<c00e1fd8>] (truncate_inode_pages) from [<c00e2148>] (truncate_inode_pages_final+0x60/0x64)
| [<c00e2148>] (truncate_inode_pages_final) from [<c020c92c>] (f2fs_evict_inode+0x4c/0x268)
| [<c020c92c>] (f2fs_evict_inode) from [<c0137214>] (evict+0x94/0x140)
| [<c0137214>] (evict) from [<c01377e8>] (iput+0xc8/0x134)
| [<c01377e8>] (iput) from [<c01333e4>] (d_delete+0x154/0x180)
| [<c01333e4>] (d_delete) from [<c0129870>] (vfs_rmdir+0x114/0x12c)
| [<c0129870>] (vfs_rmdir) from [<c012d644>] (do_rmdir+0x158/0x168)
| [<c012d644>] (do_rmdir) from [<c012dd90>] (SyS_unlinkat+0x30/0x3c)
| [<c012dd90>] (SyS_unlinkat) from [<c000ec40>] (ret_fast_syscall+0x0/0x4c)
As explained by Jaegeuk Kim:
|This inode is the directory (c.f., do_rmdir) causing a infinite loop on
|sync_dirty_dir_inodes.
|The sync_dirty_dir_inodes tries to flush dirty dentry pages, but if the
|inode is under eviction, it submits bios and do it again until eviction
|is finished.
This patch adds a cond_resched() (as suggested by Jaegeuk) after a BIO
is submitted so other thread can make progress.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[Jaegeuk Kim: change fs/f2fs to f2fs in subject as naming convention]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
cp_payload is introduced for sit bitmap to support large volume, and it is
just after the block of f2fs_checkpoint + nat bitmap, so the first segment
should include F2FS_CP_PACKS + NR_CURSEG_TYPE + cp_payload + orphan blocks.
However, current max orphan inodes calculation don't consider cp_payload,
this patch fix it by reducing the number of cp_payload from total blocks of
the first segment when calculate max orphan inodes.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't need to collect dirty sit entries and flush sit journal to sit
entries when there's no dirty sit entries. This patch check dirty_sentries
earlier just like flush_nat_entries.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
block operations is used to flush all dirty node and dentry blocks in
the page cache and suspend ordinary writing activities, however, there
are some facts such like cp error or mount read-only etc which lead to
block operations can't be invoked. Current trace point print block_ops
start premature even if block_ops doesn't have opportunity to execute.
This patch fix it by move block_ops trace point just before block_ops.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a page is cached but its block was deallocated, we don't need to make
the page dirty again by gc and truncate_partial_data_page.
In that case, it needs to check its block allocation all the time instead
of giving up-to-date page.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If page's on-disk block was deallocated, let's remove up-to-date flag to avoid
further access with wrong contents.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
cp_pack_start_sum is calculated in do_checkpoint and is equal to
cpu_to_le32(1 + cp_payload_blks + orphan_blocks). The number of
orphan inode blocks is take advantage of by recover_orphan_inodes
to readahead meta pages and recovery inodes. However, current codes
forget to reduce the number of cp payload blocks when calculate
the number of orphan inode blocks. This patch fix it.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a generic ioctl for fs shutdown, which was used by xfs.
If this shutdown is triggered, filesystem stops any further IOs according to the
following options.
1. FS_GOING_DOWN_FULLSYNC
: this will flush all the data and dentry blocks, and do checkpoint before
shutdown.
2. FS_GOING_DOWN_METASYNC
: this will do checkpoint before shutdown.
3. FS_GOING_DOWN_NOSYNC
: this will trigger shutdown as is.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
During the roll-forward recovery, -ENOENT for f2fs_iget can be skipped.
So, this error value should not be propagated.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We update extent cache for all user inode of f2fs including dir inode, so this
patch gives another chance to try to get physical address of page from extent
cache for dir inode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch switch to check FI_NO_EXTENT in f2fs_{lookup,update}_extent_cache
instead of f2fs_{lookup,update}_extent_tree or {lookup,update}_extent_info.
No functionality modification in this patch.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a fast lookup path for rb-tree extent cache.
In this patch we add a recently accessed extent node pointer 'cached_en' in
extent tree. In lookup path of extent cache, we will firstly lookup the last
accessed extent node which cached_en points, if we do not hit in this node,
we will try to lookup extent node in rb-tree.
By this way we can avoid unnecessary slow lookup in rb-tree sometimes.
Note that, side-effect of this patch is that we will increase memory cost,
because we will store a pointer variable in each struct extent tree
additionally.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds trace for lookup/update/shrink/destroy ops in rb-tree extent cache.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch add and show stat info of total memory footprint for extent tree,node
in debugfs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch enables rb-tree based extent cache in f2fs.
When we mount with "-o extent_cache", f2fs will try to add recently accessed
page-block mappings into rb-tree based extent cache as much as possible, instead
of original one extent info cache.
By this way, f2fs can support more effective cache between dnode page cache and
disk. It will supply high hit ratio in the cache with fewer memory when dnode
page cache are reclaimed in environment of low memory.
Storage: Sandisk sd card 64g
1.append write file (offset: 0, size: 128M);
2.override write file (offset: 2M, size: 1M);
3.override write file (offset: 4M, size: 1M);
...
4.override write file (offset: 48M, size: 1M);
...
5.override write file (offset: 112M, size: 1M);
6.sync
7.echo 3 > /proc/sys/vm/drop_caches
8.read file (size:128M, unit: 4k, count: 32768)
(time dd if=/mnt/f2fs/128m bs=4k count=32768)
Extent Hit Ratio:
before patched
Hit Ratio 121 / 1071 1071 / 1071
Performance:
before patched
real 0m37.051s 0m35.556s
user 0m0.040s 0m0.026s
sys 0m2.990s 0m2.251s
Memory Cost:
before patched
Tree Count: 0 1 (size: 24 bytes)
Node Count: 0 45 (size: 1440 bytes)
v3:
o retest and given more details of test result.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a mount option 'extent_cache' in f2fs.
It is try to use a rb-tree based extent cache to cache more mapping information
with less memory if this option is set, otherwise we will use the original one
extent info cache.
Suggested-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds core functions including slab cache init function and
init/lookup/update/shrink/destroy function for rb-tree based extent cache.
Thank Jaegeuk Kim and Changman Lee as they gave much suggestion about detail
design and implementation of extent cache.
Todo:
* register rb-based extent cache shrink with mm shrink interface.
v2:
o move set_extent_info and __is_{extent,back,front}_mergeable into f2fs.h.
o introduce __{attach,detach}_extent_node for code readability.
o add cond_resched() when fail to invoke kmem_cache_alloc/radix_tree_insert.
o fix some coding style and typo issues.
v3:
o fix oops due to using an unassigned pointer.
o use list_del to remove extent node in shrink list.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: add static for some funcitons and declare in f2fs.h]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce infra macro and data structure for rb-tree based extent cache:
Macros:
* EXT_TREE_VEC_SIZE: indicate vector size for gang lookup in extent tree.
* F2FS_MIN_EXTENT_LEN: indicate minimum length of extent managed in cache.
* EXTENT_CACHE_SHRINK_NUMBER: indicate number of extent in cache will be shrunk.
Basic data structures for extent cache:
* struct extent_tree: extent tree entry per inode.
* struct extent_node: extent info node linked in extent tree.
Besides, adding new extent cache related fields in f2fs_sb_info.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this patch, we do these jobs:
1. rename {check,update}_extent_cache to {lookup,update}_extent_info;
2. introduce universal lookup/update interface of extent cache:
f2fs_{lookup,update}_extent_cache including above two real functions, then
export them to function callers.
So after above cleanup, we can add new rb-tree based extent cache into exported
interfaces.
v2:
o remove "f2fs_" for inner function {lookup,update}_extent_info suggested by
Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_map_bh to clean codes of check_extent_cache.
v2:
o cleanup f2fs_map_bh pointed out by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Rename a filed name from 'blk_addr' to 'blk' in struct {f2fs_extent,extent_info}
as annotation of this field descripts its meaning well to us.
By this way, we can avoid long statement in code of following patches.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Move ext_lock out of struct extent_info, then in the following patches we can
use variables with struct extent_info type as a parameter to pass pure data.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should stat inline data information for temp file in f2fs_tmpfile if we
enable inline_data feature.
Otherwise, inline data stat number will be wrong after this temp file is
evicted.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch uses ->writepage of meta mapping in sync_meta_pages instead of
f2fs_write_meta_page, by this way, in its caller we can ignore any changes
(e.g. changing name) of this registered function.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_update_dentry to remove redundant code in
f2fs_add_inline_entry and __f2fs_add_link.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
inline_dentry_addr is introduced with inline dentry feature without being used,
now we do not need to keep it for any reason, so remove it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull f2fs updates from Jaegeuk Kim:
"Major changes are to:
- add f2fs_io_tracer and F2FS_IOC_GETVERSION
- fix wrong acl assignment from parent
- fix accessing wrong data blocks
- fix wrong condition check for f2fs_sync_fs
- align start block address for direct_io
- add and refactor the readahead flows of FS metadata
- refactor atomic and volatile write policies
But most of patches are for clean-ups and minor bug fixes. Some of
them refactor old code too"
* tag 'for-f2fs-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (64 commits)
f2fs: use spinlock for segmap_lock instead of rwlock
f2fs: fix accessing wrong indexed data blocks
f2fs: avoid variable length array
f2fs: fix sparse warnings
f2fs: allocate data blocks in advance for f2fs_direct_IO
f2fs: introduce macros to convert bytes and blocks in f2fs
f2fs: call set_buffer_new for get_block
f2fs: check node page contents all the time
f2fs: avoid data offset overflow when lseeking huge file
f2fs: fix to use highmem for pages of newly created directory
f2fs: introduce a batched trim
f2fs: merge {invalidate,release}page for meta/node/data pages
f2fs: show the number of writeback pages in stat
f2fs: keep PagePrivate during releasepage
f2fs: should fail mount when trying to recover data on read-only dev
f2fs: split UMOUNT and FASTBOOT flags
f2fs: avoid write_checkpoint if f2fs is mounted readonly
f2fs: support norecovery mount option
f2fs: fix not to drop mount options when retrying fill_super
f2fs: merge flags in struct f2fs_sb_info
...
rwlock can provide better concurrency when there are much more readers than
writers because readers can hold the rwlock simultaneously.
But now, for segmap_lock rwlock in struct free_segmap_info, there is only one
reader 'mount' from below call path:
->f2fs_fill_super
->build_segment_manager
->build_dirty_segmap
->init_dirty_segmap
->find_next_inuse
read_lock
...
read_unlock
Now that our concurrency can not be improved since there is no other reader for
this lock, we do not need to use rwlock_t type for segmap_lock, let's replace it
with spinlock_t type.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes the following test.
This causes:
attempt to access beyond end of device
sdb2: rw=16384, want=14413962000, limit=16777216
The reason is:
- f2fs_write_begin
- f2fs_convert_inline_inode returns -ENOSPC
- f2fs_write_failed
- truncate_blocks
- truncate_partial_data_page
- find_data_page
- get_dnode_of_data returns wrong data index retrieved from inline_data
- f2fs_submit_page_bio(wrong data index)
- submit_bio(wrong data index)
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch resolves the following warnings.
include/trace/events/f2fs.h:150:1: warning: expression using sizeof bool
include/trace/events/f2fs.h:180:1: warning: expression using sizeof bool
include/trace/events/f2fs.h:990:1: warning: expression using sizeof bool
include/trace/events/f2fs.h:990:1: warning: expression using sizeof bool
include/trace/events/f2fs.h:150:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
include/trace/events/f2fs.h:180:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
include/trace/events/f2fs.h:990:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
include/trace/events/f2fs.h:990:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
fs/f2fs/checkpoint.c:27:19: warning: symbol 'inode_entry_slab' was not declared. Should it be static?
fs/f2fs/checkpoint.c:577:15: warning: cast to restricted __le32
fs/f2fs/checkpoint.c:592:15: warning: cast to restricted __le32
fs/f2fs/trace.c:19:1: warning: symbol 'pids' was not declared. Should it be static?
fs/f2fs/trace.c:21:21: warning: symbol 'last_io' was not declared. Should it be static?
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds two macros for transition between byte and block offsets.
Currently, f2fs only supports 4KB blocks, so use the default size for now.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes wrong handling of buffer_new flag in get_block.
If f2fs allocates new blocks and mapped buffer_head, it needs to set buffer_new
for the bh_result.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In get_node_page, if the page is up-to-date, we assumed that the page was not
reclaimed at all.
But, sometimes it was reported that its contents was missing.
So, just for sure, let's check its mapping and contents.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
xfstest generic/285 complains our issue in lseeking huge file.
Here is the detail output of generic/285:
"./check -f2fs tests/generic/285
Ran: generic/285
Failures: generic/285
Failed 1 of 1 tests
10. Test a huge file for offset overflow
10.01 SEEK_HOLE expected 65536 or 8589934592, got 65536. succ
10.02 SEEK_HOLE expected 65536 or 8589934592, got 65536. succ
10.03 SEEK_DATA expected 0 or 0, got 0. succ
10.04 SEEK_DATA expected 1 or 1, got 1. succ
10.05 SEEK_HOLE expected 8589934592 or 8589934592, got 0. FAIL
10.06 SEEK_DATA expected 8589869056 or 8589869056, got 8589869056. succ
10.07 SEEK_DATA expected 8589869057 or 8589869057, got 8589869057. succ
10.08 SEEK_DATA expected 8589869056 or 8589869056, got 4294901760. FAIL"
The reason of this issue is:
We will calculate current offset through left shifting page-offset with
PAGE_CACHE_SHIFT bits, but our page-offset is a type of unsigned long, its size
is 4 bytes in 32-bits machine.
So if our page-offset is bigger than (1 << 32 / pagesize - 1), result of left
shifting will overflow.
Let's fix this issue by casting type of page-offset to type of current offset:
loff_t.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In commit a78186ebe5 ("f2fs: use highmem for directory pages"), we have set
__GFP_HIGHMEM into dir mapping's gfp flag in f2fs_iget, so high address memory
could be used for these existing dir's page.
But we forgot to set flag for newly created dir, due to this reason, our newly
created dir pages could not be allocated from high address memory. Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a batched trimming feature, which submits split discard
commands.
This is to avoid long latency due to huge trim commands.
If fstrim was triggered ranging from 0 to the end of device, we should lock
all the checkpoint-related mutexes, resulting in very long latency.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch merges ->{invalidate,release}page function for meta/node/data pages.
After this, duplication of codes could be removed.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If device is read-only, we should not proceed data recovery.
But, if the previous checkpoint was done by normal clean shutdown, it's safe to
proceed the recovery, since there will be no data to be recovered.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds FASTBOOT flag into checkpoint as follows.
- CP_UMOUNT_FLAG is set when system is umounted.
- CP_FASTBOOT_FLAG is set when intermediate checkpoint having node summaries
was done.
So, if you get CP_UMOUNT_FLAG from checkpoint, the system was umounted cleanly.
Instead, if there was sudden-power-off, you can get CP_FASTBOOT_FLAG or nothing.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Do not change any partition when f2fs is changed to readonly mode.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a mount option, norecovery, which is mostly same as
disable_roll_forward. The only difference is that norecovery should be activated
with read-only mount option.
This can be used when user wants to check whether f2fs is mountable or not
without any recovery process. (e.g., xfstests/200)
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If wrong mount option was requested, f2fs tries to fill_super again.
But, during the next trial, f2fs has no valid mount options, since
parse_options deleted all the separators in the original string.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently, there are several variables with Boolean type as below:
struct f2fs_sb_info {
...
int s_dirty;
bool need_fsck;
bool s_closing;
...
bool por_doing;
...
}
For this there are some issues:
1. there are some space of f2fs_sb_info is wasted due to aligning after Boolean
type variables by compiler.
2. if we continuously add new flag into f2fs_sb_info, structure will be messed
up.
So in this patch, we try to:
1. switch s_dirty to Boolean type variable since it has two status 0/1.
2. merge s_dirty/need_fsck/s_closing/por_doing variables into s_flag.
3. introduce an enum type which can indicate different states of sbi.
4. use new introduced universal interfaces is_sbi_flag_set/{set,clear}_sbi_flag
to operate flags for sbi.
After that, above issues will be fixed.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use pointer parameter @wait to pass result in {in,de}create_sleep_time for
cleanup.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
1. make truncate_inline_date static;
2. remove parameter @from of truncate_inline_date as callers only pass zero.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
No modification in functionality, just clean codes with f2fs_radix_tree_insert.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this patch we add the FS_IOC_GETVERSION ioctl for getting i_generation from
inode, after that, users can list file's generation number by using "lsattr -v".
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
During the recovery, any xattr blocks should not be found, since they are
written into cold log, not the warm node chain.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We will encounter deadloop in below scenario:
1. increase page count for F2FS_DIRTY_META type in following path:
->recover_fsync_data
->recover_data
->do_recover_data
->recover_data_page
->change_curseg
->write_sum_page
->set_page_dirty
2. fail in recover_data()
3. invalidate meta pages in truncate_inode_pages_final without decreasing page
count.
4. deadloop when sync_meta_pages as page count will always be non-zero.
message:
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
[<c1129a37>] pagevec_lookup_tag+0x27/0x30
[<f0e774c7>] sync_meta_pages+0x87/0x160 [f2fs]
[<f0e86dd9>] recover_fsync_data+0xeb9/0xf10 [f2fs]
[<f0e75398>] f2fs_fill_super+0x888/0x980 [f2fs]
[<c11733ca>] mount_bdev+0x16a/0x1a0
[<f0e7180f>] f2fs_mount+0x1f/0x30 [f2fs]
[<c1173da6>] mount_fs+0x36/0x170
[<c118b6f5>] vfs_kern_mount+0x55/0xe0
[<c118d63f>] do_mount+0x1df/0x9f0
[<c118e110>] SyS_mount+0x70/0xb0
[<c15a0c48>] sysenter_do_call+0x12/0x12
To avoid page count leak, let's add ->invalidatepage and ->releasepage in
f2fs_meta_aops as f2fs_node_aops to release meta page count correctly.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If the previous checkpoint was done without CP_UMOUNT flag, it needs to do
checkpoint with CP_UMOUNT for the next fast boot.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to trigger checkpoint with umount flag when kill_sb was called.
In kill_sb, f2fs_sync_fs was finally called, but at this time, f2fs can't do
checkpoint with CP_UMOUNT.
After then, f2fs_put_super is not doing checkpoint, since it is not dirty.
So, this patch adds a flag to indicate f2fs_sync_fs is called during umount.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Our value of memory footprint statistics showed in debugfs is not calculated
correctly. Fix it in this patch.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If cp_error is set, we should avoid all the infinite loop.
In f2fs_sync_file, there is a hole, and this patch fixes that.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
fs/f2fs/trace.c:19:12: sparse: symbol 'pids_lock' was not declared. Should it be static?
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the normal case, the radix_tree_nodes are freed successfully.
But, when cp_error was detected, we should destroy them forcefully.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We use kzalloc to allocate memory in __recover_inline_status, and use this
all-zero memory to check the inline date content of inode page by comparing
them. This is low effective and not needed, let's check inline date content
directly.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: make the code more neat]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch aligns the start block address of a file for direct io to the f2fs's
section size.
Some flash devices manage an over 4KB-sized page as a write unit, and if the
direct_io'ed data are written but not aligned to that unit, the performance can
be degraded due to the partial page copies.
Thus, since f2fs has a section that is well aligned to FTL units, we can align
the block address to the section size so that f2fs avoids this misalignment.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are two slab cache inode_entry_slab and winode_slab using the same
structure as below:
struct dir_inode_entry {
struct list_head list; /* list head */
struct inode *inode; /* vfs inode pointer */
};
struct inode_entry {
struct list_head list;
struct inode *inode;
};
It's a little waste that the two cache can not share their memory space for each
other.
So in this patch we remove one redundant winode_slab slab cache, then use more
universal name struct inode_entry as remaining data structure name of slab,
finally we reuse the inode_entry_slab to store dirty dir item and gc item for
more effective.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cleanup parameters for trace_f2fs_submit_{read_,write_,page_,page_m}bio with fio
as one parameter.
Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds missing parameter _type_ for trace_f2fs_submit_page_bio, then
use DECLARE_EVENT_CLASS/DEFINE_EVENT_CONDITION pair to cleanup some trace event
code related to f2fs_submit_page_{m,}bio.
Additionally, after we remove redundant code, size of code can be reduced:
text data bss dec hex filename
176787 8712 56 185555 2d4d3 f2fs.ko.org
174408 8648 56 183112 2cb48 f2fs.ko
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In do_recover_data, we find and update previous node pages after updating
its new block addresses.
After then, we call fill_node_footer without reset field, we erase its
cold bit so that this new cold node block is written to wrong log area.
This patch fixes not to miss its old flag.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds block count by in-place-update in stat.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The __f2fs_add_link is covered by cp_rwsem all the time.
This calls init_inode_metadata, which conducts some acl operations including
memory allocation with GFP_KERNEL previously.
But, under memory pressure, f2fs_write_data_page can be called, which also
grabs cp_rwsem too.
In this case, this incurs a deadlock pointed by Chao.
Thread #1 Thread #2
down_read
down_write
down_read
-> here down_read should wait forever.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds two key functions to trace process ids and IOs.
The basic idea is to
1. remain process ids, pids, in page->private.
2. show pids in IO traces.
So, later we can retrieve process information according to IO traces.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds:
o initial trace.c and trace.h with skeleton functions
o Kconfig and Makefile to activate this feature
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch cleans up parameters on IO paths.
The key idea is to use f2fs_io_info adding a parameter, block address, and then
use this structure as parameters.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use more common function ra_meta_pages() with META_POR to readahead node blocks
in restore_node_summary() instead of ra_sum_pages(), hence we can simplify the
readahead code there, and also we can remove unused function ra_sum_pages().
changes from v2:
o use invalidate_mapping_pages as before suggested by Changman Lee.
changes from v1:
o fix one bug when using truncate_inode_pages_range which is pointed out by
Jaegeuk Kim.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch moves one member of struct nat_entry: _flag_ to struct node_info,
so _version_ in struct node_info and _flag_ which are unsigned char type will
merge to one 32-bit space in register/memory. So the size of nat_entry will be
reduced from 28 bytes to 24 bytes (for 64-bit machine, reduce its size from 40
bytes to 32 bytes) and then slab memory using by f2fs will be reduced.
changes from v2:
o update description of memory usage gain for 64-bit machine suggested by
Changman Lee.
changes from v1:
o introduce inline copy_node_info() to copy valid data from node info suggested
by Jaegeuk Kim, it can avoid bug.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's add readahead code for reading contiguous compact/normal summary blocks
in checkpoint, then we will gain better performance in mount procedure.
Changes from v1
o remove inappropriate 'unlikely' in npages_for_summary_flush.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now we use inmemory pages for atomic write only and provide abort procedure,
we don't need to truncate them explicitly.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The ckpt_valid_map and cur_valid_map are synced by seg_info_to_raw_sit.
In the case of small discards, the candidates are selected before sync,
while fitrim selects candidates after sync.
So, for small discards, we need to add candidates only just being obsoleted.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds two new ioctls to release inmemory pages grabbed by atomic
writes.
o f2fs_ioc_abort_volatile_write
- If transaction was failed, all the grabbed pages and data should be written.
o f2fs_ioc_release_volatile_write
- This is to enhance the performance of PERSIST mode in sqlite.
In order to avoid huge memory consumption which causes OOM, this patch changes
volatile writes to use normal dirty pages, instead blocked flushing to the disk
as long as system does not suffer from memory pressure.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't need to force to write dirty_exceeded for f2fs_balance_fs_bg.
This flag was only meaningful to write bypassing conditions.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
To improve recovery speed, f2fs try to readahead many contiguous blocks in warm
node segment, but for most time, abnormal power-off do not occur frequently, so
when mount a normal power-off f2fs image, by contrary ra so many blocks and then
invalid them will hurt the performance of mount.
It's better to just ra the first next-block for normal condition.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch does cleanup work, it introduces is_valid_blkaddr() to include
verification code for blkaddr with upper and down boundary value which were in
ra_meta_pages previous.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
1.We use zero as upper boundary value for ra SSA/CP blocks, we will skip
readahead as verification failure with max number, it causes low performance.
2.Low boundary value is not accurate for SSA/CP/POR region verification, so
these values need to be redefined.
This patch fixes above issues.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As inline_{dir,inode} stat is increased/decreased concurrently by multi threads,
so the value is not so accurate, let's use atomic type for counting accurately.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Added some commentaries for code readability and cleaned up if-statement
clearly.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If inode state is dirty, go straight to write.
Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The inmemory pages should be handled by invalidate_page since it needs to be
released int the truncation path.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In do_read_inode, if we failed __recover_inline_status, the inode has inline
flag without increasing its count.
Later, f2fs_evict_inode will decrease the count, which causes -1.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch revists retrial paths in f2fs.
The basic idea is to use cond_resched instead of retrying from the very early
stage.
Suggested-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to fix:
BUG: using smp_processor_id() in preemptible [00000000] code: f2fs_gc-254:0/384
(radix_tree_node_alloc+0x14/0x74) from [<c033d8a0>] (radix_tree_insert+0x110/0x200)
(radix_tree_insert+0x110/0x200) from [<c02e8264>] (gc_data_segment+0x340/0x52c)
(gc_data_segment+0x340/0x52c) from [<c02e8658>] (f2fs_gc+0x208/0x400)
(f2fs_gc+0x208/0x400) from [<c02e8a98>] (gc_thread_func+0x248/0x28c)
(gc_thread_func+0x248/0x28c) from [<c0139944>] (kthread+0xa0/0xac)
(kthread+0xa0/0xac) from [<c0105ef8>] (ret_from_fork+0x14/0x3c)
The reason is that f2fs calls radix_tree_insert under enabled preemption.
So, before calling it, we need to call radix_tree_preload.
Otherwise, we should use _GFP_WAIT for the radix tree, and use mutex or
semaphore to cover the radix tree operations.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previoulsy, we used rwlock for nat_entry lock.
But, now we have a lot of complex operations in set_node_addr.
(e.g., allocating kernel memories, handling radix_trees, and so on)
So, this patches tries to change spinlock to rw_semaphore to give CPUs to other
threads.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there are many inodes that have data blocks in victim segment,
it takes long time to find a inode in gc_inode list.
Let's use radix_tree to reduce lookup time.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We've already made fi and sbi for inode. Let's avoid duplicated work.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix the wrong error number in error path of f2fs_write_begin.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Little cleanup to distinguish each phase easily
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: modify indentation for code readability]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If an inode has converted inline_data which was written to the disk, we should
set its inode flag for further fsync so that this inline_data can be recovered
from sudden power off.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After flushing dirty nat entries, it has to be no more dirty nat
entries.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It's meaningless to check dirty_nat_cnt after re-dirtying nat entries in
journal. And although there are rooms for dirty nat entires if dirty_nat_cnt
is zero, it's also meaningless to check __has_cursum_space.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
A deadlock can be occurred:
Thread 1] Thread 2]
- f2fs_write_data_pages - f2fs_write_begin
- lock_page(page #0)
- grab_cache_page(page #X)
- get_node_page(inode_page)
- grab_cache_page(page #0)
: to convert inline_data
- f2fs_write_data_page
- f2fs_write_inline_data
- get_node_page(inode_page)
In this case, trying to lock inode_page and page #0 causes deadlock.
In order to avoid this, this patch adds a rule for this locking policy,
which is that page #0 should be locked followed by inode_page lock.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Two jump labels were adjusted in the implementation of the
create_node_manager_caches() function because these identifiers
contained typos.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_evict_inode,
commit_inmemory_pages
f2fs_gc
f2fs_iget
iget_locked
-> wait for inode free
Here, if the inode is same as the one to be evicted, f2fs should wait forever.
Actually, we should not call f2fs_balance_fs during f2fs_evict_inode to avoid
this.
But, the commit_inmem_pages calls f2fs_balance_fs by default, even if
f2fs_evict_inode wants to free inmemory pages only.
Hence, this patch adds to trigger f2fs_balance_fs only when there is something
to write.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It used nat_entry_set when create slab for sit_entry_set.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a node page is request to be written during the reclaiming path, we should
submit the bio to avoid pending to recliam it.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now in f2fs, we have three inode cache: ORPHAN_INO, APPEND_INO, UPDATE_INO,
and we manage fields related to inode cache separately in struct f2fs_sb_info
for each inode cache type.
This makes codes a bit messy, so that this patch intorduce a new struct
inode_management to wrap inner fields as following which make codes more neat.
/* for inner inode cache management */
struct inode_management {
struct radix_tree_root ino_root; /* ino entry array */
spinlock_t ino_lock; /* for ino entry lock */
struct list_head ino_list; /* inode list head */
unsigned long ino_num; /* number of entries */
};
struct f2fs_sb_info {
...
struct inode_management im[MAX_INO_ENTRY]; /* manage inode cache */
...
}
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Because we have checked the contrary condition in case of "if" judgment, we do
not need to check the condition again in case of "else" judgment. Let's remove
it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_remount, we will stop gc thread and set need_restart_gc as true when new
option is set without BG_GC, then if any error occurred in the following
procedure, we can restore to start the gc thread.
But after that, We will fail to restore gc thread in start_gc_thread as BG_GC is
not set in new option, so we'd better move this condition judgment out of
start_gc_thread to fix this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The locked page should be released before returning the function.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If i_size becomes large outside of MAX_INLINE_DATA, we shoud convert the inode.
Otherwise, we can make some dirty pages during the truncation, and those pages
will be written through f2fs_write_data_page.
At that moment, the inode has still inline_data, so that it tries to write non-
zero pages into inline_data area.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The scenario is like this.
One trhead triggers:
f2fs_write_data_pages
lock_page
f2fs_write_data_page
f2fs_lock_op <- wait
The other thread triggers:
f2fs_truncate
truncate_blocks
f2fs_lock_op
truncate_partial_data_page
lock_page <- wait for locking the page
This patch resolves this bug by relocating truncate_partial_data_page.
This function is just to truncate user data page and not related to FS
consistency as well.
And, we don't need to call truncate_inline_data. Rather than that,
f2fs_write_data_page will finally update inline_data later.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The # of inline_data inode is decreased only when it has inline_data.
After clearing the flag, we can't decreased the number.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It needs to write node pages if checkpoint is not doing in order to avoid
memory pressure.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to control the memory footprint used by ino entries.
This will conduct best effort, not strictly.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a system wants to reduce the booting time as a top priority, now we can
use a mount option, -o fastboot.
With this option, f2fs conducts a little bit slow write_checkpoint, but
it can avoid the node page reads during the next mount time.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
__submit_merged_bio f2fs_write_end_io f2fs_write_end_io
wait_io = X wait_io = x
complete(X) complete(X)
wait_io = NULL
wait_for_completion()
free(X)
spin_lock(X)
kernel panic
In order to avoid this, this patch removes the wait_io facility.
Instead, we can use wait_on_all_pages_writeback(sbi) to wait for end_ios.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is a chance to make a huge sized discard command, we don't need
to split it out, since each blkdev_issue_discard should wait one at a
time.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch simplifies the inline_data usage with the following rule.
1. inline_data is set during the file creation.
2. If new data is requested to be written ranges out of inline_data,
f2fs converts that inode permanently.
3. There is no cases which converts non-inline_data inode to inline_data.
4. The inline_data flag should be changed under inode page lock.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There's no point in using test_and_clear_bit_le() when we don't use the
return value of the function. Just use clear_bit_le() instead.
Coverity-id: 1016434
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_write_begin() doesn't initialize the 'dn' variable if the inode has
inline data. However it uses its contents to decide whether it should
just zero out the page or load data to it. Thus if we are unlucky we can
zero out page contents instead of loading inline data into a page.
CC: stable@vger.kernel.org
CC: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Rename f2fs_set/clear_bit to f2fs_test_and_set/clear_bit, which mean
set/clear bit and return the old value, for better readability.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Set raw_super default to NULL to avoid the possibly used
uninitialized warning, though we may never hit it in fact.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce f2fs_change_bit to simplify the change bit logic in
function set_to_next_nat{sit}.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use clear_inode_flag to replace the redundant cond_clear_inode_flag.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove the unneeded argument 'type' from __get_victim, use
NO_CHECK_TYPE directly when calling v_ops->get_victim().
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If user specifies too low end sector for trimming, f2fs_trim_fs() will
use uninitialized value as a number of trimmed blocks and returns it to
userspace. Initialize number of trimmed blocks early to avoid the
problem.
Coverity-id: 1248809
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch declares f2fs_convert_inline_dir as a static function, which was
reported by kbuild test robot.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_dentry_ptr structure for the use of a function
parameter in inline_dentry operations.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a core function, f2fs_fill_dentries, to remove
redundant code in f2fs_readdir and f2fs_read_inline_dir.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, init_inode_metadata does not hold any parent directory's inode
page. So, f2fs_init_acl can grab its parent inode page without any problem.
But, when we use inline_dentry, that page is grabbed during f2fs_add_link,
so that we can fall into deadlock condition like below.
INFO: task mknod:11006 blocked for more than 120 seconds.
Tainted: G OE 3.17.0-rc1+ #13
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mknod D ffff88003fc94580 0 11006 11004 0x00000000
ffff880007717b10 0000000000000002 ffff88003c323220 ffff880007717fd8
0000000000014580 0000000000014580 ffff88003daecb30 ffff88003c323220
ffff88003fc94e80 ffff88003ffbb4e8 ffff880007717ba0 0000000000000002
Call Trace:
[<ffffffff8173dc40>] ? bit_wait+0x50/0x50
[<ffffffff8173d4cd>] io_schedule+0x9d/0x130
[<ffffffff8173dc6c>] bit_wait_io+0x2c/0x50
[<ffffffff8173da3b>] __wait_on_bit_lock+0x4b/0xb0
[<ffffffff811640a7>] __lock_page+0x67/0x70
[<ffffffff810acf50>] ? autoremove_wake_function+0x40/0x40
[<ffffffff811652cc>] pagecache_get_page+0x14c/0x1e0
[<ffffffffa029afa9>] get_node_page+0x59/0x130 [f2fs]
[<ffffffffa02a63ad>] read_all_xattrs+0x24d/0x430 [f2fs]
[<ffffffffa02a6ca2>] f2fs_getxattr+0x52/0xe0 [f2fs]
[<ffffffffa02a7481>] f2fs_get_acl+0x41/0x2d0 [f2fs]
[<ffffffff8122d847>] get_acl+0x47/0x70
[<ffffffff8122db5a>] posix_acl_create+0x5a/0x150
[<ffffffffa02a7759>] f2fs_init_acl+0x29/0xcb [f2fs]
[<ffffffffa0286a8d>] init_inode_metadata+0x5d/0x340 [f2fs]
[<ffffffffa029253a>] f2fs_add_inline_entry+0x12a/0x2e0 [f2fs]
[<ffffffffa0286ea5>] __f2fs_add_link+0x45/0x4a0 [f2fs]
[<ffffffffa028b5b6>] ? f2fs_new_inode+0x146/0x220 [f2fs]
[<ffffffffa028b816>] f2fs_mknod+0x86/0xf0 [f2fs]
[<ffffffff811e3ec1>] vfs_mknod+0xe1/0x160
[<ffffffff811e4b26>] SyS_mknod+0x1f6/0x200
[<ffffffff81741d7f>] tracesys+0xe1/0xe6
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add inline dir functions into normal dir ops' function to handle inline ops.
Besides, we enable inline dir mode when a new dir inode is created if
inline_data option is on.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Adds Functions to implement inline dir init/lookup/insert/delete/convert ops.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: remove needless reserved area copy, pointed by Dan Carpenter]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch exports some dir operations for inline dir, additionally introduces
f2fs_drop_nlink from f2fs_delete_entry for reusing by inline dir function.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch defines macro/inline dentry structure, and adds some helpers for
inline dir infrastructure.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The sceanrio is like this.
inline_data i_size page write_begin/vm_page_mkwrite
X 30 dirty_page
X 30 write to #4096 position
X 30 get_dnode_of_data wait for get_dnode_of_data
O 30 write inline_data
O 30 get_dnode_of_data
O 30 reserve data block
..
In this case, we have #0 = NEW_ADDR and inline_data as well.
We should not allow this condition for further access.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's consider the following scenario.
blkaddr[0] inline_data i_size i_blocks writepage truncate
NEW X 4096 2 dirty page #0
NEW X 0 change i_size
NEW X 0 2 f2fs_write_inline_data
NEW X 0 2 get_dnode_of_data
NEW X 0 2 truncate_data_blocks_range
NULL O 0 1 memcpy(inline_data)
NULL O 0 1 f2fs_put_dnode
NULL O 0 1 f2fs_truncate
NULL O 0 1 get_dnode_of_data
NULL O 0 1 *invalid block addr*
This patch adds checking inline_data flag during f2fs_truncate not to refer
corrupted block indices.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When trying to write inline_data, we should truncate any data block allocated
and pointed by the inode block.
We should consider the data index is not 0.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds support for volatile writes which keep data pages in memory
until f2fs_evict_inode is called by iput.
For instance, we can use this feature for the sqlite database as follows.
While supporting atomic writes for main database file, we can keep its journal
data temporarily in the page cache by the following sequence.
1. open
-> ioctl(F2FS_IOC_START_VOLATILE_WRITE);
2. writes
: keep all the data in the page cache.
3. flush to the database file with atomic writes
a. ioctl(F2FS_IOC_START_ATOMIC_WRITE);
b. writes
c. ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
4. close
-> drop the cached data
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a very limited functionality for atomic write support.
In order to support atomic write, this patch adds two ioctls:
o F2FS_IOC_START_ATOMIC_WRITE
o F2FS_IOC_COMMIT_ATOMIC_WRITE
The database engine should be aware of the following sequence.
1. open
-> ioctl(F2FS_IOC_START_ATOMIC_WRITE);
2. writes
: all the written data will be treated as atomic pages.
3. commit
-> ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
: this flushes all the data blocks to the disk, which will be shown all or
nothing by f2fs recovery procedure.
4. repeat to #2.
The IO pattens should be:
,- START_ATOMIC_WRITE ,- COMMIT_ATOMIC_WRITE
CP | D D D D D D | FSYNC | D D D D | FSYNC ...
`- COMMIT_ATOMIC_WRITE
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
My static checker complains that segment is a u64 but only the lower 31
bits can be used before we hit a shift wrapping bug.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch relocates f2fs_unlock_op in every directory operations to be called
after any error was processed.
Otherwise, the checkpoint can be entered with valid node ids without its
dentry when -ENOSPC is occurred.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, f2fs tries to reorganize the dirty nat entries into multiple sets
according to its nid ranges. This can improve the flushing nat pages, however,
if there are a lot of cached nat entries, it becomes a bottleneck.
This patch introduces a new set management flow by removing dirty nat list and
adding a series of set operations when the nat entry becomes dirty.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces FITRIM in f2fs_ioctl.
In this case, f2fs will issue small discards and prefree discards as many as
possible for the given area.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch add a new data structure to control checkpoint parameters.
Currently, it presents the reason of checkpoint such as is_umount and normal
sync.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, f2fs activates SSR if the # of free segments reaches to the # of
overprovisioned segments.
In this case, SSR starts to use dirty segments only, so that the overprovisoned
space cannot be selected for new data.
This means that we have no chance to utilizae the overprovisioned space at all.
This patch fixes that by allowing LFS allocations until the # of free segments
reaches to the last threshold, reserved space.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch changes the ipu_policy setting to use any combination of orthogonal policies.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In ->get_victim we get max_search value from dirty_i->nr_dirty without
protection of seglist_lock, after that, nr_dirty can be increased/decreased
before we hold seglist_lock lock.
Then in main loop we attempt to traverse all dirty section one time to find
victim section, but it's not accurate to use max_search as the total loop count,
because we might lose checking several sections or check sections redundantly
for the case of nr_dirty are increased or decreased previously.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In manual of mount, we descript remount as below:
"mount -o remount,rw /dev/foo /dir
After this call all old mount options are replaced and arbitrary stuff from
fstab is ignored, except the loop= option which is internally generated and
maintained by the mount command."
Previously f2fs do not clear up old mount options when remount_fs, so we have no
chance of disabling previous option (e.g. flush_merge). Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now punching hole in directory is not supported in f2fs, so let's limit file
type in punch_hole().
In addition, in punch_hole if offset is exceed file size, we should skip
punching hole.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Block size in f2fs is 4096 bytes, so theoretically, f2fs can support 4096 bytes
sector device at maximum. But now f2fs only support 512 bytes size sector, so
block device such as zRAM which uses page cache as its block storage space will
not be mounted successfully as mismatch between sector size of zRAM and sector
size of f2fs supported.
In this patch we support large sector size in f2fs, so block device with sector
size of 512/1024/2048/4096 bytes can be supported in f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
By using FALLOC_FL_KEEP_SIZE in ->fallocate of f2fs, we can fallocate block past
EOF without changing i_size of inode. These blocks past EOF will not be
truncated in ->setattr as we truncate them only when change the file size.
We should give a chance to truncate blocks out of filesize in setattr().
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The f2fs_direct_IO uses __allocate_data_block, but inside the allocation path,
we should update i_size at the changed time to update its inode page.
Otherwise, we can get wrong i_size after roll-forward recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If same data is updated multiple times, we don't need to redo whole the
operations.
Let's just update the lastest one.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_sync_file, if there is no written appended writes, it skips
to write its node blocks.
But, if there is up-to-date inode page, we should write it to update
its metadata during the roll-forward recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can summarize the roll forward recovery scenarios as follows.
[Term] F: fsync_mark, D: dentry_mark
1. inode(x) | CP | inode(x) | dnode(F)
-> Update the latest inode(x).
2. inode(x) | CP | inode(F) | dnode(F)
-> No problem.
3. inode(x) | CP | dnode(F) | inode(x)
-> Recover to the latest dnode(F), and drop the last inode(x)
4. inode(x) | CP | dnode(F) | inode(F)
-> No problem.
5. CP | inode(x) | dnode(F)
-> The inode(DF) was missing. Should drop this dnode(F).
6. CP | inode(DF) | dnode(F)
-> No problem.
7. CP | dnode(F) | inode(DF)
-> If f2fs_iget fails, then goto next to find inode(DF).
8. CP | dnode(F) | inode(x)
-> If f2fs_iget fails, then goto next to find inode(DF).
But it will fail due to no inode(DF).
So, this patch adds some missing points such as #1, #5, #7, and #8.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch revisited whole the recovery information during the f2fs_sync_file.
In this patch, there are three information to make a decision.
a) IS_CHECKPOINTED, /* is it checkpointed before? */
b) HAS_FSYNCED_INODE, /* is the inode fsynced before? */
c) HAS_LAST_FSYNC, /* has the latest node fsync mark? */
And, the scenarios for our rule are based on:
[Term] F: fsync_mark, D: dentry_mark
1. inode(x) | CP | inode(x) | dnode(F)
2. inode(x) | CP | inode(F) | dnode(F)
3. inode(x) | CP | dnode(F) | inode(x) | inode(F)
4. inode(x) | CP | dnode(F) | inode(F)
5. CP | inode(x) | dnode(F) | inode(DF)
6. CP | inode(DF) | dnode(F)
7. CP | dnode(F) | inode(DF)
8. CP | dnode(F) | inode(x) | inode(DF)
For example, #3, the three conditions should be changed as follows.
inode(x) | CP | dnode(F) | inode(x) | inode(F)
a) x o o o o
b) x x x x o
c) x o o x o
If f2fs_sync_file stops ------^,
it should write inode(F) --------------^
So, the need_inode_block_update should return true, since
c) get_nat_flag(e, HAS_LAST_FSYNC), is false.
For example, #8,
CP | alloc | dnode(F) | inode(x) | inode(DF)
a) o x x x x
b) x x x o
c) o o x o
If f2fs_sync_file stops -------^,
it should write inode(DF) --------------^
Note that, the roll-forward policy should follow this rule, which means,
if there are any missing blocks, we doesn't need to recover that inode.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a flag in the nat entry structure to merge various
information such as checkpointed and fsync_done marks.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, all the dnode pages should be read during the roll-forward recovery.
Even worsely, whole the chain was traversed twice.
This patch removes that redundant and costly read operations by using page cache
of meta_inode and readahead function as well.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If the inode is same and its data index are needed to truncate, we can fall into
double lock for its inode page via get_dnode_of_data.
Error case is like this.
1. write data 1, 2, 3, 4, 5 in inode #4.
2. write data 100, 102, 103, 104, 105 in dnode #6 of inode #4.
3. sync
4. update data 100->106 in dnode #6.
5. fsync inode #4.
6. power-cut
-> Then,
1. go back to #3's checkpoint
2. in do_recover_data, get_dnode_of_data() gets inode #4.
3. detect 100->106 in dnode #6.
4. check_index_in_prev_nodes tries to truncate 100 in dnode #6.
5. to trigger truncate_hole, get_dnode_of_data should grab inode #4.
6. detect *kernel hang*
This patch should resolve that bug.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The nm_i->fcnt checking is executed before spin_lock, so if another
thread delete the last free_nid from the list, the wrong nid may be
gotten. So fix the race condition by moving the nm_i->fnct checking
into spin_lock.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, if there is no free nid in nm_i->free_nid_list, 0 may be saved
into next_free_nid of checkpoint, this may cause useless scanning for
next mount. nm_i->next_scan_nid should be a better default value than
0.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If user wrote F2FS_IPU_FSYNC:4 in /sys/fs/f2fs/ipu_policy, f2fs_sync_file
only starts to try in-place-updates.
And, if the number of dirty pages is over /sys/fs/f2fs/min_fsync_blocks, it
keeps out-of-order manner. Otherwise, it triggers in-place-updates.
This may be used by storage showing very high random write performance.
For example, it can be used when,
Seq. writes (Data) + wait + Seq. writes (Node)
is pretty much slower than,
Rand. writes (Data)
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously f2fs only counts dirty dentry pages, but there is no reason not to
expand the scope.
This patch changes the names on the management of dirty pages and to count
dirty pages in each inode info as well.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If application throws negative value of lseek with SEEK_DATA|SEEK_HOLE,
previous f2fs went into BUG_ON in get_dnode_of_data, which was reported
by Tommi Rantala.
He could make a simple code to detect this having:
lseek(fd, -17595150933902LL, SEEK_DATA);
This patch should resolve that bug.
Reported-by: Tommi Rentala <tt.rantala@gmail.com>
[Jaegeuk Kim: relocate the condition as suggested by Chao]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In gc_node_segment, if node page gc is run concurrently with node page
writeback, and check_valid_map and get_node_page run after page locked
and before cur_valid_map is updated as below, it is possible for the
page to be written twice unnecessarily.
sync_node_pages
try_lock_page
...
check_valid_map f2fs_write_node_page
...
write_node_page
do_write_page
allocate_data_block
...
refresh_sit_entry /* update cur_valid_map */
...
...
unlock_page
get_node_page
...
set_page_dirty
...
f2fs_put_page
unlock_page
This can be solved via calling check_valid_map after get_node_page again.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We use flush cmd control to collect many flush cmds, and flush them
together. In this case, we use two list to manage the flush cmds
(collect and dispatch), and one spin lock is used to protect this.
In fact, the lock-less list(llist) is very suitable to this case,
and we use simplify this routine.
-
v2:
-use llist_for_each_entry_safe to fix possible use-after-free issue.
-remove the unused field from struct flush_cmd.
Thanks for Yu's suggestion.
-
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In commit aec71382c6 ("f2fs: refactor flush_nat_entries codes for reducing NAT
writes"), we descripte the issue as below:
"Although building NAT journal in cursum reduce the read/write work for NAT
block, but previous design leave us lower performance when write checkpoint
frequently for these cases:
1. if journal in cursum has already full, it's a bit of waste that we flush all
nat entries to page for persistence, but not to cache any entries.
2. if journal in cursum is not full, we fill nat entries to journal util
journal is full, then flush the left dirty entries to disk without merge
journaled entries, so these journaled entries may be flushed to disk at next
checkpoint but lost chance to flushed last time."
Actually, we have the same problem in using SIT journal area.
In this patch, firstly we will update sit journal with dirty entries as many as
possible. Secondly if there is no space in sit journal, we will remove all
entries in journal and walk through the whole dirty entry bitmap of sit,
accounting dirty sit entries located in same SIT block to sit entry set. All
entry sets are linked to list sit_entry_set in sm_info, sorted ascending order
by count of entries in set. Later we flush entries in set which have fewest
entries into journal as many as we can, and then flush dense set with merged
entries to disk.
In this way we can use sit journal area more effectively, also we will reduce
SIT update, result in gaining in performance and saving lifetime of flash
device.
In my testing environment, it shows this patch can help to reduce SIT block
update obviously.
virtual machine + hard disk:
fsstress -p 20 -n 400 -l 5
sit page num cp count sit pages/cp
based 2006.50 1349.75 1.486
patched 1566.25 1463.25 1.070
Our latency of merging op is small when handling a great number of dirty SIT
entries in flush_sit_entries:
latency(ns) dirty sit count
36038 2151
49168 2123
37174 2232
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
sit_i in macro SIT_BLOCK_OFFSET/START_SEGNO is not used, remove it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch replaces BUG cases with f2fs_bug_on to remain fsck.f2fs information.
And it implements some void functions to initiate fsck.f2fs too.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As the race condition on the inode cache, following scenario can appear:
[Thread a] [Thread b]
->f2fs_mkdir
->f2fs_add_link
->__f2fs_add_link
->init_inode_metadata failed here
->gc_thread_func
->f2fs_gc
->do_garbage_collect
->gc_data_segment
->f2fs_iget
->iget_locked
->wait_on_inode
->unlock_new_inode
->move_data_page
->make_bad_inode
->iput
When we fail in create/symlink/mkdir/mknod/tmpfile, the new allocated inode
should be set as bad to avoid being accessed by other thread. But in above
scenario, it allows f2fs to access the invalid inode before this inode was set
as bad.
This patch fix the potential problem, and this issue was found by code review.
change log from v1:
o Add condition judgment in gc_data_segment() suggested by Changman Lee.
o use iget_failed to simplify code.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The dentry name type is unsigned char *.
If we don't match this type, some character codes can be changed by signed bit.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can make the code a bit simpler because we know that "!retry" is
zero.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This verifies to truncate any allocated blocks, offset[0], by inline_data.
Not figured out, but for making sure.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces DEF_NIDS_PER_INODE/GET_ORPHAN_BLOCKS/F2FS_CP_PACKS macro
instead of numbers in code for readability.
change log from v1:
o fix typo pointed out by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduce need_do_checkpoint() to include numerous judgment condition
for readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Theoretically, our total inodes number is the same as total node number, but
there are three node ids are reserved in f2fs, they are 0, 1 (node nid), and 2
(meta nid), and they should never be used by user, so our total/free inode
number calculated in ->statfs is wrong.
This patch indroduces F2FS_RESERVED_NODE_NUM and then fixes this issue by
recalculating total/free inode number with the macro.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch checks inline_data one more time under the inode page lock whether
its inline_data is converted or not.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
I think we need to let the dirty node pages remain in the page cache instead
of rewriting them in their places.
So, after done with successful recovery, write_checkpoint will flush all of them
through the normal write path.
Through this, we can avoid potential error cases in terms of block allocation.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The init_inode_metadata calls truncate_blocks when error is occurred.
The callers holds f2fs_lock_op, so we should not call it again in
truncate_blocks.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Any checkpoint should not be done during the core roll-forward procedure.
Especially, it includes error cases too.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are two rules when EIO is occurred.
1. don't write any checkpoint data to preserve the previous checkpoint
2. don't lose the cached dentry/node/meta pages
So, at first, this patch adds set_page_dirty in f2fs_write_end_io's failure.
Then, writing checkpoint/dentry/node blocks is not allowed.
Note that, for the data pages, we can't just throw away by redirtying them.
Otherwise, kworker can fall into infinite loop to flush them.
(Ref. xfstests/019)
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It needs to check s_dirty under cp_mutex, since s_dirty is reset under that
mutex.
And previous condition was not correct, since we can omit doing checkpoint
when checkpoint was done followed by all the node pages were written back.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch gives another chance to try mount process when we encounter an error.
This makes an effect on the roll-forward recovery failures as well.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The generic_shutdown_super calls sync_filesystem, evict_inode, and then
f2fs_put_super. In f2fs_evict_inode, we remain some dirty inode information
so we should release them at f2fs_put_super.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This is the errorneous scenario.
1. write data
2. do checkpoint
3. produce some dirty node pages by the gc thread
4. write back dirty node pages
5. f2fs_put_super will skip the checkpoint, since dirty count for node pages is
zero.
This patch removes such the wrong condition check.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes not to skip xattr recovery and inline xattr/data recovery
order.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
During the recovery, we should clear the inline_xattr flag if its xattr node
block is recovered.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If an inode are fsynced multiple times with fsync & dent marks, this inode will
set FI_INC_LINK at find_fsync_dnodes during the recovery.
But, in recover_inode, recover_dentry doesn't clear that flag when multiple hits
were occurred.
So this patch removes the flag for the further consistency.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a new inode page is needed for recover_dentry, we should assing i_inline
as zero.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a parentheses to make clear for condition check.
And also it changes the return type for better meanings.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If mkwrite is called to an inode having inline_data, it can overwrite the data
index space as NEW_ADDR. (e.g., the first 4 bytes are coincidently zero)
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix typo and some grammatical errors.
The words "filesystem" and "readahead" are being used without the space treewide.
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch uses for_each_set_bit to simplify some codes in f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds f2fs_balance_fs in expand_inode_data to avoid allocation failure
with segment.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When inode is evicted, all the page cache belong to this inode should be
released including the xattr node page. But previously we didn't do this, this
patch fixed this issue.
v2:
o reposition invalidate_mapping_pages() to the right place suggested by
Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When we recover data of inode in roll-forward procedure, and the inode has both
inline data and inline xattr. We may skip recovering inline xattr if we recover
inline data form node page first.
This patch will fix the problem that we lost inline xattr data in above
scenario.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We do not need to block on ->node_write among different node page writers e.g.
fsync/flush, unless we have a node page writer from write_checkpoint.
So it's better use rw_semaphore instead of mutex type for ->node_write to
promote performance.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are redundant lines in allocate_data_block.
In this function, we call refresh_sit_entry with old seg and old curseg.
After that, we call locate_dirty_segment with old curseg.
But, the new address is always allocated from old curseg and
we call locate_dirty_segment with old curseg in refresh_sit_entry.
So, we do not need to call locate_dirty_segment with old curseg again.
We've discussed like below:
Jaegeuk said:
"When considering SSR, we need to take care of the following scenario.
- old segno : X
- new address : Z
- old curseg : Y
This means, a new block is supposed to be written to Z from X.
And Z is newly allocated in the same path from Y.
In that case, we should trigger locate_dirty_segment for Y, since
it was a current_segment and can be dirty owing to SSR.
But that was not included in the dirty list."
Changman said:
"We already choosed old curseg(Y) and then we allocate new address(Z) from old
curseg(Y). After that we call refresh_sit_entry(old address, new address).
In the funcation, we call locate_dirty_segment with old seg and old curseg.
So calling locate_dirty_segment after refresh_sit_entry again is redundant."
Jaegeuk said:
"Right. The new address is always allocated from old_curseg."
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Dongho Sim <dh.sim@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If the bit is already set, we don't need to reset it, and vice versa.
Because we don't need to make the caches dirty for that.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch enforces in-place-updates only when fdatasync is requested.
If we adopt this in-place-updates for the fdatasync, we can skip to write the
recovery information.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch intends to improve the fsync performance by skipping remaining the
recovery information, only when there is no data that we should recover.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a inode number list in which represents inodes having
appended data writes or updated data writes after last checkpoint.
This will be used at fsync to determine whether the recovery information
should be written or not.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For better ino management, this patch replaces the data structure from list
to radix tree.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch changes the naming of orphan-related data structures to use as
inode numbers managed globally.
Later, we can use this facility for managing any inode number lists.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch punches out the core functions to manage the inode numbers.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a mount option, nobarrier, in f2fs.
The assumption in here is that file system keeps the IO ordering, but
doesn't care about cache flushes inside the storages.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should put root inode correctly in error path of fill_super, otherwise we
may encounter a leak case of inode resource.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now new interface ->rename2() is added to VFS, here are related description:
https://lkml.org/lkml/2014/2/7/873https://lkml.org/lkml/2014/2/7/758
This patch adds function f2fs_rename2() to support ->rename2() including
handling both RENAME_EXCHANGE and RENAME_NOREPLACE flag.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, if a large amount of direct IO writes were done, the
segment allocation may be failed because no enough segments are gced.
Changes:
v2: add f2fs_balance_fs into __get_data_block instead of f2fs_direct_IO.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In __set_test_and_free we will check whether all segment are free in one section
When free one segment, in order to set section to free status.
But the searching region of segmap is from start segno to last segno of f2fs,
it's not necessary. So let's just only check all segment bitmap of target
section.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We assume that modification of some special application could result in zeroed
name_len, or it is consciously made by somebody. We will deadloop in
find_in_block when name_len of dir entry is zero.
This patch is added for preventing deadloop in above scenario.
change log from v1:
o use f2fs_bug_on rather than break out from searching dir entry suggested by
Jaegeuk Kim.
Jaegeuk describe:
"Well, IMO, it would be good to add f2fs_bug_on() here with a specific comment.
In the current phase of f2fs, it is more important to investigate the file
system bugs, rather than workarounds for any corrupted images.
And, definitely it needs to stop the kernel if any corrupted image was mounted,
so that we can figure out where the bugs are occurred."
Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this patch we use below inner macro and function to clean up codes.
1. ADDRS_PER_PAGE
2. SM_I
3. f2fs_readonly
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When we fail in ->write_begin()/->direct_IO(), our allocated node block in disk
and page cache are still kept, despite these may not be used again.
This patch introduce f2fs_write_failed() to handle the error case of these two
interfaces, it will truncate page cache and blocks of this file according to
i_size.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
kernel side(xx_init_acl), the acl is get/cloned from the parent dir's,
which is credible. So remove the redundant validation check of acl
here.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In our rename process, region of f2fs_lock_op covered is too big as some of the
code like f2fs_empty_dir/f2fs_find_entry are not needed to protect by this lock.
So in the extreme case like doing checkpoint when we rename old inode to exist
inode in a large directory could cause lower concurrency.
Let's reduce the region of f2fs_lock_op to fix this.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Although building NAT journal in cursum reduce the read/write work for NAT
block, but previous design leave us lower performance when write checkpoint
frequently for these cases:
1. if journal in cursum has already full, it's a bit of waste that we flush all
nat entries to page for persistence, but not to cache any entries.
2. if journal in cursum is not full, we fill nat entries to journal util
journal is full, then flush the left dirty entries to disk without merge
journaled entries, so these journaled entries may be flushed to disk at next
checkpoint but lost chance to flushed last time.
In this patch we merge dirty entries located in same NAT block to nat entry set,
and linked all set to list, sorted ascending order by entries' count of set.
Later we flush entries in sparse set into journal as many as we can, and then
flush merged entries to disk. In this way we can not only gain in performance,
but also save lifetime of flash device.
In my testing environment, it shows this patch can help to reduce NAT block
writes obviously. In hard disk test case: cost time of fsstress is stablely
reduced by about 5%.
1. virtual machine + hard disk:
fsstress -p 20 -n 200 -l 5
node num cp count nodes/cp
based 4599.6 1803.0 2.551
patched 2714.6 1829.6 1.483
2. virtual machine + 32g micro SD card:
fsstress -p 20 -n 200 -l 1 -w -f chown=0 -f creat=4 -f dwrite=0
-f fdatasync=4 -f fsync=4 -f link=0 -f mkdir=4 -f mknod=4 -f rename=5
-f rmdir=5 -f symlink=0 -f truncate=4 -f unlink=5 -f write=0 -S
node num cp count nodes/cp
based 84.5 43.7 1.933
patched 49.2 40.0 1.23
Our latency of merging op shows not bad when handling extreme case like:
merging a great number of dirty nats:
latency(ns) dirty nat count
3089219 24922
5129423 27422
4000250 24523
change log from v1:
o fix wrong logic in add_nat_entry when grab a new nat entry set.
o swith to create slab cache in create_node_manager_caches.
o use GFP_ATOMIC instead of GFP_NOFS to avoid potential long latency.
change log from v2:
o make comment position more appropriate suggested by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds f2fs_do_tmpfile to eliminate the redundant init_inode_metadata
flow.
Throught this, we can provide the consistent lock usage, e.g., fi->i_sem, and
this will enable better debugging stuffs.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add function f2fs_tmpfile() to support O_TMPFILE file creation, and modify logic
of init_inode_metadata to enable linkat temp file.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After we call find_data_page in truncate_partial_data_page, we could not
guarantee this page is updated or not as error may occurred in lower layer.
We'd better check status of the page to avoid this no updated page be
writebacked to device.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We have already set page update in ->write_begin, so we should remove redundant
SetPageUptodate in ->write_end.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=75861
Denis 2014-05-10 11:28:59 UTC reported:
"F2FS-fs (mmcblk0p28): mounting..
Unable to handle kernel NULL pointer dereference at virtual address 00000018
...
[<c0a2f678>] (_raw_spin_lock+0x3c/0x70) from [<c03a0330>] (issue_flush_thread+0x50/0x17c)
[<c03a0330>] (issue_flush_thread+0x50/0x17c) from [<c01b4064>] (kthread+0x98/0xa4)
[<c01b4064>] (kthread+0x98/0xa4) from [<c0108060>] (kernel_thread_exit+0x0/0x8)"
This patch assign cmd_control_info in sm_info before issue_flush_thread is being
created, so this make sure that issue flush thread will have no chance to access
invalid info in fcc.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we don't check the current backing device status, balance_dirty_pages can
fall into infinite pausing routine.
This can be occurred when a lot of directories make a small number of dirty
dentry pages including files.
Reported-by: Brian Chadwick <brianchad@westnet.com.au>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If an inode is renamed, it should be registered as file_lost_pino to conduct
checkpoint at f2fs_sync_file.
Otherwise, the inode cannot be recovered due to no dent_mark in the following
scenario.
Note that, this scenario is from xfstests/322.
1. create "a"
2. fsync "a"
3. rename "a" to "b"
4. fsync "b"
5. Sudden power-cut
After recovery is done, "b" should be seen.
However, the result shows "a", since the recovery procedure does not enter
recover_dentry due to no dent_mark.
The reason is like below.
- The nid of "a" is checkpointed during #2, f2fs_sync_file.
- The inode page for "b" produced by #3 is written without dent_mark by
sync_node_pages.
So, this patch fixes this bug by assinging file_lost_pino to the "a"'s inode.
If the pino is lost, f2fs_sync_file conducts checkpoint, and then recovers
the latest pino and its dentry information for further recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch correct releasing code of new_page to avoid BUG_ON in error patch of
f2fs_rename.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we fail in this path:
->init_inode_metadata
->make_empty_dir
->get_new_data_page
->grab_cache_page return -ENOMEM
We will bug on in error path of init_inode_metadata when call remove_inode_page
because i_block = 2 (one inode block will be released later & one dentry block).
We should release the dentry block in init_inode_metadata to avoid this BUG_ON,
and avoid leak of dentry block resource, because we never have second chance to
release that block in ->evict_inode as in upper error path we make this inode
'bad'.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch add lower bound verification for nid in check_nid_range, so nids
reserved like 0, node, meta passed by caller could be checked there.
And then check_nid_range could be used in f2fs_nfs_get_inode for simplifying
code.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes the fallocate bug like below. (See xfstests/255)
In fallocate(fd, 0, 20480),
expand_inode_data processes
for (index = pg_start; index <= pg_end; index++) {
f2fs_reserve_block();
...
}
So, even though fallocate requests 20480, 5 blocks, f2fs allocates 6 blocks
including pg_end.
So, this patch adds one condition to avoid block allocation.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previous get_block in f2fs didn't report the newly allocated region which has
NEW_ADDR.
For reader, it should not report, but fiemap needs this.
So, this patch introduces two get_block sharing core function.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull vfs updates from Al Viro:
"This the bunch that sat in -next + lock_parent() fix. This is the
minimal set; there's more pending stuff.
In particular, I really hope to get acct.c fixes merged this cycle -
we need that to deal sanely with delayed-mntput stuff. In the next
pile, hopefully - that series is fairly short and localized
(kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
iov_iter work. Most of prereqs for ->splice_write with sane locking
order are there and Kent's dio rewrite would also fit nicely on top of
this pile"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
lock_parent: don't step on stale ->d_parent of all-but-freed one
kill generic_file_splice_write()
ceph: switch to iter_file_splice_write()
shmem: switch to iter_file_splice_write()
nfs: switch to iter_splice_write_file()
fs/splice.c: remove unneeded exports
ocfs2: switch to iter_file_splice_write()
->splice_write() via ->write_iter()
bio_vec-backed iov_iter
optimize copy_page_{to,from}_iter()
bury generic_file_aio_{read,write}
lustre: get rid of messing with iovecs
ceph: switch to ->write_iter()
ceph_sync_direct_write: stop poking into iov_iter guts
ceph_sync_read: stop poking into iov_iter guts
new helper: copy_page_from_iter()
fuse: switch to ->write_iter()
btrfs: switch to ->write_iter()
ocfs2: switch to ->write_iter()
xfs: switch to ->write_iter()
...
iter_file_splice_write() - a ->splice_write() instance that gathers the
pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
it to ->write_iter(). A bunch of simple cases coverted to that...
[AV: fixed the braino spotted by Cyrill]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch-set includes the following major enhancement patches.
o enhance wait_on_page_writeback
o support SEEK_DATA and SEEK_HOLE
o enhance readahead flows
o enhance IO flushes
o support fiemap
o add some tracepoints
The other bug fixes are as follows.
o fix to support a large volume > 2TB correctly
o recovery bug fix wrt fallocated space
o fix recursive lock on xattr operations
o fix some cases on the remount flow
And, there are a bunch of cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTleLYAAoJEEAUqH6CSFDSdhEP/iY5hTZ02jH4ZiFPP/Nee4hS
l0BHeZvrMjoccaWUDu0ZvIPC8BiJ7kLOgK+/VWS7LAfY1PY11ALEtYQOrW+RM47+
jMfULegTod/F8WS2B6dk31QMhAZldtnsYvA5PS1VV3S0rht+qbOz+PDejZFgSMc3
VVQ7OZzq30gMmtsw7oh3FHfeTu4xe/bxygdMRXgljQQD2MvorJiOb4MA+ovEDd9z
CZMMTQvRKjc0d8LPf0gOkZEvG63GR6klCgFMuiappUsua0O8IPIEhCyEGFrE66vS
fObVKpqNAsR2ABhS2grgn6Q7UUvn4xrF6jOwDH5tuw2yzmkQiMAWINwBdgnbEy1c
D5S89PQ8TkQK9KBSoU0v8WKWC4HzWZF4ZEd6eq9VxVTS8iT0w8DtNHXK518FVC34
82iqrLc0EhrcGNiW/7Nrc6WzHrWqorylCFN7atB3ruhVqeMh3MZsDU4Gq0WgB2oh
pF9XVBEpJQpV35DYSAPzLkm+2+mwHVNqwdY3HcvUs7IP90+wZlrWSRG2FEfFt/e8
6nwvbORrHMTI5VfdYlEPgpjuesFmYPZqEvRGdaDa1YrHqhvvgStEPT9qiq2qVn9+
tr0HjpNRDD/etkaE6ujYOYqdxuk3mm6RY68h+KSbNcY1/VTvt2bN2telwWuDsxIq
8yOsxV2x3JB/euDAJsSU
=xqsO
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, there is no special interesting feature, but we've
investigated a couple of tuning points with respect to the I/O flow.
Several major bug fixes and a bunch of clean-ups also have been made.
This patch-set includes the following major enhancement patches:
- enhance wait_on_page_writeback
- support SEEK_DATA and SEEK_HOLE
- enhance readahead flows
- enhance IO flushes
- support fiemap
- add some tracepoints
The other bug fixes are as follows:
- fix to support a large volume > 2TB correctly
- recovery bug fix wrt fallocated space
- fix recursive lock on xattr operations
- fix some cases on the remount flow
And, there are a bunch of cleanups"
* tag 'for-f2fs-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (52 commits)
f2fs: support f2fs_fiemap
f2fs: avoid not to call remove_dirty_inode
f2fs: recover fallocated space
f2fs: fix to recover data written by dio
f2fs: large volume support
f2fs: avoid crash when trace f2fs_submit_page_mbio event in ra_sum_pages
f2fs: avoid overflow when large directory feathure is enabled
f2fs: fix recursive lock by f2fs_setxattr
MAINTAINERS: add a co-maintainer from samsung for F2FS
MAINTAINERS: change the email address for f2fs
f2fs: use inode_init_owner() to simplify codes
f2fs: avoid to use slab memory in f2fs_issue_flush for efficiency
f2fs: add a tracepoint for f2fs_read_data_page
f2fs: add a tracepoint for f2fs_write_{meta,node,data}_pages
f2fs: add a tracepoint for f2fs_write_{meta,node,data}_page
f2fs: add a tracepoint for f2fs_write_end
f2fs: add a tracepoint for f2fs_write_begin
f2fs: fix checkpatch warning
f2fs: deactivate inode page if the inode is evicted
f2fs: decrease the lock granularity during write_begin
...
There is an errorneous case during the recovery like below.
In recovery_dentry,
1) dir = f2fs_iget();
2) mark the dir with FI_DELAY_IPUT
3) goto unmap_out
After the end of recovery routine, there is no dirty dentries so the dir cannot
be released by iput in remove_dirty_dir_inode.
This patch fixes such the bug case by handling the iget and iput in the
recovery_dentry procedure.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage. The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.
The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.
The primary APIs of concern in this care are the following and are used
by most filesystems.
find_get_page
find_lock_page
find_or_create_page
grab_cache_page_nowait
grab_cache_page_write_begin
All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not. Then
old API is preserved but is basically a thin wrapper around this core
function.
Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted. This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change. It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.
The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.
The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
do to writback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.
The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.
async dd
3.15.0-rc3 3.15.0-rc3
vanilla accessed-v2
ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)
The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.
samples percentage
ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If data are overwritten through dio, previous f2fs doesn't remain the fsync mark
due to no additional node writes.
Note that this patch should resolve the xfstests:311.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs's cp has one page which consists of struct f2fs_checkpoint and
version bitmap of sit and nat. To support lots of segments, we need more
blocks for sit bitmap. So let's arrange sit bitmap as following:
+-----------------+------------+
| f2fs_checkpoint | sit bitmap |
| + nat bitmap | |
+-----------------+------------+
0 4k N blocks
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: simple code change for readability]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously we allocate pages with no mapping in ra_sum_pages(), so we may
encounter a crash in event trace of f2fs_submit_page_mbio where we access
mapping data of the page.
We'd better allocate pages in bd_inode mapping and invalidate these pages after
we restore data from pages. It could avoid crash in above scenario.
Changes from V1
o remove redundant code in ra_sum_pages() suggested by Jaegeuk Kim.
Call Trace:
[<f1031630>] ? ftrace_raw_event_f2fs_write_checkpoint+0x80/0x80 [f2fs]
[<f10377bb>] f2fs_submit_page_mbio+0x1cb/0x200 [f2fs]
[<f103c5da>] restore_node_summary+0x13a/0x280 [f2fs]
[<f103e22d>] build_curseg+0x2bd/0x620 [f2fs]
[<f104043b>] build_segment_manager+0x1cb/0x920 [f2fs]
[<f1032c85>] f2fs_fill_super+0x535/0x8e0 [f2fs]
[<c115b66a>] mount_bdev+0x16a/0x1a0
[<f102f63f>] f2fs_mount+0x1f/0x30 [f2fs]
[<c115c096>] mount_fs+0x36/0x170
[<c1173635>] vfs_kern_mount+0x55/0xe0
[<c1175388>] do_mount+0x1e8/0x900
[<c1175d72>] SyS_mount+0x82/0xc0
[<c16059cc>] sysenter_do_call+0x12/0x22
Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When large directory feathure is enable, We have one case which could cause
overflow in dir_buckets() as following:
special case: level + dir_level >= 32 and level < MAX_DIR_HASH_DEPTH / 2.
Here we define MAX_DIR_BUCKETS to limit the return value when the condition
could trigger potential overflow.
Changes from V1
o modify description of calculation in f2fs.txt suggested by Changman Lee.
Suggested-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch should resolve the following recursive lock.
[<ffffffff8135a9c3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffffa01749dc>] f2fs_setxattr+0x5c/0xa0 [f2fs]
[<ffffffffa0174c99>] __f2fs_set_acl+0x1b9/0x340 [f2fs]
[<ffffffffa017515a>] f2fs_init_acl+0x4a/0xcb [f2fs]
[<ffffffffa0159abe>] __f2fs_add_link+0x26e/0x780 [f2fs]
[<ffffffffa015d4d8>] f2fs_mkdir+0xb8/0x150 [f2fs]
[<ffffffff811cebd7>] vfs_mkdir+0xb7/0x160
[<ffffffff811cf89b>] SyS_mkdir+0xab/0xe0
[<ffffffff817244bf>] tracesys+0xe1/0xe6
[<ffffffffffffffff>] 0xffffffffffffffff
The call path indicates:
- f2fs_add_link
: down_write(&fi->i_sem);
- init_inode_metadata
- f2fs_init_acl
- __f2fs_set_acl
- f2fs_setxattr
: down_write(&fi->i_sem);
Here we should not call f2fs_setxattr, but __f2fs_setxattr.
But __f2fs_setxattr is a static function in xattr.c, so that I found the other
generic approach to use f2fs_setxattr.
In f2fs_setxattr, the page pointer is only given from init_inode_metadata.
So, this patch adds this condition to avoid this in f2fs_setxattr.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch uses exported inode_init_owner() to simplify codes in
f2fs_new_inode().
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If we use slab memory in f2fs_issue_flush(), we will face memory pressure and
latency time caused by racing of kmem_cache_{alloc,free}.
Let's alloc memory in stack instead of slab.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a tracepoint for f2fs_read_data_page to trace when page is
readed by user.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a tracepoint for f2fs_write_{meta,node,data}_pages to trace when
pages are fsyncing/flushing.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a tracepoint for f2fs_write_{meta,node,data}_page to trace when
page is writting out.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a tracepoint for f2fs_write_end to trace write op of user.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a tracepoint for f2fs_write_begin to trace write op of user.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
fix the following checkpatch warning:
WARNING: do {} while (0) macros should not be semicolon terminated
Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If the inode page is clean during its inode eviction, it'd better drop the page
to reduce further memory pressure.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch reduces the lock granularity during write_begin.
When the system is under memory pressure, it would be better to reduce
the locking time for the data pages.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We don't need to wait on page writeback for these cases.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch splits grab_cache_page_write_begin into grab_cache_page and
wait_on_page_writeback for node pages.
This patch intends to enhance the latency to get node pages by alleviating
unnecessary wait_on_page_writeback.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previous we do not truncate inline data in inode page when setattr, so following
case could still read the inline data which has already truncated:
1.write inline data
2.ftruncate size to 0
3.ftruncate size to max inline data size
4.read from offset 0
This patch introduces truncate_inline_data() to fix this problem.
change log from v1:
o fix a bug and do not truncate first page data after truncate inline data.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We have no so such readahead mechanism in ->iterate() path as the one in
->read() path, it cause low performance when we read large directory.
This patch add readahead in f2fs_readdir() for better performance.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We should set the error number correctly when we fail in recover_dentry(), so
the recover flow could stop for the reason as error number shows instead of
continuing.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If an amount of data are allocated though fallocate and user writes a couple of
data among the space, f2fs should return the data offset made by user when
SEEK_DATA is requested.
For example, (N: NEW_ADDR by fallocate, X: NEW_ADDR by user)
1) fallocate 0 ~ 10MB
f -> N N N N N N N N N N N N ... N
2) write 4KB at 5MB offset
f -> N N N N N X N N N N N N ... N
3) SEEK_DATA from 0 should return 5MB offset
So, this patch adds a routine to search the first dirty page to handle that.
Then, the SEEK_DATA flow skips NEW_ADDR offsets until any dirty page is found.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When SEEK_HOLE is requeted, it should return i_size if the hole position is
found outside of i_size.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In This patch we introduce f2fs_seek_block to support SEEK_{DATA,HOLE} of
lseek(2).
change log from v1:
o fix bug when lseek from middle of page and fix wrong calculation of
PGOFS_OF_NEXT_DNODE macro.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce help function {create,destroy}_flush_cmd_control to clean up
the create/destory flush merge operation.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Split the flush_merge fields from sm_i, and use the new struct flush_cmd_control
to wrap it, so that we can igonre these fileds if flush_merge is disable, and
it alse can the structs more neat.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce help macro ADDRS_PER_PAGE() to get the number of address pointers in
direct node or inode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If f2fs_write_data_page is called through the reclaim path, we should submit
the bio right away.
This patch resolves the following issue that Marc Dietrich reported.
"It took me a while to bisect a problem which causes my ARM (tegra2) netbook to
frequently stall for 5-10 seconds when I enable EXA acceleration (opentegra
experimental ddx)."
And this patch fixes that.
Reported-by: Marc Dietrich <marvin24@gmx.de>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch passes the "flags" field to the low level setxattr functions
to use XATTR_REPLACE in the following patches.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We'd better handle inline data case independently in f2fs_bmap().
It can reduce our handling time in f2fs_bmap().
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If so many dirty dentry blocks are cached, not reached to the flush condition,
we should fall into livelock in balance_dirty_pages.
So, let's consider the mem size for the condition.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If the disk has some garbage blocks, F2FS is able to face with BUG_ON when
recovering direct node blocks.
This patch detects the error case and avoids that prior to reaching BUG_ON.
Alexey Khoroshilov addressed the potential security issues as follows.
"An ability to trigger a BUG_ON assert by mounting a crafted image is
usually considered as a local denial of service [1-3]. As far as I
understand, the reason is that some kernel data may become inconsistent
that can lead to further problems.
[1] http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-3353
[2] http://www.openwall.com/lists/oss-security/2011/06/24/4
[3] http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-2928
etc."
Reported-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces available_nids for alloc_nids() and fixes max_nid for
build_free_nids() and scan_nat_pages().
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
inline get_max_meta_blks is only used in checkpoint.c
Use standard static inline format.
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduce raw_nat_from_node_info() to simplfy some codes, and also
use exist function node_info_from_raw_nat() to do the same job.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Add the *remount* handle of flush_merge option, so that the users
can enable flush_merge in the runtime, such as the underlying device
handles the cache_flush command relatively slowly.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Use set_mask_bits() to atomically set i_flags instead of clearing out the
S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
FS_IMMUTABLE_FL, FS_APPEND_FL, etc. flags, since this opens up a race
where an immutable file has the immutable flag cleared for a brief
window of time.
Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Make recover_inline_xattr() static, because this function is
used only in this file.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch removes list opeations in handling dirty dir inodes.
Previously, F2FS traverses whole the list of dirty dir inodes to check whether
there is an existing inode or not, resulting in heavy CPU overheads.
So this patch removes such the traverse operations by adding FI_DIRTY_DIR to
indicate the inode lies on the list or not.
Through this simple flag, we can remove redundant operations gracefully.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch replace some general codes with redirty_page_for_writepage, which
can be enabled after consideration on additional procedure like counting dirty
pages appropriately.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The f2fs always scans the next chain of direct node blocks.
But some garbage blocks are able to be remained due to no discard support or
SSR triggers.
This occasionally wreaks recovering wrong inodes that were used or BUG_ONs
due to reallocating node ids as follows.
When mount this f2fs image:
http://linuxtesting.org/downloads/f2fs_fault_image.zip
BUG_ON is triggered in f2fs driver (messages below are generated on
kernel 3.13.2; for other kernels output is similar):
kernel BUG at fs/f2fs/node.c:215!
Call Trace:
[<ffffffffa032ebad>] recover_inode_page+0x1fd/0x3e0 [f2fs]
[<ffffffff811446e7>] ? __lock_page+0x67/0x70
[<ffffffff81089990>] ? autoremove_wake_function+0x50/0x50
[<ffffffffa0337788>] recover_fsync_data+0x1398/0x15d0 [f2fs]
[<ffffffff812b9e5c>] ? selinux_d_instantiate+0x1c/0x20
[<ffffffff811cb20b>] ? d_instantiate+0x5b/0x80
[<ffffffffa0321044>] f2fs_fill_super+0xb04/0xbf0 [f2fs]
[<ffffffff811b861e>] ? mount_bdev+0x7e/0x210
[<ffffffff811b8769>] mount_bdev+0x1c9/0x210
[<ffffffffa0320540>] ? validate_superblock+0x210/0x210 [f2fs]
[<ffffffffa031cf8d>] f2fs_mount+0x1d/0x30 [f2fs]
[<ffffffff811b9497>] mount_fs+0x47/0x1c0
[<ffffffff81166e00>] ? __alloc_percpu+0x10/0x20
[<ffffffff811d4032>] vfs_kern_mount+0x72/0x110
[<ffffffff811d6763>] do_mount+0x493/0x910
[<ffffffff811615cb>] ? strndup_user+0x5b/0x80
[<ffffffff811d6c70>] SyS_mount+0x90/0xe0
[<ffffffff8166f8d9>] system_call_fastpath+0x16/0x1b
Found by Linux File System Verification project (linuxtesting.org).
Reported-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Enable flush_merge only in f2fs is not read-only, so does the mount
option show.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Merge second patch-bomb from Andrew Morton:
- the rest of MM
- zram updates
- zswap updates
- exit
- procfs
- exec
- wait
- crash dump
- lib/idr
- rapidio
- adfs, affs, bfs, ufs
- cris
- Kconfig things
- initramfs
- small amount of IPC material
- percpu enhancements
- early ioremap support
- various other misc things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (156 commits)
MAINTAINERS: update Intel C600 SAS driver maintainers
fs/ufs: remove unused ufs_super_block_third pointer
fs/ufs: remove unused ufs_super_block_second pointer
fs/ufs: remove unused ufs_super_block_first pointer
fs/ufs/super.c: add __init to init_inodecache()
doc/kernel-parameters.txt: add early_ioremap_debug
arm64: add early_ioremap support
arm64: initialize pgprot info earlier in boot
x86: use generic early_ioremap
mm: create generic early_ioremap() support
x86/mm: sparse warning fix for early_memremap
lglock: map to spinlock when !CONFIG_SMP
percpu: add preemption checks to __this_cpu ops
vmstat: use raw_cpu_ops to avoid false positives on preemption checks
slub: use raw_cpu_inc for incrementing statistics
net: replace __this_cpu_inc in route.c with raw_cpu_inc
modules: use raw_cpu_write for initialization of per cpu refcount.
mm: use raw_cpu ops for determining current NUMA node
percpu: add raw_cpu_ops
slub: fix leak of 'name' in sysfs_slab_add
...
filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.
It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch-set includes the following major enhancement patches.
o introduce large directory support
o introduce f2fs_issue_flush to merge redundant flush commands
o merge write IOs as much as possible aligned to the segment
o add sysfs entries to tune the f2fs configuration
o use radix_tree for the free_nid_list to reduce in-memory operations
o remove costly bit operations in f2fs_find_entry
o enhance the readahead flow for CP/NAT/SIT/SSA blocks
The other bug fixes are as follows.
o recover xattr node blocks correctly after sudden-power-cut
o fix to calculate the maximum number of node ids
o enhance to handle many error cases
And, there are a bunch of cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJTQiQrAAoJEEAUqH6CSFDSlbIP/iq06BrUeMDLoQFhA2GQFKFD
wd0A5h9hCiFcKBcI/u/aAQqj/a5wdwzDl9XzH2PzJ45IM6sVGQZ0lv+kdLhab6rk
ipNbV7G0yLAX+8ygS6GZF7pSKfMzGSGTrRvfdtoiunIip1jCY1IkUxv1XMgBSPza
wnWYrE5HXEqRUDCqPXJyxrPmx0/0jw8/V82Ng9stnY34ySs+l/3Pvg65Kh0QuSSy
BRjJUGlOCF68KUBKd+6YB2T5KlbQde3/5lhP+GMOi+xm5sFB+j+59r/WpJpF2Nxs
ImxQs5GkiU01ErH/rn5FgHY/zzddQenBKwOvrjEeUA1eVpBurdsIr1JN0P6qDbgB
ho5U8LzCQq+HZiW444eQGkXSOagpUKqDhTVJO7Fji/wG88Atc9gLX3ix8TH2skxT
C5CvvrJM7DKBtkZyTzotKY/cWorOZhge6E/EkbGaM1sSHdK5b1Rg4YlFi9TDyz0n
QjGD1uuvEeukeKGdIG9pjc7o5ledbMDYwLpT2RuRXenLOTsn8BqDOo9aRTg+5Kag
tJNJLFumjPR2mEBNKjicJMUf381J/SKDwZszAz9mgvCZXldMza/Ax0LzJDJCVmkP
UuBiVzGxVzpd33IsESUDr0J9hc+t8kS10jfAeKnE3cpb6n7/RYxstHh6CHOFKNXM
gPUSYPN3CYiP47DnSfzA
=eSW+
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This patch-set includes the following major enhancement patches.
- introduce large directory support
- introduce f2fs_issue_flush to merge redundant flush commands
- merge write IOs as much as possible aligned to the segment
- add sysfs entries to tune the f2fs configuration
- use radix_tree for the free_nid_list to reduce in-memory operations
- remove costly bit operations in f2fs_find_entry
- enhance the readahead flow for CP/NAT/SIT/SSA blocks
The other bug fixes are as follows:
- recover xattr node blocks correctly after sudden-power-cut
- fix to calculate the maximum number of node ids
- enhance to handle many error cases
And, there are a bunch of cleanups"
* tag 'for-f2fs-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (62 commits)
f2fs: fix wrong statistics of inline data
f2fs: check the acl's validity before setting
f2fs: introduce f2fs_issue_flush to avoid redundant flush issue
f2fs: fix to cover io->bio with io_rwsem
f2fs: fix error path when fail to read inline data
f2fs: use list_for_each_entry{_safe} for simplyfying code
f2fs: avoid free slab cache under spinlock
f2fs: avoid unneeded lookup when xattr name length is too long
f2fs: avoid unnecessary bio submit when wait page writeback
f2fs: return -EIO when node id is not matched
f2fs: avoid RECLAIM_FS-ON-W warning
f2fs: skip unnecessary node writes during fsync
f2fs: introduce fi->i_sem to protect fi's info
f2fs: change reclaim rate in percentage
f2fs: add missing documentation for dir_level
f2fs: remove unnecessary threshold
f2fs: throttle the memory footprint with a sysfs entry
f2fs: avoid to drop nat entries due to the negative nr_shrink
f2fs: call f2fs_wait_on_page_writeback instead of native function
f2fs: introduce nr_pages_to_write for segment alignment
...
If we remove a file that has inline data after mount, our statistics turns to
inaccurate.
cat /sys/kernel/debug/f2fs/status
- Inline_data Inode: 4294967295
Let's add stat_inc_inline_inode() to stat inline info of the file when lookup.
Change log from v1:
o stat in f2fs_lookup() instead of in do_read_inode() for excluding wrong stat.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Before setting the acl, call posix_acl_valid() to check if it is
valid or not.
Signed-off-by: zhangzhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Some storage devices show relatively high latencies to complete cache_flush
commands, even though their normal IO speed is prettry much high. In such
the case, it needs to merge cache_flush commands as much as possible to avoid
issuing them redundantly.
So, this patch introduces a mount option, "-o flush_merge", to mitigate such
the overhead.
If this option is enabled by user, F2FS merges the cache_flush commands and then
issues just one cache_flush on behalf of them. Once the single command is
finished, F2FS sends a completion signal to all the pending threads.
Note that, this option can be used under a workload consisting of very intensive
concurrent fsync calls, while the storage handles cache_flush commands slowly.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
and COLLAPSE_RANGE fallocate operations, and scalability improvements
in the jbd2 layer and in xattr handling when the extended attributes
spill over into an external block.
Other than that, the usual clean ups and minor bug fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTPbD2AAoJENNvdpvBGATwDmUQANSfGYIQazB8XKKgtNTMiG/Y
Ky7n1JzN9lTX/6nMsqQnbfCweLRmxqpWUBuyKDRHUi8IG0/voXSTFsAOOgz0R15A
ERRRWkVvHixLpohuL/iBdEMFHwNZYPGr3jkm0EIgzhtXNgk5DNmiuMwvHmCY27kI
kdNZIw9fip/WRNoFLDBGnLGC37aanoHhCIbVlySy5o9LN1pkC8BgXAYV0Rk19SVd
bWCudSJEirFEqWS5H8vsBAEm/ioxTjwnNL8tX8qms6orZ6h8yMLFkHoIGWPw3Q15
a0TSUoMyav50Yr59QaDeWx9uaPQVeK41wiYFI2rZOnyG2ts0u0YXs/nLwJqTovgs
rzvbdl6cd3Nj++rPi97MTA7iXK96WQPjsDJoeeEgnB0d/qPyTk6mLKgftzLTNgSa
ZmWjrB19kr6CMbebMC4L6eqJ8Fr66pCT8c/iue8wc4MUHi7FwHKH64fqWvzp2YT/
+165dqqo2JnUv7tIp6sUi1geun+bmDHLZFXgFa7fNYFtcU3I+uY1mRr3eMVAJndA
2d6ASe/KhQbpVnjKJdQ8/b833ZS3p+zkgVPrd68bBr3t7gUmX91wk+p1ct6rUPLr
700F+q/pQWL8ap0pU9Ht/h3gEJIfmRzTwxlOeYyOwDseqKuS87PSB3BzV3dDunSU
DrPKlXwIgva7zq5/S0Vr
=4s1Z
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
in the jbd2 layer and in xattr handling when the extended attributes
spill over into an external block.
Other than that, the usual clean ups and minor bug fixes"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
ext4: fix premature freeing of partial clusters split across leaf blocks
ext4: remove unneeded test of ret variable
ext4: fix comment typo
ext4: make ext4_block_zero_page_range static
ext4: atomically set inode->i_flags in ext4_set_inode_flags()
ext4: optimize Hurd tests when reading/writing inodes
ext4: kill i_version support for Hurd-castrated file systems
ext4: each filesystem creates and uses its own mb_cache
fs/mbcache.c: doucple the locking of local from global data
fs/mbcache.c: change block and index hash chain to hlist_bl_node
ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
ext4: refactor ext4_fallocate code
ext4: Update inode i_size after the preallocation
ext4: fix partial cluster handling for bigalloc file systems
ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
ext4: only call sync_filesystm() when remounting read-only
fs: push sync_filesystem() down to the file system's remount_fs()
jbd2: improve error messages for inconsistent journal heads
jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
jbd2: minimize region locked by j_list_lock in journal_get_create_access()
...
Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page. As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently. At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.
Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate. Reclaim will check
for this flag before installing shadow pages.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the f2fs_wait_on_page_writeback, io->bio should be covered by io_rwsem.
Otherwise, the bio pointer can become a dangling pointer due to data races.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We should unlock page in ->readpage() path and also should unlock & release page
in error path of ->write_begin() to avoid deadlock or memory leak.
So let's add release code to fix the problem when we fail to read inline data.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch use list_for_each_entry{_safe} instead of list_for_each{_safe} for
simplfying code.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Move kmem_cache_free out of spinlock protection region for better performance.
Change log from v1:
o remove spinlock protection for kmem_cache_free in destroy_node_manager
suggested by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In f2fs_setxattr we have limit this attribute name length, so we should also
check it in f2fs_getxattr to avoid useless lookup caused by invalid name length.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduce is_merged_page() to check whether current page is merged
in f2fs bio cache. When page is not in cache, we can avoid submitting bio cache,
resulting in having more chance to merge pages.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
During the cleaing of node segments, F2FS can get errored node blocks due to
data race between node page lock and its valid bitmap operations.
In that case, it needs to return an error to skip such the obsolete block copy.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If multiple redundant fsync calls are triggered, we don't need to write its
node pages with fsync mark continuously.
So, this patch adds FI_NEED_FSYNC to track whether the latest node block is
written with the fsync mark or not.
If the mark was set, a new fsync doesn't need to write a node block.
Otherwise, we should do a new node block with the mark for roll-forward
recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces fi->i_sem to protect fi's info that includes xattr_ver,
pino, i_nlink.
This enables to remove i_mutex during f2fs_sync_file, resulting in performance
improvement when a number of fsync calls are triggered from many concurrent
threads.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
It is more reasonable to determine the reclaiming rate of prefree segments
according to the volume size, which is set to 5% by default.
For example, if the volume is 128GB, the prefree segments are reclaimed
when the number reaches to 6.4GB.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The NM_WOUT_THRESHOLD is now obsolete since f2fs starts to control on a basis
of the memory footprint.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces ram_thresh, a sysfs entry, which controls the memory
footprint used by the free nid list and the nat cache.
Previously, the free nid list was controlled by MAX_FREE_NIDS, while the nat
cache was managed by NM_WOUT_THRESHOLD.
However, this approach cannot be applied dynamically according to the system.
So, this patch adds ram_thresh that users can specify the threshold, which is
in order of 1 / 1024.
For example, if the total ram size is 4GB and the value is set to 10 by default,
f2fs tries to control the number of free nids and nat caches not to consume over
10 * (4GB / 1024) = 10MB.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The try_to_free_nats should not receive the negative nr_shrink.
Otherwise, it can drop all the nat entries by the while loop.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If a page is on writeback, f2fs can face with deadlock due to under writepages.
This is caused by merging IOs inside f2fs, so if it comes to detect, let's throw
merged IOs, which is implemented by f2fs_wait_on_page_writeback.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces nr_pages_to_write to align page writes to the segment
or other operational unit size, which can be tuned according to the system
environment.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces nr_pages_to_skip(sbi, type) to determine writepages can
be skipped.
The dentry, node, and meta pages can be conrolled by F2FS without breaking the
FS consistency.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously 'background_gc={on***,off***}' is being parsed as correct option,
with this patch we cloud fix the trivial bug in mount process.
Change log from v1:
o need to check length of parameter suggested by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We should return error number of read_normal_summaries instead of -EINVAL when
read_normal_summaries failed.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces a help function f2fs_has_xattr_block for better
readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Upper bound checking of ino should be added to f2fs_nfs_get_inode, so unneeded
process before do_read_inode in f2fs_iget could be avoided when ino is invalid.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces a help function f2fs_has_inline_xattr for better
readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously we do not recover inline xattr data of inode after power-cut, so
inline xattr data may be lost.
We should recover the data during the roll-forward process.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, we ra_sum_pages to pre-read contiguous pages as more
as possible, and if we fail to alloc more pages, an ENOMEM error
will be reported upstream, even though we have alloced some pages
yet. In fact, we can use the available pages to do the job partly,
and continue the rest in the following circle. Only reporting ENOMEM
upstream if we really can not alloc any available page.
And another fix is ignoring dealing with the following pages if an
EIO occurs when reading page from page_list.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: modify the flow for better neat code]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Integrated a couple of minor changes for better readability suggested by
Chao Yu.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch fixes performance regression of dbench reported by
Alex <hbx7d@yandex.com>.
This issue was revealed by Phoronix tests results:
http://www.phoronix.com/scan.php?page=article&item=linux_314_ssdfs&num=2
It turns out that we need to assign WRITE_SYNC to the node writes, if
fsync is triggered.
The performance numbers are like below, which is measured by Alex.
1. 355MB/s ext4
2. 225MB/s f2fs : WRITE for node writes
3. 525MB/s f2fs : WRITE_SYNC for node writes
Reported-And-Tested-by: Alex <hbx7d@yandex.com>.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We should de-account dirty counters for page when redirty in ->writepage().
Wu Fengguang described in 'commit 971767caf632190f77a40b4011c19948232eed75':
"writeback: fix dirtied pages accounting on redirty
De-account the accumulative dirty counters on page redirty.
Page redirties (very common in ext4) will introduce mismatch between
counters (a) and (b)
a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
b) NR_WRITTEN, BDI_WRITTEN
This will introduce systematic errors in balanced_rate and result in
dirty page position errors (ie. the dirty pages are no longer balanced
around the global/bdi setpoints)."
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch use existing macro F2FS_INODE/NEXT_FREE_BLKADDR to clean up some
codes.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If there are multi segments in one section, we will read those SSA blocks which
have contiguous address one by one in f2fs_gc. It may lost performance, let's
read ahead SSA blocks by merge multi read request.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds an sysfs entry to control dir_level used by the large directory.
The description of this entry is:
dir_level This parameter controls the directory level to
support large directory. If a directory has a
number of files, it can reduce the file lookup
latency by increasing this dir_level value.
Otherwise, it needs to decrease this value to
reduce the space overhead. The default value is 0.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces an i_dir_level field to support large directory.
Previously, f2fs maintains multi-level hash tables to find a dentry quickly
from a bunch of chiild dentries in a directory, and the hash tables consist of
the following tree structure as below.
In Documentation/filesystems/f2fs.txt,
----------------------
A : bucket
B : block
N : MAX_DIR_HASH_DEPTH
----------------------
level #0 | A(2B)
|
level #1 | A(2B) - A(2B)
|
level #2 | A(2B) - A(2B) - A(2B) - A(2B)
. | . . . .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
. | . . . .
level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
But, if we can guess that a directory will handle a number of child files,
we don't need to traverse the tree from level #0 to #N all the time.
Since the lower level tables contain relatively small number of dentries,
the miss ratio of the target dentry is likely to be high.
In order to avoid that, we can configure the hash tables sparsely from level #0
like this.
level #0 | A(2B) - A(2B) - A(2B) - A(2B)
level #1 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
. | . . . .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
. | . . . .
level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
With this structure, we can skip the ineffective tree searches in lower level
hash tables.
This patch adds just a facility for this by introducing i_dir_level in
f2fs_inode.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
It turns out that a bit operation like find_next_bit is not always fast enough
for f2fs_find_entry.
Instead, it is pretty much simple and fast to traverse each dentries.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The stat_show is just to show the current status of f2fs.
So, we can remove all the there-in locks.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces a radix tree for the list of free_nids, which enhances
the performance on free nid management.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce help macro on_build_free_nids() which just uses build_lock
to judge whether the building free nid is going, so that we can remove
the on_build_free_nids field from f2fs_sb_info.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: remove an unnecessary white line removal]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The nat cache entry maintains a status whether it is checkpointed or not.
So, if a new cache entry is loaded from the last checkpoint,
nat_entry->checkpointed should be true.
If the cache entry is modified as being dirty, nat_entry->checkpoint should
be false.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
At the end of the recovery procedure, write_checkpoint is called and updates
the cp count which is managed by f2fs stat.
But, previously build_stat() is called after the recovery procedure, which
results in:
BUG: unable to handle kernel NULL pointer dereference at 000000000000012c
IP: [<ffffffffa03b1030>] write_checkpoint+0x720/0xbc0 [f2fs]
Call Trace:
[<ffffffff810a6b44>] ? mark_held_locks+0x74/0x140
[<ffffffff8109a3e0>] ? __init_waitqueue_head+0x60/0x60
[<ffffffffa03bf036>] recover_fsync_data+0x656/0xf20 [f2fs]
[<ffffffff812ee3eb>] ? security_d_instantiate+0x1b/0x30
[<ffffffffa03aeb4d>] f2fs_fill_super+0x94d/0xa00 [f2fs]
[<ffffffff811a9825>] mount_bdev+0x1a5/0x1f0
[<ffffffff8114915e>] ? __get_free_pages+0xe/0x40
[<ffffffffa03ae200>] ? f2fs_remount+0x130/0x130 [f2fs]
[<ffffffffa03aa575>] f2fs_mount+0x15/0x20 [f2fs]
[<ffffffff811aa713>] mount_fs+0x43/0x1b0
[<ffffffff811c7124>] vfs_kern_mount+0x74/0x160
[<ffffffff811c5cb1>] ? __get_fs_type+0x51/0x60
[<ffffffff811c9727>] do_mount+0x237/0xb50
[<ffffffff811c936a>] ? copy_mount_options+0x3a/0x170
So, this patche changes the order of recovery_fsync_data() and
f2fs_build_stats().
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Even if f2fs_write_data_page is called by the page reclaiming path, we should
not write the page to provide enough free segments for the worst case scenario.
Otherwise, f2fs can face with no free segment while gc is conducted, resulting
in:
------------[ cut here ]------------
kernel BUG at /home/zeus/f2fs_test/src/fs/f2fs/segment.c:565!
RIP: 0010:[<ffffffffa02c3b11>] [<ffffffffa02c3b11>] new_curseg+0x331/0x340 [f2fs]
Call Trace:
allocate_segment_by_default+0x204/0x280 [f2fs]
allocate_data_block+0x108/0x210 [f2fs]
write_data_page+0x8a/0xc0 [f2fs]
do_write_data_page+0xe1/0x2a0 [f2fs]
move_data_page+0x8a/0xf0 [f2fs]
f2fs_gc+0x446/0x970 [f2fs]
f2fs_balance_fs+0xb6/0xd0 [f2fs]
f2fs_write_begin+0x50/0x350 [f2fs]
? unlock_page+0x27/0x30
? unlock_page+0x27/0x30
generic_file_buffered_write+0x10a/0x280
? file_update_time+0xa3/0xf0
__generic_file_aio_write+0x1c8/0x3d0
? generic_file_aio_write+0x52/0xb0
? generic_file_aio_write+0x52/0xb0
generic_file_aio_write+0x65/0xb0
do_sync_write+0x5a/0x90
vfs_write+0xc5/0x1f0
SyS_write+0x55/0xa0
system_call_fastpath+0x16/0x1b
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch shows the counts of checkpoint in f2fs' status.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch help us to cleanup the readahead code by merging ra_{sit,nat}_pages
function into ra_meta_pages.
Additionally the new function is used to readahead cp block in
recover_orphan_inodes.
Change log from v1:
o fix a deadloop bug pointed by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously without protection of inode mutex, f2fs_falloc and other data
correlated operations will interfere with each other.
So let's use inode mutex to keep atomicity of f2fs_falloc.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If f2fs entered errorneous checkpoint status, it should skip writing meta
pages instead of redirtying the pages out.
Otherwise, it cannot unmount the partition even though f2fs is under read-only
status.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When a new directory is allocated, if an error is occurred, we should truncate
preallocated dentry pages too.
This bug was reported by Andrey Tsyvarev after a while as follows.
mkdir()->
f2fs_add_link()->
init_inode_metadata()->
f2fs_init_acl()->
f2fs_get_acl()->
f2fs_getxattr()->
read_all_xattrs() fails.
Also there was a BUG_ON triggered after the fault in
mkdir()->
f2fs_add_link()->
init_inode_metadata()->
remove_inode_page() ->
f2fs_bug_on(inode->i_blocks != 0 && inode->i_blocks != 1);
But, previous patch wasn't perfect to resolve that bug, so the following bug
report was also submitted.
kernel BUG at fs/f2fs/inode.c:274!
Call Trace:
[<ffffffff811fde03>] evict+0xa3/0x1a0
[<ffffffff811fe615>] iput+0xf5/0x180
[<ffffffffa01c7f63>] f2fs_mkdir+0xf3/0x150 [f2fs]
[<ffffffff811f2a77>] vfs_mkdir+0xb7/0x160
[<ffffffff811f36bf>] SyS_mkdir+0x5f/0xc0
[<ffffffff81680769>] system_call_fastpath+0x16/0x1b
Finally, this patch resolves all the issues like below.
If an error is occurred after make_empty_dir(),
1. truncate_inode_pages()
The make_bad_inode() prior to iput() will change i_mode to S_IFREG, which
means that f2fs will not decrement fi->dirty_dents during f2fs_evict_inode.
But, by calling it here, we can do that.
2. truncate_blocks()
Preallocated dentry pages are trucated here to sync i_blocks.
3. remove_dirty_dir_inode()
Remove this directory inode from the list.
Reported-and-Tested-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch modifies flow a little bit to avoid the following build warnings.
src/fs/f2fs/recovery.c: In function ‘check_index_in_prev_nodes’:
src/fs/f2fs/recovery.c:288:51: warning: ‘sum.<U5390>.<U52f8>.ofs_in_node’ may
be used uninitialized in this function [-Wmaybe-uninitialized]
src/fs/f2fs/recovery.c:260:23: warning: ‘sum.nid’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This is the erroneous scenario.
i_size on-disk i_size i_blocks
__f2fs_add_link() 4096 4096 2
get_new_data_page 8192 4096 3
-ENOSPC = init_inode_metadata
checkpoint - 4096 3
POR and reboot
__f2fs_add_link() 4096 4096 3
page = get_new_data_page (page->index = 1 by NEW_ADDR)
add a dentry to the page successfully
f2fs_rmdir()
f2fs_empty_dir() 4096 4096 3
f2fs_unlink() goes, since there is no valid dentry due to i_size = 4096.
But, still there is one dentry in page->index = 1.
So this patch moves the code to write dir->i_size into on-disk i_size in order
to sync dir's i_size, on-disk i_size, and its i_blocks.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch modifies the use of bi_private to remove pointer chasing for sbi.
Previously, we had a bi_private structure, but it needs memory allocation.
So this patch uses bi_private by the sbi pointer and adds a completion pointer
into the sbi.
This can achieve no memory allocation and nice use of the bi_private.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If a new xattr node page was allocated and its inode is fsynced, we should
recover the xattr node page during the roll-forward process after power-cut.
But, previously, f2fs didn't handle that case, resulting in kernel panic as
follows reported by Tom Li.
BUG: unable to handle kernel paging request at ffffc9001c861a98
IP: [<ffffffffa0295236>] check_index_in_prev_nodes+0x86/0x2d0 [f2fs]
Call Trace:
[<ffffffff815ece9b>] ? printk+0x48/0x4a
[<ffffffffa029626a>] recover_fsync_data+0xdca/0xf50 [f2fs]
[<ffffffffa02873ae>] f2fs_fill_super+0x92e/0x970 [f2fs]
[<ffffffff8112c9f8>] mount_bdev+0x1b8/0x200
[<ffffffffa0286a80>] ? f2fs_remount+0x130/0x130 [f2fs]
[<ffffffffa0285e40>] f2fs_mount+0x10/0x20 [f2fs]
[<ffffffff8112d4de>] mount_fs+0x3e/0x1b0
[<ffffffff810ef4eb>] ? __alloc_percpu+0xb/0x10
[<ffffffff8114761f>] vfs_kern_mount+0x6f/0x120
[<ffffffff811497b9>] do_mount+0x259/0xa90
[<ffffffff810ead1d>] ? memdup_user+0x3d/0x80
[<ffffffff810eadb3>] ? strndup_user+0x53/0x70
[<ffffffff8114a2c9>] SyS_mount+0x89/0xd0
[<ffffffff815feae2>] system_call_fastpath+0x16/0x1b
This patch adds a recovery function of xattr node pages.
Reported-by: Tom Li <biergaizi@members.fsf.org>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In order to make fs consistency, update_inode_page should not be failed all
the time. Otherwise, it is possible to lose some metadata in the inode like
a link count.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Pull core block IO changes from Jens Axboe:
"The major piece in here is the immutable bio_ve series from Kent, the
rest is fairly minor. It was supposed to go in last round, but
various issues pushed it to this release instead. The pull request
contains:
- Various smaller blk-mq fixes from different folks. Nothing major
here, just minor fixes and cleanups.
- Fix for a memory leak in the error path in the block ioctl code
from Christian Engelmayer.
- Header export fix from CaiZhiyong.
- Finally the immutable biovec changes from Kent Overstreet. This
enables some nice future work on making arbitrarily sized bios
possible, and splitting more efficient. Related fixes to immutable
bio_vecs:
- dm-cache immutable fixup from Mike Snitzer.
- btrfs immutable fixup from Muthu Kumar.
- bio-integrity fix from Nic Bellinger, which is also going to stable"
* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
xtensa: fixup simdisk driver to work with immutable bio_vecs
block/blk-mq-cpu.c: use hotcpu_notifier()
blk-mq: for_each_* macro correctness
block: Fix memory leak in rw_copy_check_uvector() handling
bio-integrity: Fix bio_integrity_verify segment start bug
block: remove unrelated header files and export symbol
blk-mq: uses page->list incorrectly
blk-mq: use __smp_call_function_single directly
btrfs: fix missing increment of bi_remaining
Revert "block: Warn and free bio if bi_end_io is not set"
block: Warn and free bio if bi_end_io is not set
blk-mq: fix initializing request's start time
block: blk-mq: don't export blk_mq_free_queue()
block: blk-mq: make blk_sync_queue support mq
block: blk-mq: support draining mq queue
dm cache: increment bi_remaining when bi_end_io is restored
block: fixup for generic bio chaining
block: Really silence spurious compiler warnings
block: Silence spurious compiler warnings
block: Kill bio_pair_split()
...
Pull vfs updates from Al Viro:
"Assorted stuff; the biggest pile here is Christoph's ACL series. Plus
assorted cleanups and fixes all over the place...
There will be another pile later this week"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
__dentry_path() fixes
vfs: Remove second variable named error in __dentry_path
vfs: Is mounted should be testing mnt_ns for NULL or error.
Fix race when checking i_size on direct i/o read
hfsplus: remove can_set_xattr
nfsd: use get_acl and ->set_acl
fs: remove generic_acl
nfs: use generic posix ACL infrastructure for v3 Posix ACLs
gfs2: use generic posix ACL infrastructure
jfs: use generic posix ACL infrastructure
xfs: use generic posix ACL infrastructure
reiserfs: use generic posix ACL infrastructure
ocfs2: use generic posix ACL infrastructure
jffs2: use generic posix ACL infrastructure
hfsplus: use generic posix ACL infrastructure
f2fs: use generic posix ACL infrastructure
ext2/3/4: use generic posix ACL infrastructure
btrfs: use generic posix ACL infrastructure
fs: make posix_acl_create more useful
fs: make posix_acl_chmod more useful
...
f2fs has some weird mode bit handling, so still using the old
chmod code for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_created to __posix_acl_create and add
a fully featured helper to set up the ACLs on file creation that
uses get_acl().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_chmod to __posix_acl_chmod and add
a fully featured ACL chmod helper that uses the ->set_acl inode
operation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If a node page is trucated, we'd better drop the page in the node_inode's page
cache for better memory footprint.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds NODE_MAPPING which is similar as META_MAPPING introduced by
Gu Zheng.
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
As the orphan_blocks may be max to 504, so it is not security
and rigorous to store such a large array in the kernel stack
as Dan Carpenter said.
In fact, grab_meta_page has locked the page in the page cache,
and we can use find_get_page() to fetch the page safely in the
downstream, so we can remove the page array directly.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce help function META_MAPPING() to get the cache meta blocks'
address space.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If a dentry page is updated, we should call mark_inode_dirty to add the inode
into the dirty list, so that its dentry pages are flushed to the disk.
Otherwise, the inode can be evicted without flush.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Fixed a variety of trivial checkpatch warnings. The only delta should
be some minor formatting on log strings that were split / too long.
Signed-off-by: Chris Fries <cfries@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Doing sync_meta_pages with META_FLUSH when checkpoint, we overide rw
using WRITE_FLUSH_FUA. At this time, we also should set
REQ_META|REQ_PRIO.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch should resolve the following bug.
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.13.0-rc5.f2fs+ #6 Not tainted
---------------------------------------------------------
kswapd0/41 just changed the state of lock:
(&sbi->gc_mutex){+.+.-.}, at: [<ffffffffa030503e>] f2fs_balance_fs+0xae/0xd0 [f2fs]
but this lock took another, RECLAIM_FS-READ-unsafe lock in the past:
(&sbi->cp_rwsem){++++.?}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Chain exists of:
&sbi->gc_mutex --> &sbi->cp_mutex --> &sbi->cp_rwsem
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&sbi->cp_rwsem);
local_irq_disable();
lock(&sbi->gc_mutex);
lock(&sbi->cp_mutex);
<Interrupt>
lock(&sbi->gc_mutex);
*** DEADLOCK ***
This bug is due to the f2fs_balance_fs call in f2fs_write_data_page.
If f2fs_write_data_page is triggered by wbc->for_reclaim via kswapd, it should
not call f2fs_balance_fs which tries to get a mutex grabbed by original syscall
flow.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Support for f2fs-tools/tools/f2stat to monitor
/sys/kernel/debug/f2fs/status
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
With the 2 previous changes, all the long time operations are moved out
of the protection region, so here we can use spinlock rather than mutex
(orphan_inode_mutex) for lower overhead.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Move alloc new orphan node out of lock protection region.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
"boo sync" parameter is never referenced in f2fs_wait_on_page_writeback.
We should remove this parameter.
Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously during SSR and GC, the maximum number of retrials to find a victim
segment was hard-coded by MAX_VICTIM_SEARCH, 4096 by default.
This number makes an effect on IO locality, when SSR mode is activated, which
results in performance fluctuation on some low-end devices.
If max_victim_search = 4, the victim will be searched like below.
("D" represents a dirty segment, and "*" indicates a selected victim segment.)
D1 D2 D3 D4 D5 D6 D7 D8 D9
[ * ]
[ * ]
[ * ]
[ ....]
This patch adds a sysfs entry to control the number dynamically through:
/sys/fs/f2fs/$dev/max_victim_search
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When considering a bunch of data writes with very frequent fsync calls, we
are able to think the following performance regression.
N: Node IO, D: Data IO, IO scheduler: cfq
Issue pending IOs
D1 D2 D3 D4
D1 D2 D3 D4 N1
D2 D3 D4 N1 N2
N1 D3 D4 N2 D1
--> N1 can be selected by cfq becase of the same priority of N and D.
Then D3 and D4 would be delayed, resuling in performance degradation.
So, when processing the fsync call, it'd better give higher priority to data IOs
than node IOs by assigning WRITE and WRITE_SYNC respectively.
This patch improves the random wirte performance with frequent fsync calls by up
to 10%.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Here is a case which could read inline page data not from first page.
1. write inline data
2. lseek to offset 4096
3. read 4096 bytes from offset 4096
(read_inline_data read inline data page to non-first page,
And previously VFS has add this page to page cache)
4. ftruncate offset 8192
5. read 4096 bytes from offset 4096
(we meet this updated page with inline data in cache)
So we should leave this page with inited data and uptodate flag
for this case.
Change log from v1:
o fix a deadlock bug
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Change log from v1:
o reduce unneeded memset in __f2fs_convert_inline_data
>From 58796be2bd2becbe8d52305210fb2a64e7dd80b6 Mon Sep 17 00:00:00 2001
From: Chao Yu <chao2.yu@samsung.com>
Date: Mon, 30 Dec 2013 09:21:33 +0800
Subject: [PATCH] f2fs: avoid to left uninitialized data in page when read
inline data
We left uninitialized data in the tail of page when we read an inline data
page. So let's initialize left part of the page excluding inline data region.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The truncate_partial_nodes puts pages incorrectly in the following two cases.
Note that the value for argc 'depth' can only be 2 or 3.
Please see truncate_inode_blocks() and truncate_partial_nodes().
1) An err is occurred in the first 'for' loop
When err is occurred with depth = 2, pages[0] is invalid, so this page doesn't
need to be put. There is no problem, however, when depth is 3, it doesn't put
the pages correctly where pages[0] is valid and pages[1] is invalid.
In this case, depth is set to 2 (ref to statemnt depth = i + 1), and then
'goto fail'.
In label 'fail', for (i = depth - 3; i >= 0; i--) cannot meet the condition
because i = -1, so pages[0] cann't be put.
2) An err happened in the second 'for' loop
Now we've got pages[0] with depth = 2, or we've got pages[0] and pages[1]
with depth = 3. When an err is detected, we need 'goto fail' to put such
the pages.
When depth is 2, in label 'fail', for (i = depth - 3; i >= 0; i--) cann't
meet the condition because i = -1, so pages[0] cann't be put.
When depth is 3, in label 'fail', for (i = depth - 3; i >= 0; i--) can
only put pages[0], pages[1] also cann't be put.
Note that 'depth' has been changed before first 'goto fail' (ref to statemnt
depth = i + 1), so passing this modified 'depth' to the tracepoint,
trace_f2fs_truncate_partial_nodes, is also incorrect.
Signed-off-by: Shifei Ge <shifei10.ge@samsung.com>
[Jaegeuk Kim: modify the description and fix one bug]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The get_dnode_of_data nullifies inode and node page when error is occurred.
There are two cases that passes inode page into get_dnode_of_data().
1. make_empty_dir()
-> get_new_data_page()
-> f2fs_reserve_block(ipage)
-> get_dnode_of_data()
2. f2fs_convert_inline_data()
-> __f2fs_convert_inline_data()
-> f2fs_reserve_block(ipage)
-> get_dnode_of_data()
This patch adds correct error handling codes when get_dnode_of_data() returns
an error.
At first, f2fs_reserve_block() calls f2fs_put_dnode() whenever reserve_new_block
returns an error.
So, the rule of f2fs_reserve_block() is to nullify inode page when there is any
error internally.
Finally, two callers of f2fs_reserve_block() should call f2fs_put_dnode()
appropriately if they got an error since successful f2fs_reserve_block().
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a inline_data recovery routine with the following policy.
[prev.] [next] of inline_data flag
o o -> recover inline_data
o x -> remove inline_data, and then recover data blocks
x o -> remove inline_data, and then recover inline_data
x x -> recover data blocks
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds the number of inline_data files into the status information.
Note that the number is reset whenever the filesystem is newly mounted.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Change log from v1:
o handle NULL pointer of grab_cache_page_write_begin() pointed by Chao Yu.
This patch refactors f2fs_convert_inline_data to check a couple of conditions
internally for deciding whether it needs to convert inline_data or not.
So, the new f2fs_convert_inline_data initially checks:
1) f2fs_has_inline_data(), and
2) the data size to be changed.
If the inode has inline_data but the size to fill is less than MAX_INLINE_DATA,
then we don't need to convert the inline_data with data allocation.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In f2fs_write_begin(), if f2fs_conver_inline_data() returns an error like
-ENOSPC, f2fs should call f2fs_put_page().
Otherwise, it is remained as a locked page, resulting in the following bug.
[<ffffffff8114657e>] sleep_on_page+0xe/0x20
[<ffffffff81146567>] __lock_page+0x67/0x70
[<ffffffff81157d08>] truncate_inode_pages_range+0x368/0x5d0
[<ffffffff81157ff5>] truncate_inode_pages+0x15/0x20
[<ffffffff8115804b>] truncate_pagecache+0x4b/0x70
[<ffffffff81158082>] truncate_setsize+0x12/0x20
[<ffffffffa02a1842>] f2fs_setattr+0x72/0x270 [f2fs]
[<ffffffff811cdae3>] notify_change+0x213/0x400
[<ffffffff811ab376>] do_truncate+0x66/0xa0
[<ffffffff811ab541>] vfs_truncate+0x191/0x1b0
[<ffffffff811ab5bc>] do_sys_truncate+0x5c/0xa0
[<ffffffff811ab78e>] SyS_truncate+0xe/0x10
[<ffffffff81756052>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In the punch_hole(), let's convert inline_data all the time for simplicity and
to avoid potential deadlock conditions.
It is pretty much not a big deal to do this.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch locates checking the inline_data prior to calling f2fs_lock_op()
in truncate_blocks(), since getting the lock is unnecessary.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Hook inline data read/write, truncate, fallocate, setattr, etc.
Files need meet following 2 requirement to inline:
1) file size is not greater than MAX_INLINE_DATA;
2) file doesn't pre-allocate data blocks by fallocate().
FI_INLINE_DATA will not be set while creating a new regular inode because
most of the files are bigger than ~3.4K. Set FI_INLINE_DATA only when
data is submitted to block layer, ranther than set it while creating a new
inode, this also avoids converting data from inline to normal data block
and vice versa.
While writting inline data to inode block, the first data block should be
released if the file has a block indexed by i_addr[0].
On the other hand, when a file operation is appied to a file with inline
data, we need to test if this file can remain inline by doing this
operation, otherwise it should be convert into normal file by reserving
a new data block, copying inline data to this new block and clear
FI_INLINE_DATA flag. Because reserve a new data block here will make use
of i_addr[0], if we save inline data in i_addr[0..872], then the first
4 bytes would be overwriten. This problem can be avoided simply by
not using i_addr[0] for inline data.
Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Functions to implement inline data read/write, and move inline data to
normal data block when file size exceeds inline data limitation.
Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, we need to calculate the max orphan num when we try to acquire an
orphan inode, but it's a stable value since the super block was inited. So
converting it to a field of f2fs_sb_info and use it directly when needed seems
a better choose.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The f2fs supports 4KB block size. If user requests dwrite with under 4KB data,
it allocates a new 4KB data block.
However, f2fs doesn't add zero data into the untouched data area inside the
newly allocated data block.
This incurs an error during the xfstest #263 test as follow.
263 12s ... [failed, exit status 1] - output mismatch (see 263.out.bad)
--- 263.out 2013-03-09 03:37:15.043967603 +0900
+++ 263.out.bad 2013-12-27 04:20:39.230203114 +0900
@@ -1,3 +1,976 @@
QA output created by 263
fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
-fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
+fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
+truncating to largest ever: 0x12a00
+truncating to largest ever: 0x75400
+fallocating to largest ever: 0x79cbf
...
(Run 'diff -u 263.out 263.out.bad' to see the entire diff)
Ran: 263
Failures: 263
Failed 1 of 1 tests
It turns out that, when the test tries to write 2KB data with dio, the new dio
path allocates 4KB data block without filling zero data inside the remained 2KB
area. Finally, the output file contains a garbage data for that region.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When get_dnode_of_data() in get_data_block() returns a successful dnode, we
should put the dnode.
But, previously, if its data block address is equal to NEW_ADDR, we didn't do
that, resulting in a deadlock condition.
So, this patch splits original error conditions with this case, and then calls
f2fs_put_dnode before finishing the function.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces F2FS_INODE that returns struct f2fs_inode * from the inode
page.
By using this macro, we can remove unnecessary casting codes like below.
struct f2fs_inode *ri = &F2FS_NODE(inode_page)->i;
-> struct f2fs_inode *ri = F2FS_INODE(inode_page);
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In current flow, we will get Null return value of f2fs_find_entry in
recover_dentry when name.len is bigger than F2FS_NAME_LEN, and then we
still add this inode into its dir entry.
To avoid this situation, we must check filename length before we use it.
Another point is that we could remove the code of checking filename length
In f2fs_find_entry, because f2fs_lookup will be called previously to ensure of
validity of filename length.
V2:
o add WARN_ON() as Jaegeuk Kim suggested.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When we rename a dir to new name which is not exist previous,
we will set pino of parent inode with ino of child inode in f2fs_set_link.
It destroy consistency of pino, it should be fixed.
Thanks for previous work of Shu Tan.
Signed-off-by: Shu Tan <shu.tan@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Update several comments:
1. use f2fs_{un}lock_op install of mutex_{un}lock_op.
2. update comment of get_data_block().
3. update description of node offset.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When using the f2fs_io_info in the low level, we still need to merge the
rw and rw_flag, so use the rw to hold all the io flags directly,
and remove the rw_flag field.
ps.It is based on the previous patch:
f2fs: move all the bio initialization into __bio_alloc
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Move all the bio initialization into __bio_alloc, and some minor cleanups are
also added.
v3:
Use 'bool' rather than 'int' as Kim suggested.
v2:
Use 'is_read' rather than 'rw' as Yu Chao suggested.
Remove the needless initialization of bio->bi_private.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch enhances writing dirty meta pages collectively in background.
During the file data writes, it'd better avoid to write small dirty meta pages
frequently.
So let's give a chance to collect a number of dirty meta pages for a while.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, f2fs doesn't support direct IOs with high performance, which throws
every write requests via the buffered write path, resulting in highly
performance degradation due to memory opeations like copy_from_user.
This patch introduces a new direct IO path in which every write requests are
processed by generic blockdev_direct_IO() with enhanced get_block function.
The get_data_block() in f2fs handles:
1. if original data blocks are allocates, then give them to blockdev.
2. otherwise,
a. preallocate requested block addresses
b. do not use extent cache for better performance
c. give the block addresses to blockdev
This policy induces that:
- new allocated data are sequentially written to the disk
- updated data are randomly written to the disk.
- f2fs gives consistency on its file meta, not file data.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces new sysfs entries for users to control the policy of
in-place-updates, namely IPU, in f2fs.
Sometimes f2fs suffers from performance degradation due to its out-of-place
update policy that produces many additional node block writes.
If the storage performance is very dependant on the amount of data writes
instead of IO patterns, we'd better drop this out-of-place update policy.
This patch suggests 5 polcies and their triggering conditions as follows.
[sysfs entry name = ipu_policy]
0: F2FS_IPU_FORCE all the time,
1: F2FS_IPU_SSR if SSR mode is activated,
2: F2FS_IPU_UTIL if FS utilization is over threashold,
3: F2FS_IPU_SSR_UTIL if SSR mode is activated and FS utilization is over
threashold,
4: F2FS_IPU_DISABLE disable IPU. (=default option)
[sysfs entry name = min_ipu_util]
This parameter controls the threshold to trigger in-place-updates.
The number indicates percentage of the filesystem utilization, and used by
F2FS_IPU_UTIL and F2FS_IPU_SSR_UTIL policies.
For more details, see need_inplace_update() in segment.h.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces f2fs_io_info to mitigate the complex parameter list.
struct f2fs_io_info {
enum page_type type; /* contains DATA/NODE/META/META_FLUSH */
int rw; /* contains R/RS/W/WS */
int rw_flag; /* contains REQ_META/REQ_PRIO */
}
1. f2fs_write_data_pages
- DATA
- WRITE_SYNC is set when wbc->WB_SYNC_ALL.
2. sync_node_pages
- NODE
- WRITE_SYNC all the time
3. sync_meta_pages
- META
- WRITE_SYNC all the time
- REQ_META | REQ_PRIO all the time
** f2fs_submit_merged_bio() handles META_FLUSH.
4. ra_nat_pages, ra_sit_pages, ra_sum_pages
- META
- READ_SYNC
Cc: Fan Li <fanofcode.li@samsung.com>
Cc: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously f2fs submits most of write requests using WRITE_SYNC, but f2fs_write_data_pages
submits last write requests by sync_mode flags callers pass.
This causes a performance problem since continuous pages with different sync flags
can't be merged in cfq IO scheduler(thanks yu chao for pointing it out), and synchronous
requests often take more time.
This patch makes the following modifies to DATA writebacks:
1. every page will be written back using the sync mode caller pass.
2. only pages with the same sync mode can be merged in one bio request.
These changes are restricted to DATA pages.Other types of writebacks are modified
To remain synchronous.
In my test with tiotest, f2fs sequence write performance is improved by about 7%-10% ,
and this patch has no obvious impact on other performance tests.
Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds unlikely() macro into the most of codes.
The basic rule is to add that when:
- checking unusual errors,
- checking page mappings,
- and the other unlikely conditions.
Change log from v1:
- Don't add unlikely for the NULL test and error test: advised by Andi Kleen.
Cc: Chao Yu <chao2.yu@samsung.com>
Cc: Andi Kleen <andi@firstfloor.org>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
As we know, some of our branch condition will rarely be true. So we could add
'unlikely' to let compiler optimize these code, by this way we could drop
unneeded 'jump' assemble code to improve performance.
change log:
o add *unlikely* as many as possible across the whole source files at once
suggested by Jaegeuk Kim.
Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In find_fsync_dnodes() and recover_data(), our flow is like this:
->f2fs_submit_page_bio()
-> f2fs_put_page()
-> page_cache_release() ---- page->_count declined to zero.
->__free_pages()
-> put_page_testzero() ---- page->_count will be declined again.
We will get a segment fault in put_page_testzero when CONFIG_DEBUG_VM
is on, or return MM with a bad page with wrong _count num.
So let's just release this page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Use inner macro GFP_F2FS_ZERO to instead of GFP_NOFS | __GFP_ZERO for
simplification of code.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This minor change for the naming conventions of debugfs_root
to avoid any possible conflicts to the other filesystem.
Signed-off-by: Younger Liu <younger.liucn@gmail.com>
Cc: Younger Liu <younger.liucn@gmail.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
[Jaegeuk Kim: change the patch name]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When debugfs_create_file() failed in f2fs_create_root_stats(),
debugfs_root should be remove.
Signed-off-by: Younger Liu <liuyiyang@hisense.com>
Cc: Younger Liu <younger.liucn@gmail.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If cp has no CP_UMOUNT_FLAG, we will read all pages in whole node segment
one by one, it makes low performance. So let's merge contiguous pages and
readahead for better performance.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: adjust the new bio operations]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch integrates redundant bio operations on read and write IOs.
1. Move bio-related codes to the top of data.c.
2. Replace f2fs_submit_bio with f2fs_submit_merged_bio, which handles read
bios additionally.
3. Introduce __submit_merged_bio to submit the merged bio.
4. Change f2fs_readpage to f2fs_submit_page_bio.
5. Introduce f2fs_submit_page_mbio to integrate previous submit_read_page and
submit_write_page.
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com >
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously f2fs allocates its own bi_private data structure all the time even
though we don't use it. But, can we remove this bi_private allocation?
This patch removes such the additional bi_private allocation.
1. Retrieve f2fs_sb_info from its page->mapping->host->i_sb.
- This removes the usecases of bi_private in end_io.
2. Use bi_private only when we really need it.
- The bi_private is used only when the checkpoint procedure is conducted.
- When conducting the checkpoint, f2fs submits a META_FLUSH bio to wait its bio
completion.
- Since we have no dependancies to remove bi_private now, let's just use
bi_private pointer as the completion pointer.
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The recover_orphan_inodes() returns no error all the time, so we don't need to
check its errors.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We should return error if we do not get an updated page in find_date_page
when f2fs_readpage failed.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The inode_page_locked should be a boolean variable.
struct dnode_of_data {
struct inode *inode; /* vfs inode pointer */
struct page *inode_page; /* its inode page, NULL is possible */
struct page *node_page; /* cached direct node page */
nid_t nid; /* node id of the direct node block */
unsigned int ofs_in_node; /* data offset in the node page */
==> bool inode_page_locked; /* inode page is locked or not */
block_t data_blkaddr; /* block address of the node block */
};
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The void *wait in bio_private is used for waiting completion of checkpoint bio.
So we don't need to use its type as void, but declare it as completion type.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Because we will write node summaries when do_checkpoint with umount flag,
our number of max orphan blocks should minus NR_CURSEG_NODE_TYPE additional.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Shu Tan <shu.tan@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Because FALLOC_FL_PUNCH_HOLE flag must be ORed with FALLOC_FL_KEEP_SIZE
in fallocate, so we could remove the useless 'keep size' branch code which
will never be excuted in punch_hole.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Fan Li <fanofcode.li@samsung.com>
[Jaegeuk Kim: remove an unnecessary parameter togather]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch removes the unnecessary condition checks on:
fs/f2fs/gc.c:667 do_garbage_collect() warn: 'sum_page' isn't an ERR_PTR
fs/f2fs/f2fs.h:795 f2fs_put_page() warn: 'page' isn't an ERR_PTR
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch fixes some bit overflows by the shift operations.
Dan Carpenter reported potential bugs on bit overflows as follows.
fs/f2fs/segment.c:910 submit_write_page()
warn: should 'blk_addr << ((sbi)->log_blocksize - 9)' be a 64 bit type?
fs/f2fs/checkpoint.c:429 get_valid_checkpoint()
warn: should '1 << ()' be a 64 bit type?
fs/f2fs/data.c:408 f2fs_readpage()
warn: should 'blk_addr << ((sbi)->log_blocksize - 9)' be a 64 bit type?
fs/f2fs/data.c:457 submit_read_page()
warn: should 'blk_addr << ((sbi)->log_blocksize - 9)' be a 64 bit type?
fs/f2fs/data.c:525 get_data_block_ro()
warn: should 'i << blkbits' be a 64 bit type?
Bug-Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Fix a potential out of range issue introduced by commit:
22fb72225a
f2fs: simplify write_orphan_inodes for better readable
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Add a mount option: inline_data. If the mount option is set,
data of New created small files can be stored in their inode.
Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Add new inode flags F2FS_INLINE_DATA and FI_INLINE_DATA to indicate
whether the inode has inline data.
Inline data makes use of inode block's data indices region to save small
file. Currently there are 923 data indices in an inode block. Since
inline xattr has made use of the last 50 indices to save its data, there
are 873 indices left which can be used for inline data. When
FI_INLINE_DATA is set, the layout of inode block's indices region is
like below:
+-----------------+
| | Reserved. reserve_new_block() will make use of
| i_addr[0] | i_addr[0] when we need to reserve a new data block
| | to convert inline data into regular one's.
|-----------------|
| | Used by inline data. A file whose size is less than
| i_addr[1~872] | 3488 bytes(~3.4k) and doesn't reserve extra
| | blocks by fallocate() can be saved here.
|-----------------|
| |
| i_addr[873~922] | Reserved for inline xattr
| |
+-----------------+
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Let's send REQ_META or REQ_PRIO when reading meta area such as NAT/SIT
etc.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch inserts information of bio types in more detail.
So, we can now see REQ_META and REQ_PRIO too.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Add the function f2fs_reserve_block() to easily reserve new blocks, and
use it to clean up more codes.
Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously we read sit entries page one by one, this method lost the chance
of reading contiguous page together. So we read pages as contiguous as
possible for better mount performance.
change log:
o merge judgements/use 'Continue' or 'Break' instead of 'Goto' as Gu Zheng
suggested.
o add mark_page_accessed() before release page to delay VM reclaiming.
o remove '*order' for simplification of function as Jaegeuk Kim suggested.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: fix a bug on the block address calculation]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a tracepoint for f2fs_submit_read_bio.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: integrate tracepoints of f2fs_submit_read(_write)_bio]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a tracepoint for submit_read_page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: integrate tracepoints of f2fs_submit_read(_write)_page]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
It is not efficient comparing each segment type to find node or data.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: remove unnecessary white spaces]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Change log from v1:
o add mark_page_accessed() not to reclaim the nat pages.
This patch changes the policy of submitting read bios at ra_nat_pages.
Previously, f2fs submits small read bios with block plugging.
But, with this patch, f2fs itself merges read bios first and then submits a
large bio, which can reduce the bio handling overheads.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
For better read performance, we add a new function to support for merging
contiguous read as the one for write.
v1-->v2:
o add declarations here as Gu Zheng suggested.
o use new structure f2fs_bio_info introduced by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Acked-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Simplify write_orphan_inodes for better readable. Because we hold the
orphan_inode_mutex, so it's safe to use list_for_each_entry instead of
list_for_each_safe.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The f2fs has three bio types, NODE, DATA, and META, and manages some data
structures per each bio types.
The codes are a little bit messy, thus, this patch introduces a bio array
which groups individual data structures as follows.
struct f2fs_bio_info {
struct bio *bio; /* bios to merge */
sector_t last_block_in_bio; /* last block number */
struct mutex io_mutex; /* mutex for bio */
};
struct f2fs_sb_info {
...
struct f2fs_bio_info write_io[NR_PAGE_TYPE]; /* for write bios */
...
};
The code changes from this new data structure are trivial.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The f2fs manages an extent cache to search a number of consecutive data blocks
very quickly.
However it conducts unnecessary cache operations if the file is highly
fragmented with no valid extent cache.
In such the case, we don't need to handle the extent cache, but just can disable
the cache facility.
Nevertheless, this patch gives one more chance to enable the extent cache.
For example,
1. create a file
2. write data sequentially which produces a large valid extent cache
3. update some data, resulting in a fragmented extent
4. if the fragmented extent is too small, then drop extent cache
5. close the file
6. open the file again
7. give another chance to make a new extent cache
8. write data sequentially again which creates another big extent cache.
...
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch removes an unnecessary semaphore (i.e., sbi->bio_sem).
There is no reason to use the semaphore when f2fs submits read and write IOs.
Instead, let's use a write mutex and cover the sbi->bio[] by the lock.
Change log from v1:
o split write_mutex suggested by Chao Yu
Chao described,
"All DATA/NODE/META bio buffers in superblock is protected by
'sbi->write_mutex', but each bio buffer area is independent, So we
should split write_mutex to three for DATA/NODE/META."
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We should use f2fs_put_page to release page for uniform style of f2fs code.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Change log from v1:
o fix 32bit drops reported by Dan Carpenter
This patch adds f2fs_issue_discard() to clean up blkdev_issue_discard() flows.
Dan carpenter reported:
"block_t is a 32 bit type and sector_t is a 64 bit type. The upper 32
bits of the sector_t are not used because the shift will wrap."
Bug-Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If frequent small discards are issued to the device, the performance would
be degraded significantly.
So, this patch adds a sysfs entry to control the number of discards to be
issued during a checkpoint procedure.
By default, f2fs does not issue any small discards, which means max_discards
is zero.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds key functions to activate the small discard feature.
Note that this procedure is conducted during the checkpoint only.
In flush_sit_entries(), when a new dirty sit entry is flushed, f2fs calls
add_discard_addrs() which searches candidates to be discarded.
The candidates should be marked *invalidated* and also previous checkpoint
recognizes it as *valid*.
At the end of a checkpoint procedure, f2fs throws discards based on the
discard entry list.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a slab cache entry for small discards.
Each entry consists of:
struct discard_entry {
struct list_head list; /* list head */
block_t blkaddr; /* block address to be discarded */
int len; /* # of consecutive blocks of the discard */
};
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
To find a zero bit using the result of OR operation between ckpt_valid_map
and cur_valid_map is more fast than find a zero bit in each bitmap.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: adjust changed function name]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When f2fs_set_bit is used, in a byte MSB and LSB is reversed,
in that case we can use __find_rev_next_bit or __find_rev_next_zero_bit.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: change the function names]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
With immutable biovecs we don't want code accessing bi_io_vec directly -
the uses this patch changes weren't incorrect since they all own the
bio, but it makes the code harder to audit for no good reason - also,
this will help with multipage bvecs later.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
o Changes from v1
Use find_next(_zero)_bit suggested by jg.kim
When f2fs issues discard command, if segment is contiguous,
let's issue more large segment to gather adjacent segments.
** blktrace **
179,1 0 5859 42.619023770 971 C D 131072 + 2097152 [0]
179,1 0 33665 108.840475468 971 C D 2228224 + 2494464 [0]
179,1 0 33671 109.131616427 971 C D 14909440 + 344064 [0]
179,1 0 33677 109.137100677 971 C D 15261696 + 4096 [0]
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If we failed to init&add kobject when fill_super, stats info and proc object of
f2fs will not be released.
We should free them before we finish fill_super.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
use genernal method supported by kernel
o changes from v1
If any waiter exists at end io, wake up it.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
A NULL point should avoid to be used in destroy_segment_manager after allocating
memory fail for f2fs_sm_info.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In wait_on_node_pages_writeback we will test and clear error flag for all
pages in radix tree, but not necessary.
So we only do this for pages belong to the specified inode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, f2fs_sync_file() waits for all the node blocks to be written.
But, we don't need to do that, but wait only the inode-related node blocks.
This patch adds wait_on_node_pages_writeback() in which waits inode-related
node blocks that are on writeback.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, check_block_count check valid_map with bit data type in common
scenario that sit has all ones or zeros bitmap, it makes low mount performance.
So let's check the special bitmap with integer data type instead of the bit one.
v1-->v2:
o use find_next_{zero_}bit_le for better performance and readable as Jaegeuk
suggested.
o use neat logogram in comment as Gu Zheng suggested.
o search continuous ones or zeros for better performance when checking mixed
bitmap.
Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Shu Tan <shu.tan@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
npages_for_summary_flush uses (SUMMARY_SIZE + 1) as the size of a f2fs_summary
while its actual size is SUMMARY_SIZE. So the result sometimes is bigger than
actual number by one, which causes checkpoint can't be written into disk
contiguously, and sometimes summary blocks can't be compacted like they should.
Besides, when writing summary blocks into pages, if remain space in a page
isn't big enough for one f2fs_summary, it will be left unused, current code
seems not to take it into account.
Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
During xattr updating, free size should be corrected to remainder free size
+ old entry size.
It can avoid ENOSPC error when we update old entry with the same size new
entry at fully filled xattr.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If you want to remove unnecessary BUG_ONs, you can just turn off F2FS_CHECK_FS
in your kernel config.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This config will support an option to remove so many BUG_ONs that degrade
the performance potentially.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The deadlock is found through the following scenario.
sys_mkdir()
-> f2fs_add_link()
-> __f2fs_add_link()
-> init_inode_metadata()
: lock_page(inode);
-> f2fs_init_acl()
-> f2fs_set_acl()
-> f2fs_setxattr(..., NULL)
: This NULL page incurs a deadlock at update_inode_page().
So, likewise f2fs_init_security(), this patch adds a parameter to transfer the
locked inode page to f2fs_setxattr().
Found by Linux File System Verification project (linuxtesting.org).
Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Only one dirty type is set in __locate_dirty_segment and we can know
dirty type of segment. So we don't need to check other dirty types.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, set_page_dirty is called every time after writting one summary info
into compacted summary page,
To avoid redundant set_page_dirty, we only call set_page_dirty before release
page.
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a control method in sysfs to reclaim prefree segments.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch merges some background jobs into this new function.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, f2fs postpones reclaiming prefree segments into free segments
as much as possible.
However, if user writes and deletes a bunch of data without any sync or fsync
calls, some flash storages can suffer from garbage collections.
So, this patch adds the reclaiming codes to f2fs_write_node_pages and background
GC thread.
If there are a lot of prefree segments, let's do checkpoint so that f2fs
submits discard commands for the prefree regions to the flash storage.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce the unfailed version of kmem_cache_alloc named f2fs_kmem_cache_alloc
to hide the retry routine and make the code a bit cleaner.
v2:
Fix the wrong use of 'retry' tag pointed out by Gao feng.
Use more neat code to remove redundant tag suggested by Haicheng Li.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Because one dirty seg can only be mapped to one dirty_type. Otherwise, it's a bug.
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
[Jaegeuk Kim: modify a comment related to this patch]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch enhances the recovery routine not to write any data/node/meta until
its completion.
If any writes are sent to the disk, it could contaminate the written history
that will be used for further recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, do_checkpoint() will call congestion_wait() for waiting the pages
(previous submitted node/meta/data pages) to be written back.
Because congestion_wait() will set a regular period (e.g. HZ / 50 ) for waiting, and
no additional wake up mechanism was introduced if IO ends up before regular period costed.
Yuan Zhong found there is a situation that after the pages have been written back,
but the checkpoint thread still wait for congestion_wait to exit.
So here we store checkpoint task into f2fs_sb when doing checkpoint, it'll wait for IO completes
if there's IO going on, and in the end IO path, wake up checkpoint task when IO ends up.
Thanks to Yuan Zhong's pre work about this problem.
Reported-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce function read_raw_super_block() to hide reading raw super block and
the retry routine if the first sb is invalid.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch removes the logic previously introduced to address the starvation
on cp_rwsem.
One potential there-in bug is that we should cover the wait.list with spin_lock,
but the previous code broke this rule.
And, actually current rwsem handles this starvation issue reasonably, so that we
didn't need to do this before neither.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, there was a erroneous scenario like below.
thread 1: thread 2:
f2fs_unlink
- acquire_orphan_inode
: sbi->n_orphans++ write_checkpoint
- block_operations
: f2fs_lock_all
- do_checkpoint
: write orphan blocks with sbi->n_orphans
- unblock_operations
- f2fs_lock_op
- release_orphan_inode
- f2fs_unlock_op
During the checkpoint by thread 2, f2fs stores a wrong orphan block according
to the wrong sbi->n_orphans.
To avoid this, simply we should make cover acquire_orphan_inode too with
f2fs_lock_op.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
During the f2fs_put_super procedure, we don't need to conduct checkpoint all
the time, since we don't need to do that if superblock is clean.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The current f2fs code errors if the xattr or acl options are passed when
remounting. This is important in a typical scenario where f2fs is mounted
as a "ro" root file-system by the boot loader and then the init process wants
to remount it "rw" with the "remount,rw" option.
Signed-off-by: Kelly Anderson <kelly@xilka.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The fs_locks is used to block other ops(ex, recovery) when doing checkpoint.
And each other operate routine(besides checkpoint) needs to acquire a fs_lock,
there is a terrible problem here, if these are too many concurrency threads acquiring
fs_lock, so that they will block each other and may lead to some performance problem,
but this is not the phenomenon we want to see.
Though there are some optimization patches introduced to enhance the usage of fs_lock,
but the thorough solution is using a *rw_sem* to replace the fs_lock.
Checkpoint routine takes write_sem, and other ops take read_sem, so that we can block
other ops(ex, recovery) when doing checkpoint, and other ops will not disturb each other,
this can avoid the problem described above completely.
Because of the weakness of rw_sem, the above change may introduce a potential problem
that the checkpoint thread might get starved if other threads are intensively locking
the read semaphore for I/O.(Pointed out by Xu Jin)
In order to avoid this, a wait_list is introduced, the appending read semaphore ops
will be dropped into the wait_list if checkpoint thread is waiting for write semaphore,
and will be waked up when checkpoint thread gives up write semaphore.
Thanks to Kim's previous review and test, and will be very glad to see other guys'
performance tests about this patch.
V2:
-fix the potential starvation problem.
-use more suitable func name suggested by Xu Jin.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: adjust minor coding standard]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
During recovery, orphan inodes are deleted via truncate_hole().
These orphans are added by recover_dentry() via f2fs_delete_entry().
However, f2fs_delete_entry() adds them via add_orphan_inode()
without calling acquire_orphan_inode() first. This prevents the
counters from being incremented properly, which causes them to
underflow when remove_orphan_inode() is called later on.
Signed-off-by: Russ Knize <rknize@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
f2fs_initxattrs() is called internally from within F2FS and should
not call functions that are used by VFS handlers. This avoids
certain deadlocks:
- vfs_create()
- f2fs_create() <-- takes an fs_lock
- f2fs_add_link()
- __f2fs_add_link()
- init_inode_metadata()
- f2fs_init_security()
- security_inode_init_security()
- f2fs_initxattrs()
- f2fs_setxattr() <-- also takes an fs_lock
If the caller happens to grab the same fs_lock from the pool in both
places, they will deadlock. There are also deadlocks involving
multiple threads and mutexes:
- f2fs_write_begin()
- f2fs_balance_fs() <-- takes gc_mutex
- f2fs_gc()
- write_checkpoint()
- block_operations()
- mutex_lock_all() <-- blocks trying to grab all fs_locks
- f2fs_mkdir() <-- takes an fs_lock
- __f2fs_add_link()
- f2fs_init_security()
- security_inode_init_security()
- f2fs_initxattrs()
- f2fs_setxattr()
- f2fs_balance_fs() <-- blocks trying to take gc_mutex
Signed-off-by: Russ Knize <Russ.Knize@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Accounting errors from buggy code calling the acquire/release/remove
orphan inode interfaces can cause n_orphans to underflow, which will
then cause acquire_orphan_inode() to return -ENOSPC on the next
operation. This commit guards against that condition.
Signed-off-by: Russ Knize <rknize@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, recover_fsync_data still to write checkpoint when there is
nothing to recover with normal umount image.
It may reduce mount performance and flash memory lifetime, so let's remove
it.
Signed-off-by: Tan Shu <shu.tan@samsung.com>
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch add macro MAX_BIO_BLOCKS to limit value of npages in
f2fs_bio_alloc, it can avoid allocating failure in bio_alloc caused by
npages is larger than BIO_MAX_PAGES.
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Since the MAX_VICTIM_SEARCH has been enlarged from 20 to 4096,
the victim searching overhead will be increased much than before,
especially for SSR that searches victim for use quiet often.
This patch intends to reduce the overhead a little bit by:
- make the get_gc_cost a inline routine to reduce function call
overhead
- reduce multiplication and division operations
- reduce unnecessary comparison operation
Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
There is a performance problem: when all sbi->fs_lock are holded, then
all the following threads may get the same next_lock value from sbi->next_lock_num
in function mutex_lock_op, and wait for the same lock(fs_lock[next_lock]),
it may cause performance reduce.
So we move the sbi->next_lock_num++ before getting lock, this will average the
following threads if all sbi->fs_lock are holded.
v1-->v2:
Drop the needless spin_lock as Jaegeuk suggested.
Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch improves the gc efficiency by optimizing the victim
selection policy. With this optimization, the random re-write
performance could increase up to 20%.
For f2fs, when disk is in shortage of free spaces, gc will selects
dirty segments and moves valid blocks around for making more space
available. The gc cost of a segment is determined by the valid blocks
in the segment. The less the valid blocks, the higher the efficiency.
The ideal victim segment is the one that has the most garbage blocks.
Currently, it searches up to 20 dirty segments for a victim segment.
The selected victim is not likely the best victim for gc when there
are much more dirty segments. Why not searching more dirty segments
for a better victim? The cost of searching dirty segments is
negligible in comparison to moving blocks.
In this patch, it enlarges the MAX_VICTIM_SEARCH to 4096 to make
the search more aggressively for a possible better victim. Since
it also applies to victim selection for SSR, it will likely improve
the SSR efficiency as well.
The test case is simple. It creates as many files until the disk full.
The size for each file is 32KB. Then it writes as many as 100000
records of 4KB size to random offsets of random files in sync mode.
The testing was done on a 2GB partition of a SDHC card. Let's see the
test result of f2fs without and with the patch.
---------------------------------------
2GB partition, SDHC
create 52023 files of size 32768 bytes
random re-write 100000 records of 4KB
---------------------------------------
| file creation (s) | rewrite time (s) | gc count | gc garbage blocks |
[no patch] 341 4227 1174 174840
[patched] 324 2958 645 106682
It's obvious that, with the patch, f2fs finishes the test in 20+% less
time than without the patch. And internally it does much less gc with
higher efficiency than before.
Since the performance improvement is related to gc, it might not be so
obvious for other tests that do not trigger gc as often as this one (
This is because f2fs selects dirty segments for SSR use most of the
time when free space is in shortage). The well-known iozone test tool
was not used for benchmarking the patch becuase it seems do not have
a test case that performs random re-write on a full disk.
This patch is the revised version based on the suggestion from
Jaegeuk Kim.
Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
[Jaegeuk Kim: suggested simpler solution]
Reviewed-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, we experience bio traces as follows when running simple sequential
write test.
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500104928, size = 4K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 499922208, size = 368K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 499914752, size = 140K
-> total 512K
The first one is to write an indirect node block, and the others are to write
direct node blocks.
The reason why there are two separate bios for direct node blocks is:
0. initial state
------------------ ------------------
| | |xxxxxxxx |
------------------ ------------------
1. write 368K
------------------ ------------------
| | |xxxxxxxxWWWWWWWW|
------------------ ------------------
2. write 140K
------------------ ------------------
|WWWWWWW | |xxxxxxxxWWWWWWWW|
------------------ ------------------
This is because f2fs_write_node_pages tries to write just 512K totally, so that
we can lose the chance to merge more bios nicely.
After this patch is applied, we can get the following bio traces.
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500103168, size = 8K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500111368, size = 4K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500107272, size = 512K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500108296, size = 512K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500109320, size = 500K
And finally, we can improve the sequential write performance,
from 458.775 MB/s to 479.945 MB/s on SSD.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The current f2fs uses all the block counts with 32 bit numbers, which is able to
cover about 15TB volume.
But in calculation of utilization, f2fs multiplies the count by 100 which can
induce overflow.
This patch fixes this.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, f2fs conducts SSR when free_sections() < overprovision_sections.
But, even though there are a lot of prefree segments, it can consider SSR only.
So, let's consider the number of prefree segments too for triggering SSR.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The f2fs_set_link updates its parent inode number, so we should sync this to
the inode block.
Otherwise, the data can be lost after sudden-power-off.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
0. modified inode structure
--------------------------------------
metadata (e.g., i_mtime, i_ctime, etc)
--------------------------------------
direct pointers [0 ~ 873]
inline xattrs (200 bytes by default)
indirect pointers [0 ~ 4]
--------------------------------------
node footer
--------------------------------------
1. setxattr flow
- read_all_xattrs copies all the xattrs from inline and xattr node block.
- handle xattr entries
- write_all_xattrs copies modified xattrs into inline and xattr node block.
2. getxattr flow
- read_all_xattrs copies all the xattrs from inline and xattr node block.
- check target entries
3. Usage
# mount -t f2fs -o inline_xattr $DEV $MNT
Once mounted with the inline_xattr option, f2fs marks all the newly created
files to reserve an amount of inline xattr space explicitly inside the inode
block. Without the mount option, f2fs will not touch any existing files and
newly created files as well.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The __find_xattr is to search the wanted xattr entry starting from the
base_addr.
If not found, the returned entry is the last empty xattr entry that can be
allocated newly.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch enables the number of direct pointers inside on-disk inode block to
be changed dynamically according to the size of inline xattr space.
The number of direct pointers, ADDRS_PER_INODE, can be changed only if the file
has inline xattr flag.
The number of direct pointers that will be used by inline xattrs is defined as
F2FS_INLINE_XATTR_ADDRS.
Current patch assigns F2FS_INLINE_XATTR_ADDRS to 0 temporarily.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds basic inode flags for inline xattrs, F2FS_INLINE_XATTR,
and add a mount option, inline_xattr, which is enabled when xattr is set.
If the mount option is enabled, all the files are marked with the inline_xattrs
flag.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Fix to return -ENOMEM in the kset create and add error handling
case instead of 0, as done elsewhere in this function.
Introduced by commit b59d0bae6c.
(f2fs: add sysfs support for controlling the gc_thread)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Namjae Jeon <namjae.jeon@samsung.com>
[Jaegeuk Kim: merge the patch with previous modification]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch removes a false-alaramed BUG_ON.
The previous BUG_ON condition didn't cover the following true scenario.
In f2fs_add_link, 1) get_new_data_page gives an uptodate page successfully,
and then, 2) init_inode_metadata returns -ENOSPC.
At this moment, a new clean data page is remained in the page cache, but its
block address still indicates NEW_ADDR.
After then, even if sync is called, this clean data page cannot be written to
the disk due to the clean state.
So this means that get_lock_data_page should make a new empty page when its
block address is NEW_ADDR and its page is not uptodated.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When any of the caches create fails in init_f2fs_fs(), the other caches which are
create successful should be free.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
An error "label at end of compound statement" will occur if CONFIG_F2FS_STAT_FS
disabled.
fs/f2fs/segment.c:556:1: error: label at end of compound statement
So clean up the 'out' label to fix it.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In f2fs_write_inode, updating inode after f2fs_balance_fs is not
a optimized way in the case that f2fs_gc is performed ahead. The
inode page will be unnecessarily written out twice, one of which
is in f2fs_gc->...->sync_node_pages and the other is in
update_inode_page.
Let's update the inode page in prior to f2fs_balance_fs to avoid
this.
To reproduce it,
$ touch file (before this step, should make the device need f2fs_gc)
$ sync (or wait the bdi to write dirty inode)
Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
alloc_page() returns a NULL on failure, it never returns an ERR_PTR.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, f2fs_setxattr assigns i_xattr_nid in the inode page inconsistently.
The scenario is:
= Thread 1 = = Thread 2 = = fi->i_xattr_nid = = on-disk nid =
f2fs_setxattr 0 0
new_node_page X 0
sync_inode_page X X
checkpoint X X -.
grab_cache_page X X |
--> allocate a new xattr node block or -ENOSPC <----------------'
At this moment, the checkpoint stores inconsistent data where the inode has
i_xattr_nid but actual xattr node block is not allocated yet.
So, we should assign the real i_xattr_nid only after its xattr node block is
allocated.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously xattr node blocks are stored to the COLD_NODE log, which means that
our roll-forward mechanism doesn't recover the xattr node blocks at all.
Only the direct node blocks in the WARM_NODE log can be recovered.
So, let's resolve the issue simply by conducting checkpoint during fsync when a
file has a modified xattr node block.
This approach is able to degrade the performance, but normally the checkpoint
overhead is shown at the initial fsync call after the xattr entry changes.
Once the checkpoint is done, no additional overhead would be occurred.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch fixes the use of XATTR_NODE_OFFSET.
o The offset should not use several MSB bits which are used by marking node
blocks.
o IS_DNODE should handle XATTR_NODE_OFFSET to avoid potential abnormality
during the fsync call.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch should resolve the following error reported by kbuild test robot.
All error/warnings:
In file included from fs/f2fs/dir.c:13:0:
>> fs/f2fs/f2fs.h:435:17: error: field 's_kobj' has incomplete type
struct kobject s_kobj;
The failure was caused by missing the kobject header file in dir.c.
So, this patch move the header file to the right location, f2fs.h.
CC: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch fixes a deadlock bug that occurs quite often when there are
concurrent write and fsync on a same file.
Following is the simplified call trace when tasks get hung.
fsync thread:
- f2fs_sync_file
...
- f2fs_write_data_pages
...
- update_extent_cache
...
- update_inode
- wait_on_page_writeback
bdi writeback thread
- __writeback_single_inode
- f2fs_write_data_pages
- mutex_lock(sbi->writepages)
The deadlock happens when the fsync thread waits on a inode page that has
been added to the f2fs' cached bio sbi->bio[NODE], and unfortunately,
no one else could be able to submit the cached bio to block layer for
writeback. This is because the fsync thread already hold a sbi->fs_lock and
the sbi->writepages lock, causing the bdi thread being blocked when attempt
to write data pages for the same inode. At the same time, f2fs_gc thread
does not notice the situation and could not help. Even the sync syscall
gets blocked.
To fix it, we could submit the cached bio first before waiting on a inode page
that is being written back.
Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
[Jaegeuk Kim: add more cases to use f2fs_wait_on_page_writeback]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This code is being used for nobh_write_end() function.
But since now f2fs_write_end function is added so
there is no need for this code.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Add sysfs entry gc_idle to control the gc policy. Where
gc_idle = 1 corresponds to selecting a cost benefit approach,
while gc_idle = 2 corresponds to selecting a greedy approach
to garbage collection. The selection is mutually exclusive one
approach will work at any point. If gc_idle = 0, then this
option is disabled.
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: change the select_gc_type() flow slightly]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Add sysfs entries to control the timing parameters for
f2fs gc thread.
Various Sysfs options introduced are:
gc_min_sleep_time: Min Sleep time for GC in ms
gc_max_sleep_time: Max Sleep time for GC in ms
gc_no_gc_sleep_time: Default Sleep time for GC in ms
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: fix an umount bug and some minor changes]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This kfree() is no longer needed after a79dc083d7 "f2fs: move
bio_private allocation out of f2fs_bio_alloc()". The "bio->bi_private"
is NULL here so it's a no-op.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch fixes mishandling of the sbi->n_orphans variable.
If users request lots of f2fs_unlink(), check_orphan_space() could be contended.
In such the case, sbi->n_orphans can be read incorrectly so that f2fs_unlink()
would fall into the wrong state which results in the failure of
add_orphan_inode().
So, let's increment sbi->n_orphans virtually prior to the actual orphan inode
stuffs. After that, let's release sbi->n_orphans by calling release_orphan_inode
or remove_orphan_inode.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
bio->bi_private is not always needed. As in the reading data path,
end_read_io does not need bio_private for further using, so moving
bio_private allocation out of f2fs_bio_alloc(). Alloc it in the
submit_write_page(), and ignore it in the f2fs_readpage().
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
As we remove the target single node, so list_for_each is enought, in order to
clean up, we use list_for_each_entry instead.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
For string without format specifiers, using seq_puts()/seq_putc()
instead of seq_printf().
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
As similar as the i_pino fix, i_name also should be fixed when i_nlink is 1.
The errorneous scenario is like this.
1. touch test1
2. link test1 test2
3. unlink test2
4. fsync test1
After this, i_name should be test1.
CC: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The error is reproducible by:
0. mkfs.f2fs /dev/sdb1 & mount
1. touch test1
2. touch test2
3. mv test1 test2
4. umount
5. dumpt.f2fs -i 4 /dev/sdb1
After this, when we retrieve the inode->i_name of test2 by dump.f2fs, we get
test1 instead of test2.
This is because f2fs didn't update the file name during the f2fs_rename.
So, this patch fixes that.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce help function F2FS_NODE() to simplify the conversion of node_page to
f2fs_node.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Add a help func F2FS_STAT() to get the f2fs_stat_info.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In order to support SQLite that uses fdatasync instead of fsync, we should
guarantee the data requested by fdatasync can be recovered after sudden-power-
off.
So, let's remove the fdatasync condition in f2fs_sync_file.
Otherwise, we can restore the data after sudden-power-off due to nonexistence
of any fsync mark'ed node blocks.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>