This patch lets ext4_init_inode_table() handle errors right.
ext4_init_inode_table() should down_write() alloc_sem which
has been up_write()ed and stop the started journal handle.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We added some more error handling in b40971426a "ext4: add error
checking to calls to ext4_handle_dirty_metadata()". But we need to
call kfree() as well to avoid a memory leak.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
As requested by Al Viro, since umode_t may be changing to a u32 for
some architectures.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Fix problems if fsync() races against a rename of a parent directory
as pointed out by Al Viro in his own inimitable way:
>While we are at it, could somebody please explain what the hell is ext4
>doing in
>static int ext4_sync_parent(struct inode *inode)
>{
> struct writeback_control wbc;
> struct dentry *dentry = NULL;
> int ret = 0;
>
> while (inode && ext4_test_inode_state(inode, EXT4_STATE_NEWENTRY)) {
> ext4_clear_inode_state(inode, EXT4_STATE_NEWENTRY);
> dentry = list_entry(inode->i_dentry.next,
> struct dentry, d_alias);
> if (!dentry || !dentry->d_parent || !dentry->d_parent->d_inode)
> break;
> inode = dentry->d_parent->d_inode;
> ret = sync_mapping_buffers(inode->i_mapping);
> ...
>Note that dentry obviously can't be NULL there. dentry->d_parent is never
>NULL. And dentry->d_parent would better not be negative, for crying out
>loud! What's worse, there's no guarantees that dentry->d_parent will
>remain our parent over that sync_mapping_buffers() *and* that inode won't
>just be freed under us (after rename() and memory pressure leading to
>eviction of what used to be our dentry->d_parent)......
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The logical block number in map.l_blk is a __u32, and so before we
shift it left, by the block size, we neeed cast it to a 64-bit size.
Otherwise i_size can be corrupted on an ENOSPC.
# df -T /mnt/mp1
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/sda6 ext4 9843276 153056 9190200 2% /mnt/mp1
# fallocate -o 0 -l 2199023251456 /mnt/mp1/testfile
fallocate: /mnt/mp1/testfile: fallocate failed: No space left on device
# stat /mnt/mp1/testfile
File: `/mnt/mp1/testfile'
Size: 4293656576 Blocks: 19380440 IO Block: 4096 regular file
Device: 806h/2054d Inode: 12 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2011-07-25 13:01:31.414490496 +0900
Modify: 2011-07-25 13:01:31.414490496 +0900
Change: 2011-07-25 13:01:31.454490495 +0900
Signed-off-by: Utako Kusaka <u-kusaka@wm.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
--
fs/ext4/extents.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
The old function ext4_ext_rm_idx is used only for truncate case
because it just remove last index in extent-index-block. When punching
hole, it usually needed to remove "middle" index, therefore we must
move indexes which after it forward.
(I create a file with 1 depth extent tree and punch hole in the middle
of it, the last index in index-block strangly gone, so I find out this
bug)
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The reserve_backup_gdb() function only needs the block group number;
there's no need to pass a pointer to struct ext4_new_group_data to it.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
add_new_gdb() only needs the block group number; there is no need to
pass a pointer to struct ext4_new_group_data to add_new_gdb().
Instead of filling in a pointer the struct buffer_head in
add_new_gdb(), it's simpler to have the caller fetch it from the
s_group_desc[] array.
[Fixed error path to handle the case where struct buffer_head *primary
hasn't been set yet. -- Ted]
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There is no need to lock the buffers since no one else should be
touching these buffers besides the file system.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch simplifies journal handling in setup_new_group_blocks().
In previous code, block bitmap is modified everywhere in
setup_new_group_blocks(), ext4_get_write_access() in
extend_or_restart_transaction() is used to guarantee that the block
bitmap stays in the new handle, this makes things complicated.
The previous commit changed things so that the modifications on the
block bitmap are batched and done by ext4_set_bits() at the end of the
for loop. This allows us to simplify things.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Rename mb_set_bits() to ext4_set_bits() and make it a global function
so that setup_new_group_blocks() can use it.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If ext4_group_add_blocks() is called with 0 block, make it return 0
without doing any extra work.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch lets ext4_group_add_blocks() return an error code if it
fails, so that upper functions can handle error correctly.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
A filesystem with errors is not allowed to being resized, otherwise,
it is easy to destroy the filesystem.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Before this patch, parallel resizers are allowed and protected by a
mutex lock, actually, there is no need to support parallel resizer, so
this patch prevents parallel resizers by atmoic bit ops, like
lock_page() and unlock_page() do.
To do this, the patch removed the mutex lock s_resize_lock from struct
ext4_sb_info and added a unsigned long field named s_resize_flags
which inidicates if there is a resizer.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When journalling data for an inode (either because it is a symlink or
because the filesystem is mounted in data=journal mode), ext4_evict_inode()
can discard unwritten data by calling truncate_inode_pages(). This is
because we don't mark the buffer / page dirty when journalling data but only
add the buffer to the running transaction and thus mm does not know there
are still unwritten data.
Fix the problem by carefully tracking transaction containing inode's data,
committing this transaction, and writing uncheckpointed buffers when inode
should be reaped.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The debug message in ext4_ext_insert_extent before moving extent
is incorrect (the "from xx to xx").
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The argument "inode" in function ext4_ext_next_allocated_block looks useless,
so clean it.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ac_repeats isn't referenced in the mballoc code. So remove it.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4_mb_release, we use s_mb_buddies_generated++. Although
the output is OK, but I don't think we need this extra ++.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_mb_load_buddy() calls ext4_get_group_info() for setting both
"grp" and "e4b->bd_info", but it could do "e4b->bd_info = grp".
Reported-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If eh_entries is equal to (or greater than) eh_max, the operation of
inserting new extent_idx will make number of entries overflow.
So check eh_entries before inserting the new extent_idx.
Although there is no bug case according the code (function
ext4_ext_insert_index is called by ext4_ext_split and ext4_ext_split
is called only if the index block has free space), the right logic
should be "lookup the capacity before insertion".
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch avoids an extraneous lookup of the extent cache
in ext4_ext_map_blocks() when the flag
EXT4_GET_BLOCKS_PUNCH_OUT_EXT is absent.
The existing logic was performing the lookup but not making
use of the result. The patch simply reverses the order of evaluation
in the condition.
Since ext4_ext_in_cache() does not initialize newex on misses, bypassing
its invocation does not introduce any new issue in this regard.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Eric Gouriou <egouriou@google.com>
This patch removes the extra parameter in ext4_ext_remove_space()
which is no longer needed.
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch optimizes the punch hole operation by skipping the
tree walking code that is used by truncate. Since punch hole
is done through map blocks, the path to the extent is already
known in this function, so we do not need to look it up again.
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If the stripe width was set to 1, then this patch will ignore
that stripe width and ext4 will act as if the stripe width
were 0 with respect to optimizing allocations.
Signed-off-by: Dan Ehrenberg <dehrenberg@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Previously, if a stripe width was provided, then it would be used
as the preallocation granularity, with no santiy checking and no
way to override this. Now, mb_prealloc_size defaults to the smallest
multiple of stripe size that is greater than or equal to the old
default mb_prealloc_size, and this can be overridden with the sysfs
interface.
Signed-off-by: Dan Ehrenberg <dehrenberg@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Compilation of ext4/namei.c brought up an error and warning messages
when compiled with -DDX_DEBUG
Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The comment from Al Viro about possible race in the ext4_orphan_add() is
not justified. There is no race possible as we always have either i_mutex
locked, or the inode can not be referenced from outside hence the
J_ASSERS should not be hit from the reason described in comment.
This commit replaces it with notion that we are holding i_mutex so it
should not be possible for i_nlink to be changed while waiting for
s_orphan_lock.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If we meet with an error in ext4_mb_add_groupinfo, we kfree
sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)], but fail to
reset it to NULL. So the caller ext4_mb_init_backend will try to kfree
it again and causes a double free. So fix it by resetting it to NULL.
Some typo in comments of mballoc.c are also changed.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4_groupinfo_create_slab, we create ext4_groupinfo_caches within
ext4_grpinfo_slab_create_mutex, but set it outside the lock, and there
does exist some case that we may create it twice and causes a memory
leak. So set it before we call mutex_unlock.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Optimize ext4_ext_insert_extent() by avoiding
ext4_ext_next_leaf_block() when the result is not used/needed.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If eh->eh_entries is smaller than eh->eh_max, the routine will
go to the "repeat" and then go to "has_space" directlly ,
since argument "depth" and "eh" are not even changed.
Therefore, goto "has_space" directly and remove redundant "repeat" tag.
Signed-off-by: Robin Dong <sanbai@taobao.com>
at ext4_trim_all_free() comment, there is no longer an @e4b parameter,
instead it is @group.
Reported-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4, when FITRIM is called every time, we iterate all the
groups and do trim one by one. It is a bit time wasting if the
group has been trimmed and there is no change since the last
trim.
So this patch adds a new flag in ext4_group_info->bb_state to
indicate that the group has been trimmed, and it will be cleared
if some blocks is freed(in release_blocks_on_commit). Another
trim_minlen is added in ext4_sb_info to record the last minlen
we use to trim the volume, so that if the caller provide a small
one, we will go on the trim regardless of the bb_state.
A simple test with my intel x25m ssd:
df -h shows:
/dev/sdb1 40G 21G 17G 56% /mnt/ext4
Block size: 4096
run the FITRIM with the following parameter:
range.start = 0;
range.len = UINT64_MAX;
range.minlen = 1048576;
without the patch:
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m5.505s
user 0m0.000s
sys 0m1.224s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m5.359s
user 0m0.000s
sys 0m1.178s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m5.228s
user 0m0.000s
sys 0m1.151s
with the patch:
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m5.625s
user 0m0.000s
sys 0m1.269s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m0.002s
user 0m0.000s
sys 0m0.001s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m0.002s
user 0m0.000s
sys 0m0.001s
A big improvement for the 2nd and 3rd run.
Even after I delete some big image files, it is still much
faster than iterating the whole disk.
[root@boyu-tm test]# time ./ftrim /mnt/ext4/a
real 0m1.217s
user 0m0.000s
sys 0m0.196s
Cc: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When we trim some free blocks in a group of ext4, we need to
calculate the free blocks properly and check whether there are
enough freed blocks left for us to trim. Current solution will
only calculate free spaces if they are large for a trim which
isn't appropriate.
Let us see a small example:
a group has 1.5M free which are 300k, 300k, 300k, 300k, 300k.
And minblocks is 1M. With current solution, we have to iterate
the whole group since these 300k will never be subtracted from
1.5M. But actually we should exit after we find the first 2
free spaces since the left 3 chunks only sum up to 900K if we
subtract the first 600K although they can't be trimed.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In 0f0a25b, we adjust 'len' with s_first_data_block - start, but
it could underflow in case blocksize=1K, fstrim_range.len=512 and
fstrim_range.start = 0. In this case, when we run the code:
len -= first_data_blk - start; len will be underflow to -1ULL.
In the end, although we are safe that last_group check later will limit
the trim to the whole volume, but that isn't what the user really want.
So this patch fix it. It also adds the check for 'start' like ext3 so that
we can break immediately if the start is invalid.
Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Using function calls in TP_printk causes perf heartburn, so print the
MAJOR/MINOR device numbers instead.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Upon corrupted inode or disk failures, we may fail after we already
allocate some blocks from the inode or take some blocks from the
inode's preallocation list, but before we successfully insert the
corresponding extent to the extent tree. In this case, we should free
any allocated blocks and discard the inode's preallocated blocks
because the entries in the inode's preallocation list may be in an
inconsistent state.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
The current implementation of ext4_free_blocks() always calls
dquot_free_block This looks quite sensible in the most cases: blocks
to be freed are associated with inode and were accounted in quota and
i_blocks some time ago.
However, there is a case when blocks to free were not accounted by the
time calling ext4_free_blocks() yet:
1. delalloc is on, write_begin pre-allocated some space in quota
2. write-back happens, ext4 allocates some blocks in ext4_ext_map_blocks()
3. then ext4_ext_map_blocks() gets an error (e.g. ENOSPC) from
ext4_ext_insert_extent() and calls ext4_free_blocks().
In this scenario, ext4_free_blocks() calls dquot_free_block() who, in
turn, decrements i_blocks for blocks which were not accounted yet (due
to delalloc) After clean umount, e2fsck reports something like:
> Inode 21, i_blocks is 5080, should be 5128. Fix<y>?
because i_blocks was erroneously decremented as explained above.
The patch fixes the problem by passing the new flag
EXT4_FREE_BLOCKS_NO_QUOT_UPDATE to ext4_free_blocks(), to request
that the dquot_free_block() call be skipped.
Signed-off-by: Maxim Patlasov <maxim.patlasov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
These days, bio_alloc() is guaranteed to never fail (as long as nvecs
is less than BIO_MAX_PAGES), so we don't need the loop around the
struct bio allocation.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
I found that ext4_ext_find_goal() and ext4_find_near()
share the same code for returning a coloured start block
based on i_block_group.
We can refactor this into a common function so that they
don't diverge in the future.
Thanks to adilger for suggesting the new function name.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch moves functions from inode.c to indirect.c.
The moved functions are ext4_ind_* functions and their helpers.
Functions called from inode.c are declared extern.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Move two functions that will be needed by the indirect functions to be
moved to indirect.c as well as inode.c to truncate.h as inline
functions, so that we can avoid having duplicate copies of the
function (which can be a maintenance problem) without having to expose
them as globally functions.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>